diff -Nru sklearn-pandas-1.8.0/debian/changelog sklearn-pandas-2.0.3/debian/changelog
--- sklearn-pandas-1.8.0/debian/changelog	2019-06-23 12:57:43.000000000 +0000
+++ sklearn-pandas-2.0.3/debian/changelog	2020-11-27 16:39:49.000000000 +0000
@@ -1,3 +1,15 @@
+sklearn-pandas (2.0.3-1) unstable; urgency=medium
+
+  * New upstream release
+
+ -- Federico Ceratto  Fri, 27 Nov 2020 16:39:49 +0000
+
+sklearn-pandas (1.8.0-2) unstable; urgency=medium
+
+  * New build
+
+ -- Federico Ceratto  Sat, 09 May 2020 11:10:48 +0100
+
 sklearn-pandas (1.8.0-1) unstable; urgency=medium
 
   [ Federico Ceratto ]
diff -Nru sklearn-pandas-1.8.0/debian/control sklearn-pandas-2.0.3/debian/control
--- sklearn-pandas-1.8.0/debian/control	2019-06-23 12:57:43.000000000 +0000
+++ sklearn-pandas-2.0.3/debian/control	2020-11-27 16:39:49.000000000 +0000
@@ -3,7 +3,7 @@
 Uploaders: Federico Ceratto
 Section: python
 Priority: optional
-Build-Depends: debhelper-compat (= 12)
+Build-Depends: debhelper-compat (= 13)
 Build-Depends-Indep: dh-python,
  python3-all,
@@ -13,10 +13,11 @@
  python3-numpy,
  python3-pandas,
  python3-sklearn,
-Standards-Version: 4.3.1
+Standards-Version: 4.5.0
 Homepage: https://github.com/paulgb/sklearn-pandas
 Vcs-Git: https://salsa.debian.org/debian/sklearn-pandas.git
 Vcs-Browser: https://salsa.debian.org/debian/sklearn-pandas
+Rules-Requires-Root: no
 
 Package: python3-sklearn-pandas
 Architecture: all
diff -Nru sklearn-pandas-1.8.0/debian/gitlab-ci.yml sklearn-pandas-2.0.3/debian/gitlab-ci.yml
--- sklearn-pandas-1.8.0/debian/gitlab-ci.yml	2019-06-23 12:57:43.000000000 +0000
+++ sklearn-pandas-2.0.3/debian/gitlab-ci.yml	2020-11-27 16:39:49.000000000 +0000
@@ -1,10 +1,4 @@
-image: registry.gitlab.com/eighthave/ci-image-git-buildpackage:latest
-
-pages:
-  stage: deploy
-  artifacts:
-    paths:
-    - "*.deb"
-  script:
-    - gitlab-ci-git-buildpackage-all
-    - gitlab-ci-aptly
+---
+include:
+  - https://salsa.debian.org/salsa-ci-team/pipeline/raw/master/salsa-ci.yml
+  - https://salsa.debian.org/salsa-ci-team/pipeline/raw/master/pipeline-jobs.yml
diff -Nru sklearn-pandas-1.8.0/debian/upstream/metadata sklearn-pandas-2.0.3/debian/upstream/metadata
--- sklearn-pandas-1.8.0/debian/upstream/metadata	1970-01-01 00:00:00.000000000 +0000
+++ sklearn-pandas-2.0.3/debian/upstream/metadata	2020-11-27 16:39:49.000000000 +0000
@@ -0,0 +1,3 @@
+---
+Bug-Database: https://github.com/scikit-learn-contrib/sklearn-pandas/issues
+Bug-Submit: https://github.com/scikit-learn-contrib/sklearn-pandas/issues/new
diff -Nru sklearn-pandas-1.8.0/debian/watch sklearn-pandas-2.0.3/debian/watch
--- sklearn-pandas-1.8.0/debian/watch	2019-06-23 12:57:43.000000000 +0000
+++ sklearn-pandas-2.0.3/debian/watch	2020-11-27 16:39:49.000000000 +0000
@@ -1,4 +1,2 @@
-# please also check https://pypi.debian.net/sklearn-pandas/watch
-version=3
-opts=uversionmangle=s/(rc|a|b|c)/~$1/ \
-https://pypi.debian.net/sklearn-pandas/sklearn-pandas-(.+)\.(?:zip|tgz|tbz|txz|(?:tar\.(?:gz|bz2|xz)))
\ No newline at end of file
+version=4
+opts=uversionmangle=s/(rc|a|b|c)/~$1/ https://pypi.debian.net/sklearn-pandas sklearn-pandas-(.+)\.(?:zip|tgz|tbz|txz|(?:tar\.(?:gz|bz2|xz)))
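The ``uversionmangle`` rule in the new watch file rewrites upstream pre-release markers (``rc``, ``a``, ``b``, ``c``) into ``~``-prefixed suffixes, which Debian's version ordering sorts before the final release. A rough Python equivalent of that Perl substitution (``count=1`` mirrors ``s///`` without ``/g``; the version string is illustrative):

    import re

    # '2.0.3rc1' -> '2.0.3~rc1'; in dpkg ordering, 2.0.3~rc1 sorts before 2.0.3.
    print(re.sub(r'(rc|a|b|c)', r'~\1', '2.0.3rc1', count=1))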
diff -Nru sklearn-pandas-1.8.0/PKG-INFO sklearn-pandas-2.0.3/PKG-INFO
--- sklearn-pandas-1.8.0/PKG-INFO	2018-12-01 19:14:57.000000000 +0000
+++ sklearn-pandas-2.0.3/PKG-INFO	2020-11-27 13:01:14.000000000 +0000
@@ -1,12 +1,11 @@
-Metadata-Version: 1.0
+Metadata-Version: 1.2
 Name: sklearn-pandas
-Version: 1.8.0
+Version: 2.0.3
 Summary: Pandas integration with sklearn
-Home-page: https://github.com/paulgb/sklearn-pandas
-Author: Israel Saeta Pérez
-Author-email: israel.saeta@dukebody.com
+Home-page: https://github.com/scikit-learn-contrib/sklearn-pandas
+Maintainer: Ritesh Agrawal
+Maintainer-email: ragrawal@gmail.com
 License: UNKNOWN
-Description-Content-Type: UNKNOWN
 Description: UNKNOWN
 Keywords: scikit,sklearn,pandas
 Platform: UNKNOWN
diff -Nru sklearn-pandas-1.8.0/README.rst sklearn-pandas-2.0.3/README.rst
--- sklearn-pandas-1.8.0/README.rst	2018-12-01 19:13:37.000000000 +0000
+++ sklearn-pandas-2.0.3/README.rst	2020-11-27 13:01:14.000000000 +0000
@@ -2,16 +2,11 @@
 Sklearn-pandas
 ==============
 
-.. image:: https://circleci.com/gh/pandas-dev/sklearn-pandas.svg?style=svg
-   :target: https://circleci.com/gh/pandas-dev/sklearn-pandas
+.. image:: https://circleci.com/gh/scikit-learn-contrib/sklearn-pandas.svg?style=svg
+   :target: https://circleci.com/gh/scikit-learn-contrib/sklearn-pandas
 
 This module provides a bridge between `Scikit-Learn `__'s machine learning methods and `pandas `__-style Data Frames.
-
-In particular, it provides:
-
-1. A way to map ``DataFrame`` columns to transformations, which are later recombined into features.
-2. A compatibility shim for old ``scikit-learn`` versions to cross-validate a pipeline that takes a pandas ``DataFrame`` as input. This is only needed for ``scikit-learn<0.16.0`` (see `#11 `__ for details). It is deprecated and will likely be dropped in ``skearn-pandas==2.0``.
-3. A couple of special transformers that work well with pandas inputs: ``CategoricalImputer`` and ``FunctionTransformer``.
+In particular, it provides a way to map ``DataFrame`` columns to transformations, which are later recombined into features.
 
 Installation
 ------------
@@ -20,6 +15,7 @@
 
     # pip install sklearn-pandas
 
+
 Tests
 -----
@@ -36,11 +32,11 @@
 Import what you need from the ``sklearn_pandas`` package. The choices are:
 
 * ``DataFrameMapper``, a class for mapping pandas data frame columns to different sklearn transformations
-* ``cross_val_score``, similar to ``sklearn.cross_validation.cross_val_score`` but working on pandas DataFrames
+
 
 For this demonstration, we will import both::
 
-    >>> from sklearn_pandas import DataFrameMapper, cross_val_score
+    >>> from sklearn_pandas import DataFrameMapper
 
 For these examples, we'll also use pandas, numpy, and sklearn::
@@ -136,6 +132,16 @@
     >>> mapper_alias.transformed_names_
     ['children_scaled']
 
+Alternatively, you can also specify a prefix and/or suffix to add to the column name. For example::
+
+
+    >>> mapper_alias = DataFrameMapper([
+    ...     (['children'], sklearn.preprocessing.StandardScaler(), {'prefix': 'standard_scaled_'}),
+    ...     (['children'], sklearn.preprocessing.StandardScaler(), {'suffix': '_raw'})
+    ... ])
+    >>> _ = mapper_alias.fit_transform(data.copy())
+    >>> mapper_alias.transformed_names_
+    ['standard_scaled_children', 'children_raw']
 
 Passing Series/DataFrames to the transformers
 *********************************************
@@ -204,6 +210,32 @@
 
 Note this does not work together with the ``default=True`` or ``sparse=True`` arguments to the mapper.
 
+Dropping columns explicitly
+***************************
+
+Sometimes it is required to drop a specific column or a list of columns.
+For this purpose, the ``drop_cols`` argument of ``DataFrameMapper`` can be used.
+Its default value is ``None``.
+
+    >>> mapper_df = DataFrameMapper([
+    ...     ('pet', sklearn.preprocessing.LabelBinarizer()),
+    ...     (['children'], sklearn.preprocessing.StandardScaler())
+    ... ], drop_cols=['salary'])
+
+Now running ``fit_transform`` will run transformations on 'pet' and 'children' and drop the 'salary' column:
+
+    >>> np.round(mapper_df.fit_transform(data.copy()), 1)
+    array([[ 1. ,  0. ,  0. ,  0.2],
+           [ 0. ,  1. ,  0. ,  1.9],
+           [ 0. ,  1. ,  0. , -0.6],
+           [ 0. ,  0. ,  1. , -0.6],
+           [ 1. ,  0. ,  0. , -1.5],
+           [ 0. ,  1. ,  0. , -0.6],
+           [ 1. ,  0. ,  0. ,  1. ],
+           [ 0. ,  0. ,  1. ,  0.2]])
+
+Transformations may require multiple input columns. In these cases, the columns can be specified in a list, as the next section shows.
 
 Transform Multiple Columns
 **************************
@@ -231,8 +263,9 @@
 
 Multiple transformers can be applied to the same column specifying them in a list::
 
+    >>> from sklearn.impute import SimpleImputer
     >>> mapper3 = DataFrameMapper([
-    ...     (['age'], [sklearn.preprocessing.Imputer(),
+    ...     (['age'], [SimpleImputer(),
     ...                sklearn.preprocessing.StandardScaler()])])
     >>> data_3 = pd.DataFrame({'age': [1, np.nan, 3]})
     >>> mapper3.fit_transform(data_3)
@@ -302,7 +335,7 @@
     ...     classes=[sklearn.preprocessing.LabelEncoder]
     ... )
     >>> feature_def
-    [('col1', [LabelEncoder()]), ('col2', [LabelEncoder()]), ('col3', [LabelEncoder()])]
+    [('col1', [LabelEncoder()], {}), ('col2', [LabelEncoder()], {}), ('col3', [LabelEncoder()], {})]
     >>> mapper5 = DataFrameMapper(feature_def)
     >>> data5 = pd.DataFrame({
     ...     'col1': ['yes', 'no', 'yes'],
@@ -318,23 +351,42 @@
 transformer parameters should be provided. For example, consider a dataset with missing values. Then the following code could be used to override default imputing strategy:
 
+    >>> from sklearn.impute import SimpleImputer
+    >>> import numpy as np
     >>> feature_def = gen_features(
     ...     columns=[['col1'], ['col2'], ['col3']],
-    ...     classes=[{'class': sklearn.preprocessing.Imputer, 'strategy': 'most_frequent'}]
+    ...     classes=[{'class': SimpleImputer, 'strategy': 'most_frequent'}]
     ... )
     >>> mapper6 = DataFrameMapper(feature_def)
     >>> data6 = pd.DataFrame({
-    ...     'col1': [None, 1, 1, 2, 3],
-    ...     'col2': [True, False, None, None, True],
-    ...     'col3': [0, 0, 0, None, None]
+    ...     'col1': [np.nan, 1, 1, 2, 3],
+    ...     'col2': [True, False, np.nan, np.nan, True],
+    ...     'col3': [0, 0, 0, np.nan, np.nan]
     ... })
     >>> mapper6.fit_transform(data6)
-    array([[1., 1., 0.],
-           [1., 0., 0.],
-           [1., 1., 0.],
-           [2., 1., 0.],
-           [3., 1., 0.]])
+    array([[1.0, True, 0.0],
+           [1.0, False, 0.0],
+           [1.0, True, 0.0],
+           [2.0, True, 0.0],
+           [3.0, True, 0.0]], dtype=object)
 
+You can also specify a global prefix or suffix for the generated transformed column names using the ``prefix`` and ``suffix``
+parameters::
+
+    >>> feature_def = gen_features(
+    ...     columns=['col1', 'col2', 'col3'],
+    ...     classes=[sklearn.preprocessing.LabelEncoder],
+    ...     prefix="lblencoder_"
+    ... )
+    >>> mapper5 = DataFrameMapper(feature_def)
+    >>> data5 = pd.DataFrame({
+    ...     'col1': ['yes', 'no', 'yes'],
+    ...     'col2': [True, False, False],
+    ...     'col3': ['one', 'two', 'three']
+    ... })
+    >>> _ = mapper5.fit_transform(data5)
+    >>> mapper5.transformed_names_
+    ['lblencoder_col1', 'lblencoder_col2', 'lblencoder_col3']
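For illustration, a minimal sketch of what the new ``gen_features`` returns when a parameter dictionary and a ``prefix`` are combined: each definition is now a 3-tuple whose last element carries the per-column options (the column name and prefix here are hypothetical):

    from sklearn.impute import SimpleImputer
    from sklearn_pandas import gen_features

    feature_def = gen_features(
        columns=[['col1']],
        classes=[{'class': SimpleImputer, 'strategy': 'most_frequent'}],
        prefix='imputed_',
    )
    # A list of (columns, transformers, options) 3-tuples, roughly:
    # [(['col1'], [SimpleImputer(strategy='most_frequent')], {'prefix': 'imputed_'})]
    print(feature_def)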
 
 Feature selection and other supervised transformations
 *******************************************************
 
@@ -356,7 +408,8 @@
 
 Working with sparse features
 ****************************
 
-A ``DataFrameMapper`` will return a dense feature array by default. Setting ``sparse=True`` in the mapper will return a sparse array whenever any of the extracted features is sparse. Example:
+A ``DataFrameMapper`` will return a dense feature array by default. Setting ``sparse=True`` in the mapper will return
+a sparse array whenever any of the extracted features is sparse. Example:
 
     >>> mapper5 = DataFrameMapper([
     ...     ('pet', CountVectorizer()),
     ... ], sparse=True)
@@ -366,87 +419,105 @@
 
 The stacking of the sparse features is done without ever densifying them.
 
-Cross-Validation
-****************
-Now that we can combine features from pandas DataFrames, we may want to use cross-validation to see whether our model works. ``scikit-learn<0.16.0`` provided features for cross-validation, but they expect numpy data structures and won't work with ``DataFrameMapper``.
+Using ``NumericalTransformer``
+***********************************
 
-To get around this, sklearn-pandas provides a wrapper on sklearn's ``cross_val_score`` function which passes a pandas DataFrame to the estimator rather than a numpy array::
+While you can use ``FunctionTransformer`` to generate arbitrary transformers, it can present serialization issues
+when pickling. Use ``NumericalTransformer`` instead, which takes the function name as a string parameter and hence
+can be easily serialized.
 
-    >>> pipe = sklearn.pipeline.Pipeline([
-    ...     ('featurize', mapper),
-    ...     ('lm', sklearn.linear_model.LinearRegression())])
-    >>> np.round(cross_val_score(pipe, X=data.copy(), y=data.salary, scoring='r2'), 2)
-    array([ -1.09,  -5.3 , -15.38])
+    >>> from sklearn_pandas import NumericalTransformer
+    >>> mapper5 = DataFrameMapper([
+    ...     ('children', NumericalTransformer('log')),
+    ... ])
+    >>> mapper5.fit_transform(data)
+    array([[1.38629436],
+           [1.79175947],
+           [1.09861229],
+           [1.09861229],
+           [0.69314718],
+           [1.09861229],
+           [1.60943791],
+           [1.38629436]])
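A minimal sketch of the serialization point made above, assuming only standard ``pickle`` semantics: ``NumericalTransformer`` stores just the function name, so it round-trips, while a transformer wrapping a lambda (scikit-learn's ``FunctionTransformer`` is used here for the comparison) cannot be pickled at all:

    import pickle

    import numpy as np
    from sklearn.preprocessing import FunctionTransformer
    from sklearn_pandas import NumericalTransformer

    # Stores only the string 'log', so the instance pickles cleanly.
    restored = pickle.loads(pickle.dumps(NumericalTransformer('log')))
    print(restored.transform(np.array([[1.0], [np.e]])))  # roughly [[0.], [1.]]

    # A lambda-based transformer fails: pickle cannot serialize lambdas.
    try:
        pickle.dumps(FunctionTransformer(lambda x: np.log(x)))
    except Exception as exc:
        print(type(exc).__name__)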
+
+Changing Logging level
+***********************************
 
-Sklearn-pandas' ``cross_val_score`` function provides exactly the same interface as sklearn's function of the same name.
+You can change the log level to ``INFO`` to print the time taken to fit/transform each feature. Setting it to a
+higher level will stop printing elapsed times.
+The example below shows how to change the logging level.
 
-``CategoricalImputer``
-**********************
-Since the ``scikit-learn`` ``Imputer`` transformer currently only works with
-numbers, ``sklearn-pandas`` provides an equivalent helper transformer that
-works with strings, substituting null values with the most frequent value in
-that column. Alternatively, you can specify a fixed value to use.
+    >>> import logging
+    >>> logging.getLogger('sklearn_pandas').setLevel(logging.INFO)
 
-Example: imputing with the mode:
-
-    >>> from sklearn_pandas import CategoricalImputer
-    >>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
-    >>> imputer = CategoricalImputer()
-    >>> imputer.fit_transform(data)
-    array(['a', 'b', 'b', 'b'], dtype=object)
+
+Changelog
+---------
+
+2.0.3 (2020-11-06)
+******************
 
-Example: imputing with a fixed value:
+* Added elapsed time information for each feature
 
-    >>> from sklearn_pandas import CategoricalImputer
-    >>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
-    >>> imputer = CategoricalImputer(strategy='constant', fill_value='a')
-    >>> imputer.fit_transform(data)
-    array(['a', 'b', 'b', 'a'], dtype=object)
+2.0.2 (2020-10-01)
+******************
 
-``FunctionTransformer``
-***********************
+* Fix `DataFrameMapper` drop_cols attribute naming consistency with scikit-learn and initialization.
 
-Often one wants to apply simple transformations to data such as ``np.log``. ``FunctionTransformer`` is a simple wrapper that takes any function and applies vectorization so that it can be used as a transformer.
-Example:
+2.0.1 (2020-09-07)
+******************
 
-    >>> from sklearn_pandas import FunctionTransformer
-    >>> array = np.array([10, 100])
-    >>> transformer = FunctionTransformer(np.log10)
+* Added an option to explicitly drop columns.
 
-    >>> transformer.fit_transform(array)
-    array([1., 2.])
 
-Changelog
----------
+2.0.0 (2020-08-01)
+******************
+
+* Deprecated support for Python < 3.6.
+* Deprecated support for old versions of scikit-learn, pandas and numpy. Please check setup.py for the minimum
+  requirements.
+* Removed CategoricalImputer, cross_val_score and GridSearchCV. All of this functionality now exists as part of
+  scikit-learn. Please use SimpleImputer instead of CategoricalImputer. Also, cross-validation in
+  scikit-learn now supports DataFrames, so the cross-validation wrapper provided here is no
+  longer needed.
+* Added ``NumericalTransformer`` for common numerical transformations. Currently it implements log and log1p
+  transformations.
+* Added prefix and suffix options. See examples above. These are usually helpful when using gen_features.
+* Added ``drop_cols`` argument to DataFrameMapper. This can be used to explicitly drop columns.
+
 1.8.0 (2018-12-01)
 ******************
+
 * Add ``FunctionTransformer`` class (#117).
 * Fix column names derivation for dataframes with multi-index or non-string
   columns (#166).
 * Change behaviour of DataFrameMapper's fit_transform method to invoke each underlying transformers'
   native fit_transform if implemented. (#150)
 
+
 1.7.0 (2018-08-15)
 ******************
+
 * Fix issues with unicode names in ``get_names`` (#160).
 * Update to build using ``numpy==1.14`` and ``python==3.6`` (#154).
 * Add ``strategy`` and ``fill_value`` parameters to ``CategoricalImputer`` to allow imputing
   with values other than the mode (#144), (#161).
 * Preserve input data types when no transform is supplied (#138).
 
+
 1.6.0 (2017-10-28)
 ******************
+
 * Add column name to exception during fit/transform (#110).
 * Add ``gen_feature`` helper function to help generating the same transformation for multiple columns (#126).
 
 1.5.0 (2017-06-24)
 ******************
+
 * Allow inputting a dataframe/series per group of columns.
 * Get feature names also from ``estimator.get_feature_names()`` if present.
 * Attempt to derive feature names from individual transformers when applying a
@@ -457,6 +528,7 @@
 
 1.4.0 (2017-05-13)
 ******************
+
 * Allow specifying a custom name (alias) for transformed columns (#83).
 * Capture output columns generated names in ``transformed_names_`` attribute (#78).
 * Add ``CategoricalImputer`` that replaces null-like values with the mode
@@ -534,3 +606,5 @@
 * Timothy Sweetser (@hacktuarial)
 * Vitaley Zaretskey (@vzaretsk)
 * Zac Stewart (@zacstewart)
+* Parul Singh (@paro1234)
+* Vincent Heusinkveld (@VHeusinkveld)
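Because this release drops ``CategoricalImputer`` and the ``cross_val_score`` wrapper, downstream code has to switch to the scikit-learn equivalents named in the changelog. A minimal migration sketch, assuming scikit-learn>=0.23 as required by the new setup.py (the data mirrors the README's pet/children/salary columns):

    import numpy as np
    import pandas as pd
    import sklearn.preprocessing
    from sklearn.impute import SimpleImputer              # replaces CategoricalImputer
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score   # replaces the removed wrapper
    from sklearn.pipeline import Pipeline
    from sklearn_pandas import DataFrameMapper

    df = pd.DataFrame({'pet': ['cat', 'dog', 'dog', np.nan],
                       'children': [1.0, 2.0, 3.0, 2.0],
                       'salary': [90.0, 24.0, 44.0, 27.0]})

    # SimpleImputer(strategy='most_frequent') also works on string columns.
    mapper = DataFrameMapper([(['pet'], SimpleImputer(strategy='most_frequent'))])
    print(mapper.fit_transform(df))  # [['cat'] ['dog'] ['dog'] ['dog']]

    # scikit-learn's own cross_val_score now accepts a DataFrame as X.
    pipe = Pipeline([
        ('featurize', DataFrameMapper([(['children'], sklearn.preprocessing.StandardScaler())])),
        ('lm', LinearRegression()),
    ])
    print(cross_val_score(pipe, X=df, y=df['salary'], cv=2))  # scores will vary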
diff -Nru sklearn-pandas-1.8.0/setup.py sklearn-pandas-2.0.3/setup.py
--- sklearn-pandas-1.8.0/setup.py	2016-04-03 11:14:44.000000000 +0000
+++ sklearn-pandas-2.0.3/setup.py	2020-11-27 13:01:14.000000000 +0000
@@ -32,16 +32,17 @@
 setup(name='sklearn-pandas',
       version=__version__,
       description='Pandas integration with sklearn',
-      maintainer='Israel Saeta Pérez',
-      maintainer_email='israel.saeta@dukebody.com',
-      url='https://github.com/paulgb/sklearn-pandas',
+      maintainer='Ritesh Agrawal',
+      maintainer_email='ragrawal@gmail.com',
+      url='https://github.com/scikit-learn-contrib/sklearn-pandas',
       packages=['sklearn_pandas'],
       keywords=['scikit', 'sklearn', 'pandas'],
       install_requires=[
-          'scikit-learn>=0.15.0',
-          'scipy>=0.14',
-          'pandas>=0.11.0',
-          'numpy>=1.6.1'],
+          'scikit-learn>=0.23.0',
+          'scipy>=1.4.1',
+          'pandas>=1.0.5',
+          'numpy>=1.18.1'
+      ],
       tests_require=['pytest', 'mock'],
       cmdclass={'test': PyTest},
       )
diff -Nru sklearn-pandas-1.8.0/sklearn_pandas/categorical_imputer.py sklearn-pandas-2.0.3/sklearn_pandas/categorical_imputer.py
--- sklearn-pandas-1.8.0/sklearn_pandas/categorical_imputer.py	2018-10-21 10:55:27.000000000 +0000
+++ sklearn-pandas-2.0.3/sklearn_pandas/categorical_imputer.py	1970-01-01 00:00:00.000000000 +0000
@@ -1,134 +0,0 @@
-import pandas as pd
-import numpy as np
-
-
-from sklearn.base import BaseEstimator, TransformerMixin
-from sklearn.utils.validation import check_is_fitted
-
-
-def _get_mask(X, value):
-    """
-    Compute the boolean mask X == missing_values.
-    """
-    if value == "NaN" or \
-       value is None or \
-       (isinstance(value, float) and np.isnan(value)):
-        return pd.isnull(X)
-    else:
-        return X == value
-
-
-class CategoricalImputer(BaseEstimator, TransformerMixin):
-    """
-    Impute missing values from a categorical/string np.ndarray or pd.Series
-    with the most frequent value on the training data.
-
-    Parameters
-    ----------
-    missing_values : string or "NaN", optional (default="NaN")
-        The placeholder for the missing values. All occurrences of
-        `missing_values` will be imputed. None and np.nan are treated
-        as being the same, use the string value "NaN" for them.
-
-    copy : boolean, optional (default=True)
-        If True, a copy of X will be created.
-
-    strategy : string, optional (default = 'most_frequent')
-        The imputation strategy.
-
-        - If "most_frequent", then replace missing using the most frequent
-          value along each column. Can be used with strings or numeric data.
-        - If "constant", then replace missing values with fill_value. Can be
-          used with strings or numeric data.
-
-    fill_value : string, optional (default='?')
-        The value that all instances of `missing_values` are replaced
-        with if `strategy` is set to `constant`. This is useful if
-        you don't want to impute with the mode, or if there are multiple
-        modes in your data and you want to choose a particular one. If
-        `strategy` is not set to `constant`, this parameter is ignored.
-
-    Attributes
-    ----------
-    fill_ : str
-        The imputation fill value
-
-    """
-
-    def __init__(
-        self,
-        missing_values='NaN',
-        strategy='most_frequent',
-        fill_value='?',
-        copy=True
-    ):
-        self.missing_values = missing_values
-        self.copy = copy
-        self.fill_value = fill_value
-        self.strategy = strategy
-
-        strategies = ['constant', 'most_frequent']
-        if self.strategy not in strategies:
-            raise ValueError(
-                'Strategy {0} not in {1}'.format(self.strategy, strategies)
-            )
-
-    def fit(self, X, y=None):
-        """
-
-        Get the most frequent value.
-
-        Parameters
-        ----------
-        X : np.ndarray or pd.Series
-            Training data.
-
-        y : Passthrough for ``Pipeline`` compatibility.
-
-        Returns
-        -------
-        self: CategoricalImputer
-        """
-
-        mask = _get_mask(X, self.missing_values)
-        X = X[~mask]
-        if self.strategy == 'most_frequent':
-            modes = pd.Series(X).mode()
-        elif self.strategy == 'constant':
-            modes = np.array([self.fill_value])
-        if modes.shape[0] == 0:
-            raise ValueError('Data is empty or all values are null')
-        elif modes.shape[0] > 1:
-            raise ValueError('No value is repeated more than '
-                             'once in the column')
-        else:
-            self.fill_ = modes[0]
-
-        return self
-
-    def transform(self, X):
-        """
-
-        Replaces missing values in the input data with the most frequent value
-        of the training data.
-
-        Parameters
-        ----------
-        X : np.ndarray or pd.Series
-            Data with values to be imputed.
-
-        Returns
-        -------
-        np.ndarray
-            Data with imputed values.
-        """
-
-        check_is_fitted(self, 'fill_')
-
-        if self.copy:
-            X = X.copy()
-
-        mask = _get_mask(X, self.missing_values)
-        X[mask] = self.fill_
-
-        return np.asarray(X)
diff -Nru sklearn-pandas-1.8.0/sklearn_pandas/cross_validation.py sklearn-pandas-2.0.3/sklearn_pandas/cross_validation.py
--- sklearn-pandas-1.8.0/sklearn_pandas/cross_validation.py	2017-04-17 10:14:52.000000000 +0000
+++ sklearn-pandas-2.0.3/sklearn_pandas/cross_validation.py	2020-11-27 13:01:14.000000000 +0000
@@ -1,59 +1,3 @@
-import warnings
-try:
-    from sklearn.model_selection import cross_val_score as sk_cross_val_score
-    from sklearn.model_selection import GridSearchCV as SKGridSearchCV
-    from sklearn.model_selection import RandomizedSearchCV as \
-        SKRandomizedSearchCV
-except ImportError:
-    from sklearn.cross_validation import cross_val_score as sk_cross_val_score
-    from sklearn.grid_search import GridSearchCV as SKGridSearchCV
-    from sklearn.grid_search import RandomizedSearchCV as SKRandomizedSearchCV
-
-DEPRECATION_MSG = '''
-    Custom cross-validation compatibility shims are no longer needed for
-    scikit-learn>=0.16.0 and will be dropped in sklearn-pandas==2.0.
-'''
-
-
-def cross_val_score(model, X, *args, **kwargs):
-    warnings.warn(DEPRECATION_MSG, DeprecationWarning)
-    X = DataWrapper(X)
-    return sk_cross_val_score(model, X, *args, **kwargs)
-
-
-class GridSearchCV(SKGridSearchCV):
-
-    def __init__(self, *args, **kwargs):
-        warnings.warn(DEPRECATION_MSG, DeprecationWarning)
-        super(GridSearchCV, self).__init__(*args, **kwargs)
-
-    def fit(self, X, *params, **kwparams):
-        return super(GridSearchCV, self).fit(
-            DataWrapper(X), *params, **kwparams)
-
-    def predict(self, X, *params, **kwparams):
-        return super(GridSearchCV, self).predict(
-            DataWrapper(X), *params, **kwparams)
-
-
-try:
-    class RandomizedSearchCV(SKRandomizedSearchCV):
-
-        def __init__(self, *args, **kwargs):
-            warnings.warn(DEPRECATION_MSG, DeprecationWarning)
-            super(RandomizedSearchCV, self).__init__(*args, **kwargs)
-
-        def fit(self, X, *params, **kwparams):
-            return super(RandomizedSearchCV, self).fit(
-                DataWrapper(X), *params, **kwparams)
-
-        def predict(self, X, *params, **kwparams):
-            return super(RandomizedSearchCV, self).predict(
-                DataWrapper(X), *params, **kwparams)
-except AttributeError:
-    pass
-
-
 class DataWrapper(object):
 
     def __init__(self, df):
diff -Nru sklearn-pandas-1.8.0/sklearn_pandas/dataframe_mapper.py sklearn-pandas-2.0.3/sklearn_pandas/dataframe_mapper.py
--- sklearn-pandas-1.8.0/sklearn_pandas/dataframe_mapper.py	2018-08-15 12:42:44.000000000 +0000
+++ sklearn-pandas-2.0.3/sklearn_pandas/dataframe_mapper.py	2020-11-27 13:01:14.000000000 +0000
@@ -1,6 +1,6 @@
-import sys
 import contextlib
+from datetime import datetime
 
 import pandas as pd
 import numpy as np
 from scipy import sparse
@@ -8,13 +8,9 @@
 
 from .cross_validation import DataWrapper
 from .pipeline import make_transformer_pipeline, _call_fit, TransformerPipeline
+from . import logger
 
-PY3 = sys.version_info[0] == 3
-if PY3:
-    string_types = text_type = str
-else:
-    string_types = basestring  # noqa
-    text_type = unicode  # noqa
+string_types = text_type = str
 
 
 def _handle_feature(fea):
@@ -37,6 +33,10 @@
     return (columns, _build_transformer(transformers), options)
 
 
+def _elapsed_secs(t1):
+    return (datetime.now()-t1).total_seconds()
+
+
 def _get_feature_names(estimator):
     """
     Attempt to extract feature names based on a given estimator
@@ -69,7 +69,7 @@
     """
 
     def __init__(self, features, default=False, sparse=False, df_out=False,
-                 input_df=False):
+                 input_df=False, drop_cols=None):
         """
         Params:
 
                     The first element is the pandas column selector. This can
                     be a string (for one column) or a list of strings.
                     The second element is an object that supports
-                    sklearn's transform interface, or a list of such objects.
+                    sklearn's transform interface, or a list of such objects
                     The third element is optional and, if present, must be
                     a dictionary with the options to apply to the
                     transformation. Example: {'alias': 'day_of_week'}
@@ -101,16 +101,18 @@
         input_df    If ``True`` pass the selected columns to the transformers
                     as a pandas DataFrame or Series. Otherwise pass them as a
                     numpy array. Defaults to ``False``.
+
+        drop_cols   List of columns to be dropped. Defaults to None.
+ """ self.features = features - self.built_features = None self.default = default self.built_default = None self.sparse = sparse self.df_out = df_out self.input_df = input_df + self.drop_cols = [] if drop_cols is None else drop_cols self.transformed_names_ = [] - if (df_out and (sparse or default)): raise ValueError("Can not use df_out with sparse or default") @@ -149,7 +151,8 @@ """ X_columns = list(X.columns) return [column for column in X_columns if - column not in self._selected_columns] + column not in self._selected_columns + and column not in self.drop_cols] def __setstate__(self, state): # compatibility for older versions of sklearn-pandas @@ -158,6 +161,7 @@ self.default = state.get('default', False) self.df_out = state.get('df_out', False) self.input_df = state.get('input_df', False) + self.drop_cols = state.get('drop_cols', []) self.built_features = state.get('built_features', self.features) self.built_default = state.get('built_default', self.default) self.transformed_names_ = state.get('transformed_names_', []) @@ -211,12 +215,14 @@ self._build() for columns, transformers, options in self.built_features: + t1 = datetime.now() input_df = options.get('input_df', self.input_df) if transformers is not None: with add_column_names_to_exception(columns): Xt = self._get_col_subset(X, columns, input_df) _call_fit(transformers.fit, Xt, y) + logger.info(f"[FIT] {columns}: {_elapsed_secs(t1)} secs") # handle features not explicitly selected if self.built_default: # not False and not None @@ -226,7 +232,8 @@ _call_fit(self.built_default.fit, Xt, y) return self - def get_names(self, columns, transformer, x, alias=None): + def get_names(self, columns, transformer, x, alias=None, prefix='', + suffix=''): """ Return verbose names for the transformed columns. @@ -242,6 +249,9 @@ else: name = columns num_cols = x.shape[1] if len(x.shape) > 1 else 1 + + output = [] + if num_cols > 1: # If there are as many columns as classes in the transformer, # infer column names from classes names. @@ -257,13 +267,19 @@ # Otherwise use the only estimator present else: names = _get_feature_names(transformer) + if names is not None and len(names) == num_cols: - return ['%s_%s' % (name, o) for o in names] - # otherwise, return name concatenated with '_1', '_2', etc. + output = [f"{name}_{o}" for o in names] + # otherwise, return name concatenated with '_1', '_2', etc. else: - return [name + '_' + str(o) for o in range(num_cols)] + output = [name + '_' + str(o) for o in range(num_cols)] else: - return [name] + output = [name] + + if prefix == suffix == "": + return output + + return ['{}{}{}'.format(prefix, x, suffix) for x in output] def get_dtypes(self, extracted): dtypes_features = [self.get_dtype(ex) for ex in extracted] @@ -296,19 +312,32 @@ # strings; we don't care because pandas # will handle either. 
             Xt = self._get_col_subset(X, columns, input_df)
+
             if transformers is not None:
                 with add_column_names_to_exception(columns):
                     if do_fit and hasattr(transformers, 'fit_transform'):
+                        t1 = datetime.now()
                         Xt = _call_fit(transformers.fit_transform, Xt, y)
+                        logger.info(f"[FIT_TRANSFORM] {columns}: {_elapsed_secs(t1)} secs")  # NOQA
                     else:
                         if do_fit:
+                            t1 = datetime.now()
                             _call_fit(transformers.fit, Xt, y)
+                            logger.info(
+                                f"[FIT] {columns}: {_elapsed_secs(t1)} secs")
+
+                        t1 = datetime.now()
                         Xt = transformers.transform(Xt)
+                        logger.info(f"[TRANSFORM] {columns}: {_elapsed_secs(t1)} secs")  # NOQA
+
             extracted.append(_handle_feature(Xt))
 
             alias = options.get('alias')
+            prefix = options.get('prefix', '')
+            suffix = options.get('suffix', '')
+
             self.transformed_names_ += self.get_names(
-                columns, transformers, Xt, alias)
+                columns, transformers, Xt, alias, prefix, suffix)
 
         # handle features not explicitly selected
         if self.built_default is not False:
@@ -328,6 +357,7 @@
                 # if not applying a default transformer,
                 # keep column names unmodified
                 self.transformed_names_ += unsel_cols
+
             extracted.append(_handle_feature(Xt))
 
         # combine the feature outputs into one array.
diff -Nru sklearn-pandas-1.8.0/sklearn_pandas/features_generator.py sklearn-pandas-2.0.3/sklearn_pandas/features_generator.py
--- sklearn-pandas-1.8.0/sklearn_pandas/features_generator.py	2017-10-22 17:58:20.000000000 +0000
+++ sklearn-pandas-2.0.3/sklearn_pandas/features_generator.py	2020-11-27 13:01:14.000000000 +0000
@@ -1,4 +1,4 @@
-def gen_features(columns, classes=None):
+def gen_features(columns, classes=None, prefix='', suffix=''):
     """Generates a feature definition list which can be passed
     into DataFrameMapper
 
@@ -25,6 +25,10 @@
 
         If None value selected, then each feature left as is.
 
+    prefix      add prefix to transformed column names
+
+    suffix      add suffix to transformed column names.
+ """ if classes is None: return [(column, None) for column in columns] @@ -34,9 +38,15 @@ for column in columns: feature_transformers = [] + arguments = {} + if prefix and prefix != "": + arguments['prefix'] = prefix + if suffix and suffix != "": + arguments['suffix'] = suffix + classes = [cls for cls in classes if cls is not None] if not classes: - feature_defs.append((column, None)) + feature_defs.append((column, None, arguments)) else: for definition in classes: @@ -50,6 +60,6 @@ if not feature_transformers: feature_transformers = None - feature_defs.append((column, feature_transformers)) + feature_defs.append((column, feature_transformers, arguments)) return feature_defs diff -Nru sklearn-pandas-1.8.0/sklearn_pandas/__init__.py sklearn-pandas-2.0.3/sklearn_pandas/__init__.py --- sklearn-pandas-1.8.0/sklearn_pandas/__init__.py 2018-12-01 19:13:33.000000000 +0000 +++ sklearn-pandas-2.0.3/sklearn_pandas/__init__.py 2020-11-27 13:01:14.000000000 +0000 @@ -1,6 +1,8 @@ -__version__ = '1.8.0' +__version__ = '2.0.3' + +import logging +logger = logging.getLogger(__name__) from .dataframe_mapper import DataFrameMapper # NOQA -from .cross_validation import cross_val_score, GridSearchCV, RandomizedSearchCV # NOQA -from .transformers import CategoricalImputer, FunctionTransformer # NOQA from .features_generator import gen_features # NOQA +from .transformers import NumericalTransformer # NOQA diff -Nru sklearn-pandas-1.8.0/sklearn_pandas/transformers.py sklearn-pandas-2.0.3/sklearn_pandas/transformers.py --- sklearn-pandas-1.8.0/sklearn_pandas/transformers.py 2018-12-01 19:13:29.000000000 +0000 +++ sklearn-pandas-2.0.3/sklearn_pandas/transformers.py 2020-11-27 13:01:14.000000000 +0000 @@ -1,8 +1,6 @@ import numpy as np import pandas as pd - -from sklearn.base import BaseEstimator, TransformerMixin -from sklearn.utils.validation import check_is_fitted +from sklearn.base import TransformerMixin def _get_mask(X, value): @@ -17,136 +15,33 @@ return X == value -class CategoricalImputer(BaseEstimator, TransformerMixin): +class NumericalTransformer(TransformerMixin): """ - Impute missing values from a categorical/string np.ndarray or pd.Series - with the most frequent value on the training data. - - Parameters - ---------- - missing_values : string or "NaN", optional (default="NaN") - The placeholder for the missing values. All occurrences of - `missing_values` will be imputed. None and np.nan are treated - as being the same, use the string value "NaN" for them. - - copy : boolean, optional (default=True) - If True, a copy of X will be created. - - strategy : string, optional (default = 'most_frequent') - The imputation strategy. - - - If "most_frequent", then replace missing using the most frequent - value along each column. Can be used with strings or numeric data. - - If "constant", then replace missing values with fill_value. Can be - used with strings or numeric data. - - fill_value : string, optional (default='?') - The value that all instances of `missing_values` are replaced - with if `strategy` is set to `constant`. This is useful if - you don't want to impute with the mode, or if there are multiple - modes in your data and you want to choose a particular one. If - `strategy` is not set to `constant`, this parameter is ignored. - - Attributes - ---------- - fill_ : str - The imputation fill value - + Provides commonly used numerical transformers. 
""" + SUPPORTED_FUNCTIONS = ['log', 'log1p'] - def __init__( - self, - missing_values='NaN', - strategy='most_frequent', - fill_value='?', - copy=True - ): - self.missing_values = missing_values - self.copy = copy - self.fill_value = fill_value - self.strategy = strategy - - strategies = ['constant', 'most_frequent'] - if self.strategy not in strategies: - raise ValueError( - 'Strategy {0} not in {1}'.format(self.strategy, strategies) - ) - - def fit(self, X, y=None): - """ - - Get the most frequent value. - - Parameters - ---------- - X : np.ndarray or pd.Series - Training data. - - y : Passthrough for ``Pipeline`` compatibility. - - Returns - ------- - self: CategoricalImputer - """ - - mask = _get_mask(X, self.missing_values) - X = X[~mask] - if self.strategy == 'most_frequent': - modes = pd.Series(X).mode() - elif self.strategy == 'constant': - modes = np.array([self.fill_value]) - if modes.shape[0] == 0: - raise ValueError('Data is empty or all values are null') - elif modes.shape[0] > 1: - raise ValueError('No value is repeated more than ' - 'once in the column') - else: - self.fill_ = modes[0] - - return self - - def transform(self, X): + def __init__(self, func): """ + Params - Replaces missing values in the input data with the most frequent value - of the training data. - - Parameters - ---------- - X : np.ndarray or pd.Series - Data with values to be imputed. - - Returns - ------- - np.ndarray - Data with imputed values. + func function to apply to input columns. The function will be + applied to each value. Supported functions are defined + in SUPPORTED_FUNCTIONS variable. Throws assertion error if the + not supported. """ - - check_is_fitted(self, 'fill_') - - if self.copy: - X = X.copy() - - mask = _get_mask(X, self.missing_values) - X[mask] = self.fill_ - - return np.asarray(X) - - -class FunctionTransformer(BaseEstimator, TransformerMixin): - """ - Use this class to convert a random function into a - transformer. 
- """ - - def __init__(self, func): + assert func in self.SUPPORTED_FUNCTIONS, \ + f"Only following func are supported: {self.SUPPORTED_FUNCTIONS}" + super(NumericalTransformer, self).__init__() self.__func = func - def fit(self, x, y=None): + def fit(self, X, y=None): return self - def transform(self, x): - return np.vectorize(self.__func)(x) + def transform(self, X, y=None): + if self.__func == 'log1p': + return np.vectorize(np.log1p)(X) + elif self.__func == 'log': + return np.vectorize(np.log)(X) - def __call__(self, *args, **kwargs): - return self.__func(*args, **kwargs) + raise ValueError(f"Invalid function name: {self.__func}") diff -Nru sklearn-pandas-1.8.0/sklearn_pandas.egg-info/PKG-INFO sklearn-pandas-2.0.3/sklearn_pandas.egg-info/PKG-INFO --- sklearn-pandas-1.8.0/sklearn_pandas.egg-info/PKG-INFO 2018-12-01 19:14:57.000000000 +0000 +++ sklearn-pandas-2.0.3/sklearn_pandas.egg-info/PKG-INFO 2020-11-27 13:01:14.000000000 +0000 @@ -1,12 +1,11 @@ -Metadata-Version: 1.0 +Metadata-Version: 1.2 Name: sklearn-pandas -Version: 1.8.0 +Version: 2.0.3 Summary: Pandas integration with sklearn -Home-page: https://github.com/paulgb/sklearn-pandas -Author: Israel Saeta Pérez -Author-email: israel.saeta@dukebody.com +Home-page: https://github.com/scikit-learn-contrib/sklearn-pandas +Maintainer: Ritesh Agrawal +Maintainer-email: ragrawal@gmail.com License: UNKNOWN -Description-Content-Type: UNKNOWN Description: UNKNOWN Keywords: scikit,sklearn,pandas Platform: UNKNOWN diff -Nru sklearn-pandas-1.8.0/sklearn_pandas.egg-info/requires.txt sklearn-pandas-2.0.3/sklearn_pandas.egg-info/requires.txt --- sklearn-pandas-1.8.0/sklearn_pandas.egg-info/requires.txt 2018-12-01 19:14:57.000000000 +0000 +++ sklearn-pandas-2.0.3/sklearn_pandas.egg-info/requires.txt 2020-11-27 13:01:14.000000000 +0000 @@ -1,4 +1,4 @@ -scikit-learn>=0.15.0 -scipy>=0.14 -pandas>=0.11.0 -numpy>=1.6.1 +scikit-learn>=0.23.0 +scipy>=1.4.1 +pandas>=1.0.5 +numpy>=1.18.1 diff -Nru sklearn-pandas-1.8.0/sklearn_pandas.egg-info/SOURCES.txt sklearn-pandas-2.0.3/sklearn_pandas.egg-info/SOURCES.txt --- sklearn-pandas-1.8.0/sklearn_pandas.egg-info/SOURCES.txt 2018-12-01 19:14:57.000000000 +0000 +++ sklearn-pandas-2.0.3/sklearn_pandas.egg-info/SOURCES.txt 2020-11-27 13:01:14.000000000 +0000 @@ -4,7 +4,6 @@ setup.cfg setup.py sklearn_pandas/__init__.py -sklearn_pandas/categorical_imputer.py sklearn_pandas/cross_validation.py sklearn_pandas/dataframe_mapper.py sklearn_pandas/features_generator.py