skbonus.pandas package¶
Submodules¶
skbonus.pandas.preprocessing module¶
Preprocess data for training with a focus on pandas compatibility.
class skbonus.pandas.preprocessing.DateTimeExploder(name: str, start_column: str, end_column: str, frequency: str, drop: bool = True)¶
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Transform a pandas dataframe with columns (*, start_date, end_date) into a longer format with columns (*, date).
This is useful if you deal with datasets that contain special time periods per row, but you need a single date per row. See the examples for more details.
- Parameters
name (str) – Name of the new output date column.
start_column (str) – Start date of the period.
end_column (str) – End date of the period.
frequency (str) – A pandas time frequency. Can take values like “d” for day or “m” for month. A full list can be found on https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases. If None, the transformer tries to infer it during fit time.
drop (bool, default=True) – Whether to drop the start_column and end_column in the transformed output.
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "Data": ["a", "b", "c"],
...     "Start": pd.date_range("2020-01-01", periods=3),
...     "End": pd.date_range("2020-01-03", periods=3)
... })
>>> df
  Data      Start        End
0    a 2020-01-01 2020-01-03
1    b 2020-01-02 2020-01-04
2    c 2020-01-03 2020-01-05
>>> DateTimeExploder(name="output_date", start_column="Start", end_column="End", frequency="d").fit_transform(df)
  Data output_date
0    a  2020-01-01
0    a  2020-01-02
0    a  2020-01-03
1    b  2020-01-02
1    b  2020-01-03
1    b  2020-01-04
2    c  2020-01-03
2    c  2020-01-04
2    c  2020-01-05
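The example above uses the default drop=True, so the Start and End columns are removed. Below is a minimal sketch of the drop=False case (not part of the original docstring); per the drop parameter description, the period columns should then remain alongside the new date column:
>>> exploder = DateTimeExploder(
...     name="output_date",
...     start_column="Start",
...     end_column="End",
...     frequency="d",
...     drop=False,
... )
>>> out = exploder.fit_transform(df)  # out should keep "Start" and "End" in addition to "output_date"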
fit(X: pandas.core.frame.DataFrame, y=None) → skbonus.pandas.preprocessing.DateTimeExploder¶
Fits the estimator.
In this special case, nothing is done.
- Parameters
X (Ignored) – Not used, present here for API consistency by convention.
y (Ignored) – Not used, present here for API consistency by convention.
- Returns
Fitted transformer.
- Return type
DateTimeExploder
transform(X: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame¶
Transform the input.
- Parameters
X (pd.DataFrame) – A pandas dataframe with the columns self.start_column and self.end_column containing dates.
- Returns
A longer dataframe with one date per row.
- Return type
pd.DataFrame
class skbonus.pandas.preprocessing.OneHotEncoderWithNames(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')¶
Bases: sklearn.preprocessing._encoders.OneHotEncoder
Razor-thin layer around scikit-learn’s OneHotEncoder class to return a pandas dataframe with the appropriate column names.
Description from the maintainers of scikit-learn:
Encode categorical features as a one-hot numeric array. The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter).
By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.
This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels. Note: a one-hot encoding of y labels should use a LabelBinarizer instead.
- Parameters
categories ('auto' or a list of array-like, default='auto') –
Categories (unique values) per feature:
’auto’ : Determine categories automatically from the training data.
list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.
The used categories can be found in the categories_ attribute.
drop ({'first', 'if_binary'} or an array-like of shape (n_features,), default=None) –
Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression. However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.
None : retain all features (the default).
’first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.
’if_binary’ : drop the first category in each feature with two categories. Features with 1 or more than 2 categories are left intact.
array : drop[i] is the category in feature X[:, i] that should be dropped.
sparse (bool, default=True) – Will return a sparse matrix if set to True, else will return an array.
dtype (number type, default=float) – Desired dtype of output.
handle_unknown ({'error', 'ignore'}, default='error') – Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
categories_¶
The categories of each feature determined during fitting (in order of the features in X and corresponding with the output of transform). This includes the category specified in drop (if any).
- Type
list of arrays
drop_idx_¶
drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature. drop_idx_[i] = None if no category is to be dropped from the feature with index i, e.g. when drop=’if_binary’ and the feature isn’t binary. drop_idx_ = None if all the transformed features will be retained.
- Type
array of shape (n_features,)
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2, 1], 'B': ['a', 'b', 'c']})
>>> OneHotEncoderWithNames().fit_transform(df)
   A_1  A_2  B_a  B_b  B_c
0    1    0    1    0    0
1    0    1    0    1    0
2    1    0    0    0    1
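Since the class inherits from scikit-learn's OneHotEncoder, the parameters documented above pass through unchanged. Below is a minimal sketch (not from the original docstring) of handle_unknown='ignore'; per the parameter description, a category unseen during fit should yield all-zero columns for that feature:
>>> train = pd.DataFrame({"B": ["a", "b"]})
>>> unseen = pd.DataFrame({"B": ["c"]})  # "c" never appeared during fit
>>> encoder = OneHotEncoderWithNames(handle_unknown="ignore")
>>> encoder.fit(train).transform(unseen)  # expected: both B_a and B_b are 0 in the single output row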
fit(X: pandas.core.frame.DataFrame, y: None = None) → skbonus.pandas.preprocessing.OneHotEncoderWithNames¶
Fit a OneHotEncoder while also storing the dataframe column names, which lets us check whether the columns match when calling the transform method.
- Parameters
X (pd.DataFrame) – Fit the OneHotEncoder on this dataframe.
y (Ignored) – Not used, present here for API consistency by convention.
- Returns
Fitted transformer.
- Return type
OneHotEncoderWithNames
transform(X: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame¶
One hot encode the input dataframe.
- Parameters
X (pd.DataFrame) – Input to be one hot encoded. The column names should be the same as during the fit method, including the same order.
- Returns
A pandas dataframe containing the one hot encoded data and proper column names.
- Return type
pd.DataFrame
- Raises
AssertionError – If the column names during training and transformation time are not identical.
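A short sketch of the column-name check described under Raises (reusing the df from the example above; not part of the original docstring). Transforming a dataframe whose columns are reordered relative to fit time should trigger the error:
>>> encoder = OneHotEncoderWithNames().fit(df)
>>> encoder.transform(df[["B", "A"]])  # raises AssertionError because the column order differs from fit time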
skbonus.pandas.utils module¶
Deal with dataframes.
skbonus.pandas.utils.make_df_output(estimator: Any) → Any¶
Make a scikit-learn transformer output pandas dataframes, if its inputs are dataframes.
- Parameters
estimator (scikit-learn transformer) – Some transformer with a transform method.
- Returns
Transformer with an altered transform method that outputs a dataframe with the same columns and index as the input X.
- Return type
scikit-learn transformer
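A minimal usage sketch (the StandardScaler choice is purely illustrative and not taken from the skbonus docs): wrapping an ordinary scikit-learn transformer so that transforming a dataframe yields a dataframe again, with the input's columns and index preserved:
>>> import pandas as pd
>>> from sklearn.preprocessing import StandardScaler
>>> from skbonus.pandas.utils import make_df_output
>>> df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]}, index=["x", "y", "z"])
>>> scaler = make_df_output(StandardScaler())
>>> scaler.fit_transform(df)  # a dataframe with columns "a", "b" and index "x", "y", "z"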
Module contents¶
This module contains classes for an easy workflow with pandas and scikit-learn.
Usually methods take dataframes as an input and return dataframes again.