encoderpy package
Submodules
encoderpy.conjugate_encoder module
encoderpy.conjugate_encoder.conjugate_encoder(X_train, y, cat_columns, prior_params, X_test=None, objective='regression')
This function encodes categorical variables by fitting a posterior distribution for each category to the target variable y, using a known conjugate prior. The mean(s) of each category's posterior distribution are used as the encodings.
Parameters:
- X_train (pd.DataFrame) – A pandas dataframe representing the training data set containing some categorical features/columns.
- y (pd.Series) – A pandas series representing the target variable. If the objective is “binary”, then this series should only contain two unique values.
- cat_columns (list) – The names of the categorical features in X_train and/or X_test.
- prior_params (dict) – A dictionary of parameters for each prior distribution assumed. For regression, this requires a dictionary with four keys and four values: mu, vega, alpha, beta. All must be real numbers, and must be greater than 0 except for mu, which can be negative. A value of alpha > 1 is strongly advised. For binary classification, this requires a dictionary with two keys and two values: alpha, beta. All must be real numbers and be greater than 0.
- X_test (pd.DataFrame) – A pandas dataframe representing the test set, containing some set of categorical features/columns. This is an optional argument.
- objective (str) – A string, either “regression” or “binary”, specifying the problem. The default is “regression”. For regression, a normal-inverse-gamma prior with a normal likelihood is assumed. For binary classification, a beta prior with a binomial likelihood is assumed.
Returns:
- train_processed (pd.DataFrame) – The training set, with the categorical columns specified by the argument cat_columns replaced by their encodings. For regression, each categorical column is replaced by 2 columns, since the normal-inverse-gamma distribution is two-dimensional. For binary classification, each categorical column is replaced by 1 column.
- test_processed (pd.DataFrame) – The test set, with the categorical columns specified by the argument cat_columns replaced by the learned encodings from the training set. This is not returned if X_test is None.
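The regression case can be illustrated with the standard normal-inverse-gamma posterior update. This is a minimal sketch, not the package's implementation: the helper name `nig_posterior_means` is hypothetical, the argument names mirror the `prior_params` keys described above, and it is an assumption that the two encoding columns correspond to the posterior mean of the category mean and the posterior mean of the category variance.

```python
import pandas as pd

def nig_posterior_means(train_col, y, mu, vega, alpha, beta):
    """Hypothetical helper: per-category normal-inverse-gamma posterior means."""
    grouped = y.groupby(train_col)
    n = grouped.count()
    xbar = grouped.mean()
    # Sum of squared deviations from each category's own mean
    ss = grouped.apply(lambda v: ((v - v.mean()) ** 2).sum())

    # Standard conjugate update for a normal likelihood
    vega_n = vega + n
    mu_n = (vega * mu + n * xbar) / vega_n
    alpha_n = alpha + n / 2
    beta_n = beta + 0.5 * ss + vega * n * (xbar - mu) ** 2 / (2 * vega_n)

    # E[mean] = mu_n; E[variance] = beta_n / (alpha_n - 1), defined for alpha_n > 1
    return pd.DataFrame({"mean": mu_n, "variance": beta_n / (alpha_n - 1)})

# Toy data (made up for illustration)
demo_col = pd.Series(["a", "a", "b"])
demo_y = pd.Series([1.0, 3.0, 10.0])
table = nig_posterior_means(demo_col, demo_y, mu=0.0, vega=1.0, alpha=3.0, beta=1.0)
```

The expected variance is only finite when the posterior alpha exceeds 1, which is consistent with the advice above to choose alpha > 1.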
References
Slakey et al., “Encoding Categorical Variables with Conjugate Bayesian Models for WeWork Lead Scoring Engine”, 2019.
Examples
>>> encodings = conjugate_encoder(my_train, my_train['y'], cat_columns=['foo'], prior_params={'alpha': 3, 'beta': 3}, X_test=my_test, objective="binary")
>>> train_new = encodings[0]
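For the binary objective, the idea of a beta prior with a binomial likelihood can be sketched in a few lines of pandas. This is an illustration under stated assumptions rather than the package's code: the helper name `beta_binomial_encode` is hypothetical, the target is assumed to be coded as 0/1, and falling back to the prior mean for categories unseen in training is an assumption.

```python
import pandas as pd

def beta_binomial_encode(train_col, y, alpha, beta, test_col=None):
    """Hypothetical helper: encode one column by Beta posterior means."""
    # Successes (sum of 0/1 targets) and trials (count) per category
    stats = y.groupby(train_col).agg(["sum", "count"])
    # Posterior is Beta(alpha + s, beta + n - s); its mean is (alpha + s) / (alpha + beta + n)
    post_mean = (alpha + stats["sum"]) / (alpha + beta + stats["count"])
    train_enc = train_col.map(post_mean)
    if test_col is None:
        return train_enc
    # Assumption: categories unseen in training fall back to the prior mean
    prior_mean = alpha / (alpha + beta)
    test_enc = test_col.map(post_mean).fillna(prior_mean)
    return train_enc, test_enc

# Toy data (made up for illustration)
demo_train = pd.Series(["a", "a", "b", "b", "b"])
demo_y = pd.Series([1, 0, 1, 1, 1])
demo_test = pd.Series(["a", "c"])
train_enc, test_enc = beta_binomial_encode(demo_train, demo_y, alpha=3, beta=3, test_col=demo_test)
```

With a small category, the encoding stays close to the prior mean; as more observations accumulate, it moves toward the observed success rate.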
encoderpy.encoderpy module
encoderpy.frequency_encoder module
encoderpy.frequency_encoder.frequency_encoder(X_train, cat_columns, X_test=None, prior=0.5)
This function encodes categorical variables using the frequencies of each category.
Parameters:
- X_train (pd.DataFrame) – A pandas dataframe representing the training data set containing some categorical features/columns.
- X_test (pd.DataFrame) – A pandas dataframe representing the test set, containing some set of categorical features/columns. This is an optional argument.
- cat_columns (list) – The names of the categorical features in X_train and/or X_test.
- prior (float) – A number in [0, inf) that acts as pseudo counts when calculating the encodings. This prevents encodings of 0 when a category observed in the test set does not appear in the training set. A larger value gives less weight to what is observed in the training set. A value of 0 incorporates no prior information. The default value is 0.5.
Returns:
- train_processed (pd.DataFrame) – The training set, with the categorical columns specified by the argument cat_columns replaced by their encodings.
- test_processed (pd.DataFrame) – The test set, with the categorical columns specified by the argument cat_columns replaced by the learned encodings from the training set. This is not returned if X_test is None.
Examples
>>> encodings = frequency_encoder(my_train, cat_columns=['foo'], X_test=my_test)
>>> train_new = encodings[0]
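The role of the pseudo count can be illustrated with additive (Laplace-style) smoothing. The exact formula used by the package is not stated here, so this is a sketch under that assumption; the helper name `smoothed_frequency` is hypothetical.

```python
import pandas as pd

def smoothed_frequency(train_col, test_col, prior=0.5):
    """Hypothetical helper: category frequencies with additive smoothing."""
    # All categories seen in either set, so unseen test categories get a nonzero encoding
    categories = pd.Index(train_col.unique()).union(pd.Index(test_col.unique()))
    counts = train_col.value_counts().reindex(categories, fill_value=0)
    # Assumption: each category receives `prior` pseudo counts
    freq = (counts + prior) / (len(train_col) + prior * len(categories))
    return train_col.map(freq), test_col.map(freq), freq

# Toy data (made up for illustration)
demo_train = pd.Series(["a", "a", "b"])
demo_test = pd.Series(["a", "c"])
train_enc, test_enc, freq = smoothed_frequency(demo_train, demo_test, prior=0.5)
```

With `prior=0`, the category "c" would be encoded as 0 because it never appears in the training set; any positive `prior` gives it a small nonzero frequency.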
encoderpy.onehot_encoder module
encoderpy.onehot_encoder.onehot_encoder(X_train, cat_columns, X_test=None)
This function encodes categorical variables using the popular one-hot method for each category.
Parameters:
- X_train (pd.DataFrame) – A pandas dataframe representing the training data set containing some categorical features/columns.
- X_test (pd.DataFrame) – A pandas dataframe representing the test set, containing some set of categorical features/columns. This is an optional argument.
- cat_columns (list) – The names of the categorical features in X_train and/or X_test.
Returns:
- train_processed (pd.DataFrame) – The training set, with the categorical columns specified by the argument cat_columns replaced by their encodings.
- test_processed (pd.DataFrame) – The test set, with the categorical columns specified by the argument cat_columns replaced by the learned encodings from the training set. This is not returned if X_test is None.
Examples
>>> encodings = onehot_encoder(my_train, cat_columns=['foo'], X_test=my_test)
>>> train_new = encodings[0]
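For comparison, one-hot encoding with encodings learned from the training set can be sketched with plain pandas. This is not the package's implementation; it uses `pd.get_dummies`, and dropping test categories that never appear in training is an assumption.

```python
import pandas as pd

# Toy data (made up for illustration)
demo_train = pd.DataFrame({"foo": ["a", "b", "a"], "x": [1, 2, 3]})
demo_test = pd.DataFrame({"foo": ["b", "c"], "x": [4, 5]})

# Learn the indicator columns from the training set
train_enc = pd.get_dummies(demo_train, columns=["foo"], dtype=int)

# Align the test set to the training columns: indicator columns missing from
# the test set are filled with 0, and categories unseen in training are dropped
test_enc = pd.get_dummies(demo_test, columns=["foo"], dtype=int).reindex(
    columns=train_enc.columns, fill_value=0
)
```

Aligning the test columns to the training columns keeps both frames the same width, which downstream models generally require.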
encoderpy.target_encoder module
encoderpy.target_encoder.target_encoder(X_train, y, cat_columns, X_test=None, prior=0.5, objective='regression')
This function encodes categorical variables by the average target value of each category.
Parameters:
- X_train (pd.DataFrame) – A pandas dataframe representing the training data set containing some categorical features/columns.
- y (pd.Series) – A pandas series representing the target variable. If the objective is “binary”, then this series should only contain two unique values.
- cat_columns (list) – The names of the categorical features in X_train and/or X_test.
- prior (float) – A number in [0, inf) that acts as pseudo counts when calculating the encodings. This prevents encodings of 0 when a category observed in the test set does not appear in the training set. A larger value gives less weight to what is observed in the training set. A value of 0 incorporates no prior information. The default value is 0.5.
- X_test (pd.DataFrame) – A pandas dataframe representing the test set, containing some set of categorical features/columns. This is an optional argument.
- objective (str) – A string, either “regression” or “binary”, specifying the problem. The default is “regression”.
Returns:
- train_processed (pd.DataFrame) – The training set, with the categorical columns specified by the argument cat_columns replaced by their encodings.
- test_processed (pd.DataFrame) – The test set, with the categorical columns specified by the argument cat_columns replaced by the learned encodings from the training set. This is not returned if X_test is None.
Examples
>>> encodings = target_encoder(my_train, my_train['y'], cat_columns=['foo'], X_test=my_test, prior=0.5, objective='regression')
>>> train_new = encodings[0]
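One common way to combine per-category target averages with a pseudo-count prior is to shrink each category mean toward the global mean. The package's exact formula is not given here, so the following is a sketch under that assumption; the helper name `smoothed_target_mean` is hypothetical.

```python
import pandas as pd

def smoothed_target_mean(train_col, y, prior=0.5, test_col=None):
    """Hypothetical helper: per-category target means shrunk toward the global mean."""
    global_mean = y.mean()
    stats = y.groupby(train_col).agg(["sum", "count"])
    # Assumption: `prior` acts as that many pseudo observations at the global mean
    enc = (stats["sum"] + prior * global_mean) / (stats["count"] + prior)
    train_enc = train_col.map(enc)
    if test_col is None:
        return train_enc
    # Assumption: categories unseen in training are encoded as the global mean
    test_enc = test_col.map(enc).fillna(global_mean)
    return train_enc, test_enc

# Toy data (made up for illustration)
demo_train = pd.Series(["a", "a", "b"])
demo_y = pd.Series([1.0, 3.0, 8.0])
demo_test = pd.Series(["b", "c"])
train_enc, test_enc = smoothed_target_mean(demo_train, demo_y, prior=0.5, test_col=demo_test)
```

With `prior=0` the encoding reduces to the plain per-category mean; larger values pull every category toward the global mean, which matches the description that a larger prior gives less weight to the training observations.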