encoderpy package

Submodules

encoderpy.conjugate_encoder module

encoderpy.conjugate_encoder.conjugate_encoder(X_train, y, cat_columns, prior_params, X_test=None, objective='regression')

This function encodes categorical variables by fitting a posterior distribution to the target variable y for each category, using a known conjugate prior. The mean(s) of each category's posterior distribution are used as the encodings.

Parameters:
  • X_train (pd.DataFrame) – A pandas dataframe representing the training data set containing some categorical features/columns.
  • y (pd.Series) – A pandas series representing the target variable. If the objective is “binary”, then this series should only contain two unique values.
  • cat_columns (list) – The names of the categorical features in X_train and/or X_test.
  • prior_params (dict) – A dictionary of parameters for the assumed prior distribution. For regression, the dictionary requires four keys: mu, vega, alpha, beta. All values must be real numbers greater than 0, except mu, which can be negative. A value of alpha > 1 is strongly advised. For binary classification, the dictionary requires two keys: alpha, beta. Both values must be real numbers greater than 0.
  • X_test (pd.DataFrame) – A pandas dataframe representing the test set, containing some set of categorical features/columns. This is an optional argument.
  • objective (str) – A string, either “regression” or “binary”, specifying the problem. Default is “regression”. For regression, a normal-inverse-gamma prior with a normal likelihood is assumed. For binary classification, a beta prior with a binomial likelihood is assumed.
Returns:

  • train_processed (pd.DataFrame) – The training set, with the categorical columns specified by the argument cat_columns replaced by their encodings. For regression, each categorical column is replaced by 2 columns, since the normal-inverse-gamma distribution is two dimensional. For binary classification, each categorical column is replaced by 1 column.
  • test_processed (pd.DataFrame) – The test set, with the categorical columns specified by the argument cat_columns replaced by the learned encodings from the training set. This is not returned if X_test is None.
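
The regression case can be illustrated with the standard normal-inverse-gamma posterior update. The sketch below is only an illustration under stated assumptions: the helper name nig_posterior_means and the output column suffixes are hypothetical, and the package's own implementation may differ in details. It shows why two columns result per categorical feature (one posterior mean for the category mean, one for the variance), and why alpha > 1 is advisable, since the variance's posterior mean divides by (alpha_n - 1).

```python
import pandas as pd


def nig_posterior_means(X_train, y, column, mu, vega, alpha, beta):
    """Per-category posterior means under a normal-inverse-gamma prior.

    Hypothetical helper for illustration; not encoderpy's implementation.
    """
    key = X_train[column]
    stats = y.groupby(key).agg(["count", "mean"])
    n, ybar = stats["count"], stats["mean"]
    # Sum of squared deviations of y from its own category mean.
    ss = ((y - y.groupby(key).transform("mean")) ** 2).groupby(key).sum()

    # Standard conjugate update for a normal likelihood.
    mu_n = (vega * mu + n * ybar) / (vega + n)
    alpha_n = alpha + n / 2
    beta_n = beta + 0.5 * ss + (n * vega / (vega + n)) * (ybar - mu) ** 2 / 2

    return pd.DataFrame({
        column + "_mean": mu_n,                   # posterior mean of the mean
        column + "_var": beta_n / (alpha_n - 1),  # posterior mean of the variance
    })


# Toy data, invented for this example.
X_demo = pd.DataFrame({"foo": ["a", "a", "b"]})
y_demo = pd.Series([1.0, 3.0, 10.0])
nig_enc = nig_posterior_means(X_demo, y_demo, "foo", mu=0, vega=1, alpha=3, beta=1)
```

Each row of nig_enc holds the two encodings for one category; joining it back onto the data by the category value would yield the two replacement columns.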

References

Slakey et al., “Encoding Categorical Variables with Conjugate Bayesian Models for WeWork Lead Scoring Engine”, 2019.

Examples

>>> encodings = conjugate_encoder(
...     my_train,
...     my_train['y'],
...     cat_columns=['foo'],
...     prior_params={'alpha': 3, 'beta': 3},
...     X_test=my_test,
...     objective="binary")
>>> train_new = encodings[0]
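
For the binary objective, the posterior mean of a beta prior with a binomial likelihood has a simple closed form, (alpha + successes) / (alpha + beta + trials). The following is a minimal sketch of that calculation, assuming y is coded as 0/1; the helper name beta_posterior_mean is hypothetical and is not part of encoderpy. Mapping unseen test categories to the prior mean alpha / (alpha + beta) is also an assumption made here for illustration.

```python
import pandas as pd


def beta_posterior_mean(X_train, y, column, alpha, beta):
    """Posterior mean of the success probability for each category.

    Hypothetical helper; assumes y contains only 0 and 1.
    """
    grouped = y.groupby(X_train[column])
    successes = grouped.sum()   # number of 1s per category
    trials = grouped.count()    # number of rows per category
    return (alpha + successes) / (alpha + beta + trials)


# Toy data, invented for this example.
X_bin = pd.DataFrame({"foo": ["a", "a", "a", "b"]})
y_bin = pd.Series([1, 1, 0, 0])
beta_enc = beta_posterior_mean(X_bin, y_bin, "foo", alpha=3, beta=3)

# Encode the training column; a category never seen in training falls
# back to the prior mean (an assumption of this sketch).
prior_mean = 3 / (3 + 3)
train_encoded_bin = X_bin["foo"].map(beta_enc)
test_encoded_bin = pd.Series(["a", "z"]).map(beta_enc).fillna(prior_mean)
```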

encoderpy.encoderpy module

encoderpy.frequency_encoder module

encoderpy.frequency_encoder.frequency_encoder(X_train, cat_columns, X_test=None, prior=0.5)

This function encodes categorical variables using the frequencies of each category.

Parameters:
  • X_train (pd.DataFrame) – A pandas dataframe representing the training data set containing some categorical features/columns.
  • X_test (pd.DataFrame) – A pandas dataframe representing the test set, containing some set of categorical features/columns. This is an optional argument.
  • cat_columns (list) – The names of the categorical features in X_train and/or X_test.
  • prior (float) – A number in [0, inf) that acts as a pseudo count when calculating the encodings. It prevents encodings of 0 for categories that appear in the test set but were not observed in the training set. A larger value gives less weight to what is observed in the training set. A value of 0 incorporates no prior information. The default value is 0.5.
Returns:

  • train_processed (pd.DataFrame) – The training set, with the categorical columns specified by the argument cat_columns replaced by their encodings.
  • test_processed (pd.DataFrame) – The test set, with the categorical columns specified by the argument cat_columns replaced by the learned encodings from the training set. This is not returned if X_test is None.

Examples

>>> encodings = frequency_encoder(
...     my_train,
...     cat_columns=['foo'],
...     X_test=my_test)
>>> train_new = encodings[0]
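
How the prior enters the frequency calculation is not spelled out here. One plausible reading is additive smoothing, where the pseudo count is added to every category's count before normalizing. The sketch below follows that assumption; the helper name smoothed_frequencies and the exact formula are illustrative and may not match the package's implementation.

```python
import pandas as pd


def smoothed_frequencies(train_col, prior=0.5):
    """Additively smoothed category frequencies (hypothetical helper).

    Returns the per-category encodings and the value used for
    categories that were never observed in the training column.
    """
    counts = train_col.value_counts()
    n = counts.sum()   # number of training rows
    k = len(counts)    # number of distinct training categories
    freqs = (counts + prior) / (n + prior * k)
    unseen = prior / (n + prior * k)
    return freqs, unseen


# Toy data, invented for this example.
freq_train = pd.Series(["a", "a", "b"], name="foo")
freq_test = pd.Series(["a", "c"], name="foo")

freqs, unseen = smoothed_frequencies(freq_train, prior=0.5)
freq_train_encoded = freq_train.map(freqs)
# "c" never appears in training, so it receives the smoothed value.
freq_test_encoded = freq_test.map(freqs).fillna(unseen)
```

Under this assumption, a prior of 0 reduces the encodings to plain relative frequencies and gives unseen categories an encoding of 0, which is the situation the prior argument is meant to avoid.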

encoderpy.onehot_encoder module

encoderpy.onehot_encoder.onehot_encoder(X_train, cat_columns, X_test=None)

This function encodes categorical variables using one-hot encoding, which creates one indicator column per category.

Parameters:
  • X_train (pd.DataFrame) – A pandas dataframe representing the training data set containing some categorical features/columns.
  • X_test (pd.DataFrame) – A pandas dataframe representing the test set, containing some set of categorical features/columns. This is an optional argument.
  • cat_columns (list) – The names of the categorical features in X_train and/or X_test.
Returns:

  • train_processed (pd.DataFrame) – The training set, with the categorical columns specified by the argument cat_columns replaced by their encodings.
  • test_processed (pd.DataFrame) – The test set, with the categorical columns specified by the argument cat_columns replaced by the learned encodings from the training set. This is not returned if X_test is None.

Examples

>>> encodings = onehot_encoder(
...     my_train,
...     cat_columns=['foo'],
...     X_test=my_test)
>>> train_new = encodings[0]
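
The idea of applying encodings learned from the training set to the test set can be sketched with plain pandas. This is not the package's implementation; it assumes that a test category unseen in training is encoded as all zeros, and that indicator columns are named by pandas' default prefix convention.

```python
import pandas as pd

# Toy data, invented for this example.
oh_train = pd.DataFrame({"foo": ["a", "b", "a"], "x": [1, 2, 3]})
oh_test = pd.DataFrame({"foo": ["b", "c"], "x": [4, 5]})

# One indicator column per category observed in the training set.
oh_train_encoded = pd.get_dummies(oh_train, columns=["foo"], dtype=int)

# Encode the test set, then align it to the training columns:
# categories missing from the test set become all-zero columns, and
# categories unseen in training (here "c") are dropped.
oh_test_encoded = pd.get_dummies(oh_test, columns=["foo"], dtype=int).reindex(
    columns=oh_train_encoded.columns, fill_value=0
)
```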

encoderpy.target_encoder module

encoderpy.target_encoder.target_encoder(X_train, y, cat_columns, X_test=None, prior=0.5, objective='regression')

This function encodes categorical variables with average target values for each category.

Parameters:
  • X_train (pd.DataFrame) – A pandas dataframe representing the training data set containing some categorical features/columns.
  • y (pd.Series) – A pandas series representing the target variable. If the objective is “binary”, then this series should only contain two unique values.
  • cat_columns (list) – The names of the categorical features in X_train and/or X_test.
  • prior (float) – A number in [0, inf) that acts as a pseudo count when calculating the encodings. It prevents encodings of 0 for categories that appear in the test set but were not observed in the training set. A larger value gives less weight to what is observed in the training set. A value of 0 incorporates no prior information. The default value is 0.5.
  • X_test (pd.DataFrame) – A pandas dataframe representing the test set, containing some set of categorical features/columns. This is an optional argument.
  • objective (str) – A string, either “regression” or “binary”, specifying the problem. Default is “regression”.
Returns:

  • train_processed (pd.DataFrame) – The training set, with the categorical columns specified by the argument cat_columns replaced by their encodings.
  • test_processed (pd.DataFrame) – The test set, with the categorical columns specified by the argument cat_columns replaced by the learned encodings from the training set. This is not returned if X_test is None.

Examples

>>> encodings = target_encoder(
...     my_train,
...     my_train['y'],
...     cat_columns=['foo'],
...     X_test=my_test,
...     prior=0.5,
...     objective='regression')
>>> train_new = encodings[0]
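
A common way to combine per-category target averages with a pseudo count is to shrink each category mean toward the global mean of y. The sketch below assumes that formulation; the helper name smoothed_target_means, the shrinkage formula, and the use of the global mean for unseen categories are all assumptions for illustration, not a description of the package's internals.

```python
import pandas as pd


def smoothed_target_means(train_col, y, prior=0.5):
    """Category means of y shrunk toward the global mean (hypothetical helper)."""
    global_mean = y.mean()
    grouped = y.groupby(train_col)
    # `prior` acts as that many pseudo observations at the global mean.
    means = (grouped.sum() + prior * global_mean) / (grouped.count() + prior)
    return means, global_mean


# Toy data, invented for this example.
te_train = pd.Series(["a", "a", "b"], name="foo")
te_y = pd.Series([1.0, 3.0, 8.0])
te_test = pd.Series(["b", "z"], name="foo")

te_means, te_global = smoothed_target_means(te_train, te_y, prior=0.5)
te_train_encoded = te_train.map(te_means)
# "z" was never seen in training, so it falls back to the global mean.
te_test_encoded = te_test.map(te_means).fillna(te_global)
```

With a larger prior, each category's encoding moves closer to the global mean, which matches the description that a larger value gives less weight to what is observed in the training set.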

Module contents