Module leapyear.feature

Feature engineering classes.

OneHotEncoder class

class leapyear.feature.OneHotEncoder(input_cols, max_size=32, drop_originals=True, drop_last=True)

One-hot encode attributes.

FACTOR and INT columns that can take fewer than max_size distinct values are converted to BOOL columns indicating the presence of each value. This works only with non-nullable columns; nullable columns can be converted to non-nullable with leapyear.dataset.Attribute.coalesce().

By default the last category is not encoded, since its indicator column is redundant; categories are sorted lexicographically based on their characters’ ASCII values.

Examples

  1. Using OneHotEncoder on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:

>>> ohe = OneHotEncoder(['col1', 'col2'], drop_originals=True, max_size=64)
>>> ds2 = ohe.transform(ds1)
Parameters
  • input_cols (Sequence[str]) – the names of the input columns.

  • max_size (int) – maximum number of values to one hot encode per column. (default: 32)

  • drop_originals (bool) – drop the original input columns after encoding. (default: True)

  • drop_last (bool) – drop the last column containing redundant information. (default: True)
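The drop_last behaviour can be sketched in plain Python. This is an illustrative sketch of the encoding rule only, not the library’s implementation; the one_hot_encode helper and its in-memory inputs are hypothetical:

```python
def one_hot_encode(values, categories, drop_last=True):
    """Sketch: one BOOL indicator per category, categories in ASCII order."""
    cats = sorted(categories)      # lexicographic / ASCII sort
    if drop_last:
        cats = cats[:-1]           # the last indicator is redundant
    # One BOOL "column" per remaining category, True where the value matches.
    return [{c: (v == c) for c in cats} for v in values]

rows = one_hot_encode(['red', 'blue', 'green'], {'blue', 'green', 'red'})
# 'red' (the dropped last category) is encoded implicitly as all-False.
```

With drop_last=True the ‘red’ indicator is omitted: a row where both remaining indicators are False must be ‘red’, which is why the last column carries no extra information.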

BoundsScaler class

class leapyear.feature.BoundsScaler(input_cols, lower=0.0, upper=1.0)

Scale the attributes by the bounds of the type.

BOOL, INT and REAL columns are scaled so all values fall between lower and upper (inclusive). In contrast to MinMaxScaler and StandardScaler, there is no privacy leakage when using this class, because the scaling uses the fixed type bounds rather than statistics computed from the data.

Examples

  1. Using BoundsScaler on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:

>>> bs = BoundsScaler(['col1','col2'])
>>> ds2 = bs.fit_transform(ds1)
Parameters
  • input_cols (Sequence[str]) – the names of the input columns.

  • lower (float) – attributes are scaled to this lower bound. (default: 0.0)

  • upper (float) – attributes are scaled to this upper bound. (default: 1.0)
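The arithmetic amounts to a linear map from the attribute’s type bounds onto [lower, upper]. The sketch below is illustrative only; bounds_scale and the literal type bounds are hypothetical, not leapyear API:

```python
def bounds_scale(x, type_lo, type_hi, lower=0.0, upper=1.0):
    """Map x linearly from the fixed type bounds [type_lo, type_hi]
    onto [lower, upper]."""
    return lower + (x - type_lo) * (upper - lower) / (type_hi - type_lo)

# An INT attribute whose declared type bounds are [0, 100]:
scaled = [bounds_scale(x, 0, 100) for x in (0, 50, 100)]  # [0.0, 0.5, 1.0]
```

Because type_lo and type_hi come from the column’s type metadata rather than from the data, computing this map reveals nothing about the data itself.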

BoundsAbsScaler class

class leapyear.feature.BoundsAbsScaler(input_cols)

Scale the attributes by the max absolute value of the type bounds.

INT and REAL columns are scaled so all values fall between -1 and 1, with no shifting of the data.

Examples

  1. Using BoundsAbsScaler on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:

>>> bs = BoundsAbsScaler(['col1','col2'])
>>> ds2 = bs.fit_transform(ds1)
Parameters

input_cols (Sequence[str]) – the names of the input columns.
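The scaling rule can be sketched as division by the largest absolute type bound, with no shift. An illustrative sketch; bounds_abs_scale and the literal bounds are hypothetical:

```python
def bounds_abs_scale(x, type_lo, type_hi):
    """Divide by the largest absolute type bound; no shift, so the
    sign is preserved and zero stays at zero."""
    return x / max(abs(type_lo), abs(type_hi))

# An INT attribute with declared type bounds [-50, 100]:
scaled = [bounds_abs_scale(x, -50, 100) for x in (-50, 0, 100)]  # [-0.5, 0.0, 1.0]
```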

MinMaxScaler class

class leapyear.feature.MinMaxScaler(input_cols, min_=0.0, max_=1.0)

Scale the attributes by the min and max of the attribute.

BOOL, INT and REAL columns are scaled so all values fall between min and max (inclusive).

Examples

  1. Using MinMaxScaler on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:

>>> ms = MinMaxScaler(['col1','col2'], min_=0.0, max_=1.0)
>>> ds2 = ms.fit_transform(ds1)
Parameters
  • input_cols (Sequence[str]) – the names of the input columns.

  • min_ (float) – attributes are scaled to this lower bound. (default: 0.0)

  • max_ (float) – attributes are scaled to this upper bound. (default: 1.0)
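The difference from BoundsScaler is that the scaling endpoints here are the observed data minimum and maximum, i.e. statistics of the data itself. A minimal sketch on an in-memory list; min_max_scale is hypothetical, not leapyear API:

```python
def min_max_scale(xs, min_=0.0, max_=1.0):
    """Scale using the observed data minimum and maximum."""
    lo, hi = min(xs), max(xs)
    return [min_ + (x - lo) * (max_ - min_) / (hi - lo) for x in xs]

scaled = min_max_scale([10, 20, 30])  # [0.0, 0.5, 1.0]
```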

MaxAbsScaler class

class leapyear.feature.MaxAbsScaler(input_cols)

Scale the attributes by the max absolute value of the min and the max.

INT and REAL columns are divided by the larger of the absolute values of the min and the max, so all values fall between -1 and 1, with no shifting of the data.

Examples

  1. Using MaxAbsScaler on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:

>>> mas = MaxAbsScaler(['col1','col2'])
>>> ds2 = mas.fit_transform(ds1)
Parameters

input_cols (Sequence[str]) – the names of the input columns.
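The rule mirrors BoundsAbsScaler, but with the divisor computed from the observed min and max rather than the type bounds. An illustrative sketch; max_abs_scale is hypothetical:

```python
def max_abs_scale(xs):
    """Divide by the largest absolute observed value; no shifting,
    so zero is preserved and results fall in [-1, 1]."""
    m = max(abs(min(xs)), abs(max(xs)))
    return [x / m for x in xs]

scaled = max_abs_scale([-2, 1, 4])  # [-0.5, 0.25, 1.0]
```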

StandardScaler class

class leapyear.feature.StandardScaler(input_cols, with_mean=True, with_stdev=True)

Scale the attributes to be centered at zero with unit variance.

INT and REAL columns have the mean removed and are scaled to unit variance.

Examples

  1. Using StandardScaler on columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:

>>> ss = StandardScaler(['col1','col2'], with_mean=True, with_stdev=False)
>>> ds2 = ss.fit_transform(ds1)
Parameters
  • input_cols (Sequence[str]) – the names of the input columns.

  • with_mean (bool) – Remove the mean from the attributes.

  • with_stdev (bool) – Scale the attribute to have unit standard deviation.
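The underlying arithmetic can be sketched in a few lines. This is an illustrative sketch; whether leapyear uses the population or sample standard deviation is not stated here, and the sketch assumes the population form:

```python
import statistics

def standard_scale(xs, with_mean=True, with_stdev=True):
    """Optionally remove the mean, then optionally divide by the
    standard deviation (population stdev assumed)."""
    mu = statistics.mean(xs) if with_mean else 0.0
    sd = statistics.pstdev(xs) if with_stdev else 1.0
    return [(x - mu) / sd for x in xs]

scaled = standard_scale([1.0, 2.0, 3.0])  # zero mean, unit stdev
```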

ScaleTransformModel class

class leapyear.feature.ScaleTransformModel(attr_lower_upper, lower, upper, scale_bool)

Scale attributes to new values.

Shift and scale each attribute so that the two bound values given for it in attr_lower_upper are mapped to lower and upper.

Normalizer class

class leapyear.feature.Normalizer(input_cols, p, suffix=<factory>)

Compute the p-norm of attributes and normalize by norm value.

Will only work on INT and REAL columns.

Examples

  1. Using Normalizer on columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:

>>> norm_trn = Normalizer(['col1','col2'], p=2)
>>> ds2 = norm_trn.fit_transform(ds1)
Parameters
  • input_cols (Sequence[str]) – the names of the input columns to normalize.

  • p (int) – the order p of the norm.

  • suffix (str) – Suffix appended to the name of the transformed attribute. Defaults to ‘_NORM’ for Snowflake and ‘_norm’ for Spark.

fit_transform(dataset, **kwargs)

Normalize a set of attributes.

Computes the p-norm.

Return type

DataSet
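The arithmetic is ordinary p-norm normalization, sketched below on an in-memory list. An illustrative sketch only; whether the library takes the norm per row across input_cols or per column is not shown here:

```python
def p_normalize(xs, p=2):
    """Divide each value by the p-norm of the vector:
    norm = (sum(|x|^p))^(1/p)."""
    norm = sum(abs(x) ** p for x in xs) ** (1.0 / p)
    return [x / norm for x in xs]

unit = p_normalize([3.0, 4.0], p=2)  # [0.6, 0.8]
```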

Winsorizer class

class leapyear.feature.Winsorizer(input_col, lo_val, hi_val, suffix=<factory>)

Bound the non-null values of the attribute to be within lo_val and hi_val.

Will only work on non-nullable INT and REAL columns. Nullable columns can be converted to non-nullable with leapyear.dataset.Attribute.coalesce().

When transformed, returns the DataSet with an additional attribute that is winsorized between the given low and high values for the specified attribute.

Examples

  1. Using Winsorizer on column ‘col1’ in Dataset ‘ds1’:

>>> wins = Winsorizer('col1', lo_val=0, hi_val=1)
>>> ds2 = wins.fit_transform(ds1)
Parameters
  • input_col (str) – the name of the input column.

  • lo_val (float) – after transformation, the attribute will be greater than or equal to lo_val.

  • hi_val (float) – after transformation, the attribute will be less than or equal to hi_val.

  • suffix (str) – Suffix appended to the name of the transformed attribute. Defaults to ‘_WIN’ for Snowflake and ‘_win’ for Spark.

fit_transform(dataset, **kwargs)

Winsorize the specified attribute.

Called after lo_val and hi_val are provided.

Return type

DataSet
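Winsorization of a single value is just a clamp into [lo_val, hi_val], which can be sketched as follows (an illustrative sketch on plain floats, not the library’s implementation):

```python
def winsorize(x, lo_val, hi_val):
    """Clamp a single non-null value into [lo_val, hi_val]."""
    return max(lo_val, min(hi_val, x))

clamped = [winsorize(x, 0.0, 1.0) for x in (-0.5, 0.3, 2.0)]  # [0.0, 0.3, 1.0]
```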

Bucketizer class

class leapyear.feature.Bucketizer(input_col, split_vals)

Quantize the attribute according to thresholds specified in split_vals.

attr < split_vals[0] -> bin 0
attr >= split_vals[0] and attr < split_vals[1] -> bin 1
...
attr >= split_vals[-1] -> bin len(split_vals)

Works with INT and REAL columns.

Examples

  1. Using Bucketizer on column ‘col1’ in Dataset ‘ds1’:

>>> split_vals = [0, 0.25, 0.75]
>>> buck = Bucketizer('col1', split_vals=split_vals)
>>> ds2 = buck.fit_transform(ds1)
Parameters
  • input_col (str) – the name of the input column.

  • split_vals (Sequence[float]) – thresholds for creating the bins.

fit_transform(dataset, **kwargs)

For testing use only.

Return type

DataSet
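The threshold rules above amount to counting how many split values lie at or below the attribute, which bisect.bisect_right computes directly. An illustrative sketch, not the library’s implementation:

```python
import bisect

def bucketize(x, split_vals):
    """Bin index per the rules above: bisect_right returns the number
    of thresholds <= x, so attr >= split_vals[i] lands in bin i + 1."""
    return bisect.bisect_right(split_vals, x)

bins = [bucketize(x, [0, 0.25, 0.75]) for x in (-1, 0, 0.5, 0.9)]  # [0, 1, 2, 3]
```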