Module leapyear.feature

Feature engineering classes.

OneHotEncoder class

class leapyear.feature.OneHotEncoder(input_cols, max_size=32, drop_originals=True, drop_last=True)

One-hot encode attributes.

FACTOR and INT columns that can take fewer than max_size distinct values are converted to BOOL columns indicating the presence of each value. This works only with non-nullable columns; nullable columns can be converted to non-nullable with leapyear.dataset.Attribute.coalesce().

By default the last category is not encoded, since its indicator column is redundant; categories are sorted lexicographically based on their characters’ ASCII values.

Examples

  1. Using OneHotEncoder on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:

>>> ohe = OneHotEncoder(['col1', 'col2'], drop_originals=True, max_size=64)
>>> ds2 = ohe.transform(ds1)
Parameters
  • input_cols (Sequence[str]) – the names of the input columns.

  • max_size (int) – maximum number of values to one hot encode per column. (default: 32)

  • drop_originals (bool) – drop the original input columns after encoding. (default: True)

  • drop_last (bool) – drop the last column containing redundant information. (default: True)
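The drop_last behaviour can be sketched in plain Python. This is an illustrative sketch of the encoding rule only, not the library’s implementation; the one_hot_encode helper and its in-memory inputs are hypothetical:

```python
def one_hot_encode(values, categories, drop_last=True):
    """Sketch: one BOOL indicator per category, categories in ASCII order."""
    cats = sorted(categories)      # lexicographic / ASCII sort
    if drop_last:
        cats = cats[:-1]           # the last indicator is redundant
    # One BOOL "column" per remaining category, True where the value matches.
    return [{c: (v == c) for c in cats} for v in values]

rows = one_hot_encode(['red', 'blue', 'green'], {'blue', 'green', 'red'})
# 'red' (the dropped last category) is encoded implicitly as all-False.
```

With drop_last=True the ‘red’ indicator is omitted: a row where both remaining indicators are False must be ‘red’, which is why the last column carries no extra information.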

BoundsScaler class

class leapyear.feature.BoundsScaler(input_cols, lower=0.0, upper=1.0)

Scale the attributes by the bounds of the type.

BOOL, INT and REAL columns are scaled so all values fall between lower and upper (inclusive). In contrast to MinMaxScaler and StandardScaler, there is no privacy leakage when using this class, because the scaling uses the fixed type bounds rather than statistics computed from the data.

Examples

  1. Using BoundsScaler on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:

>>> bs = BoundsScaler(['col1','col2'])
>>> ds2 = bs.fit_transform(ds1)
Parameters
  • input_cols (Sequence[str]) – the names of the input columns.

  • lower (float) – attributes are scaled to this lower bound. (default: 0.0)

  • upper (float) – attributes are scaled to this upper bound. (default: 1.0)
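The arithmetic amounts to a linear map from the attribute’s type bounds onto [lower, upper]. The sketch below is illustrative only; bounds_scale and the literal type bounds are hypothetical, not leapyear API:

```python
def bounds_scale(x, type_lo, type_hi, lower=0.0, upper=1.0):
    """Map x linearly from the fixed type bounds [type_lo, type_hi]
    onto [lower, upper]."""
    return lower + (x - type_lo) * (upper - lower) / (type_hi - type_lo)

# An INT attribute whose declared type bounds are [0, 100]:
scaled = [bounds_scale(x, 0, 100) for x in (0, 50, 100)]  # [0.0, 0.5, 1.0]
```

Because type_lo and type_hi come from the column’s type metadata rather than from the data, computing this map reveals nothing about the data itself.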

BoundsAbsScaler class

class leapyear.feature.BoundsAbsScaler(input_cols)

Scale the attributes by the max absolute value of the type bounds.

INT and REAL columns are scaled so all values fall between -1 and 1, with no shifting of the data.

Examples

  1. Using BoundsAbsScaler on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:

>>> bs = BoundsAbsScaler(['col1','col2'])
>>> ds2 = bs.fit_transform(ds1)
Parameters

input_cols (Sequence[str]) – the names of the input columns.
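The scaling rule can be sketched as division by the largest absolute type bound, with no shift. An illustrative sketch; bounds_abs_scale and the literal bounds are hypothetical:

```python
def bounds_abs_scale(x, type_lo, type_hi):
    """Divide by the largest absolute type bound; no shift, so the
    sign is preserved and zero stays at zero."""
    return x / max(abs(type_lo), abs(type_hi))

# An INT attribute with declared type bounds [-50, 100]:
scaled = [bounds_abs_scale(x, -50, 100) for x in (-50, 0, 100)]  # [-0.5, 0.0, 1.0]
```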

MinMaxScaler class

class leapyear.feature.MinMaxScaler(input_cols, min_=0.0, max_=1.0)

Scale the attributes by the min and max of the attribute.

BOOL, INT and REAL columns are scaled so all values fall between min and max (inclusive).

Examples

  1. Using MinMaxScaler on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:

>>> ms = MinMaxScaler(['col1','col2'], min_=0.0, max_=1.0)
>>> ds2 = ms.fit_transform(ds1)
Parameters
  • input_cols (Sequence[str]) – the names of the input columns.

  • min_ (float) – attributes are scaled to this lower bound. (default: 0.0)

  • max_ (float) – attributes are scaled to this upper bound. (default: 1.0)
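The difference from BoundsScaler is that the scaling endpoints here are the observed data minimum and maximum, i.e. statistics of the data itself. A minimal sketch on an in-memory list; min_max_scale is hypothetical, not leapyear API:

```python
def min_max_scale(xs, min_=0.0, max_=1.0):
    """Scale using the observed data minimum and maximum."""
    lo, hi = min(xs), max(xs)
    return [min_ + (x - lo) * (max_ - min_) / (hi - lo) for x in xs]

scaled = min_max_scale([10, 20, 30])  # [0.0, 0.5, 1.0]
```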

MaxAbsScaler class

class leapyear.feature.MaxAbsScaler(input_cols)

Scale the attributes by the max absolute value of the min and the max.

INT and REAL columns are divided by the larger of the absolute values of the min and the max, so all values fall between -1 and 1, with no shifting of the data.

Examples

  1. Using MaxAbsScaler on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:

>>> mas = MaxAbsScaler(['col1','col2'])
>>> ds2 = mas.fit_transform(ds1)
Parameters

input_cols (Sequence[str]) – the names of the input columns.
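The rule mirrors BoundsAbsScaler, but with the divisor computed from the observed min and max rather than the type bounds. An illustrative sketch; max_abs_scale is hypothetical:

```python
def max_abs_scale(xs):
    """Divide by the largest absolute observed value; no shifting,
    so zero is preserved and results fall in [-1, 1]."""
    m = max(abs(min(xs)), abs(max(xs)))
    return [x / m for x in xs]

scaled = max_abs_scale([-2, 1, 4])  # [-0.5, 0.25, 1.0]
```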

StandardScaler class

class leapyear.feature.StandardScaler(input_cols, with_mean=True, with_stdev=True)

Scale the attributes to be centered at zero with unit variance.

INT and REAL columns have the mean removed and are scaled to unit variance.

Examples

  1. Using StandardScaler on columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:

>>> ss = StandardScaler(['col1','col2'], with_mean=True, with_stdev=False)
>>> ds2 = ss.fit_transform(ds1)
Parameters
  • input_cols (Sequence[str]) – the names of the input columns.

  • with_mean (bool) – Remove the mean from the attributes.

  • with_stdev (bool) – Scale the attribute to have unit standard deviation.
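The underlying arithmetic can be sketched in a few lines. This is an illustrative sketch; whether leapyear uses the population or sample standard deviation is not stated here, and the sketch assumes the population form:

```python
import statistics

def standard_scale(xs, with_mean=True, with_stdev=True):
    """Optionally remove the mean, then optionally divide by the
    standard deviation (population stdev assumed)."""
    mu = statistics.mean(xs) if with_mean else 0.0
    sd = statistics.pstdev(xs) if with_stdev else 1.0
    return [(x - mu) / sd for x in xs]

scaled = standard_scale([1.0, 2.0, 3.0])  # zero mean, unit stdev
```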

ScaleTransformModel class

class leapyear.feature.ScaleTransformModel(attr_lower_upper, lower, upper, scale_bool)

Scale attributes to new values.

Shift and scale each attribute so that the two bound values given for it in attr_lower_upper are mapped to lower and upper.

Normalizer class

class leapyear.feature.Normalizer(input_cols, p, suffix=<factory>)

Compute the p-norm of attributes and normalize by norm value.

Will only work on INT and REAL columns.

Examples

  1. Using Normalizer on columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:

>>> norm_trn = Normalizer(['col1','col2'], p=2)
>>> ds2 = norm_trn.fit_transform(ds1)
Parameters
  • input_cols (Sequence[str]) – the names of the input columns to normalize.

  • p (int) – the order p of the norm.

  • suffix (str) – Suffix appended to the name of the transformed attribute. Defaults to ‘_NORM’ for Snowflake and ‘_norm’ for Spark.

fit_transform(dataset, **kwargs)

Normalize a set of attributes.

Computes the p-norm.

Return type

DataSet
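The arithmetic is ordinary p-norm normalization, sketched below on an in-memory list. An illustrative sketch only; whether the library takes the norm per row across input_cols or per column is not shown here:

```python
def p_normalize(xs, p=2):
    """Divide each value by the p-norm of the vector:
    norm = (sum(|x|^p))^(1/p)."""
    norm = sum(abs(x) ** p for x in xs) ** (1.0 / p)
    return [x / norm for x in xs]

unit = p_normalize([3.0, 4.0], p=2)  # [0.6, 0.8]
```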

Winsorizer class

class leapyear.feature.Winsorizer(input_col, lo_val, hi_val, suffix=<factory>)

Bound the non-null values of the attribute to be within lo_val and hi_val.

Will only work on non-nullable INT and REAL columns. Nullable columns can be converted to non-nullable with leapyear.dataset.Attribute.coalesce().

When transformed, returns the DataSet with an additional attribute that is winsorized between the given low and high values for the specified attribute.

Examples

  1. Using Winsorizer on column ‘col1’ in Dataset ‘ds1’:

>>> wins = Winsorizer('col1', lo_val=0, hi_val=1)
>>> ds2 = wins.fit_transform(ds1)
Parameters
  • input_col (str) – the name of the input column.

  • lo_val (float) – after transformation, the attribute will be greater than or equal to lo_val.

  • hi_val (float) – after transformation, the attribute will be less than or equal to hi_val.

  • suffix (str) – Suffix appended to the name of the transformed attribute. Defaults to ‘_WIN’ for Snowflake and ‘_win’ for Spark.

fit_transform(dataset, **kwargs)

Winsorize the specified attribute.

Called after lo_val and hi_val are provided.

Return type

DataSet
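Winsorization of a single value is just a clamp into [lo_val, hi_val], which can be sketched as follows (an illustrative sketch on plain floats, not the library’s implementation):

```python
def winsorize(x, lo_val, hi_val):
    """Clamp a single non-null value into [lo_val, hi_val]."""
    return max(lo_val, min(hi_val, x))

clamped = [winsorize(x, 0.0, 1.0) for x in (-0.5, 0.3, 2.0)]  # [0.0, 0.3, 1.0]
```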

Bucketizer class

class leapyear.feature.Bucketizer(input_col, split_vals)

Quantize the attribute according to thresholds specified in split_vals.

attr < split_vals[0] -> bin 0
attr >= split_vals[0] and attr < split_vals[1] -> bin 1
...
attr >= split_vals[-1] -> bin len(split_vals)

Works with INT and REAL columns.

Examples

  1. Using Bucketizer on column ‘col1’ in Dataset ‘ds1’:

>>> split_vals = [0, 0.25, 0.75]
>>> buck = Bucketizer('col1', split_vals=split_vals)
>>> ds2 = buck.fit_transform(ds1)
Parameters
  • input_col (str) – the name of the input column.

  • split_vals (Sequence[float]) – thresholds for creating the bins.

fit_transform(dataset, **kwargs)

For testing use only.

Return type

DataSet
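The threshold rules above amount to counting how many split values lie at or below the attribute, which bisect.bisect_right computes directly. An illustrative sketch, not the library’s implementation:

```python
import bisect

def bucketize(x, split_vals):
    """Bin index per the rules above: bisect_right returns the number
    of thresholds <= x, so attr >= split_vals[i] lands in bin i + 1."""
    return bisect.bisect_right(split_vals, x)

bins = [bucketize(x, [0, 0.25, 0.75]) for x in (-1, 0, 0.5, 0.9)]  # [0, 1, 2, 3]
```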