Module leapyear.feature¶
Feature engineering classes.
OneHotEncoder class¶
class leapyear.feature.OneHotEncoder(input_cols, max_size=32, drop_originals=True, drop_last=True)¶
One-hot encode attributes.
FACTOR and INT columns with fewer than max_size distinct values are converted to BOOL columns indicating the presence of each value. Works only with non-nullable columns; nullable columns can be converted to non-nullable with leapyear.dataset.Attribute.coalesce().
The last category is not included by default, where categories are sorted lexicographically based on their characters’ ASCII values.
Examples
Using OneHotEncoder on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:
>>> ohe = OneHotEncoder(['col1', 'col2'], drop_originals=True, max_size=64)
>>> ds2 = ohe.transform(ds1)
- Parameters
input_cols (Sequence[str]) – the names of the input columns.
max_size (int) – maximum number of values to one hot encode per column. (default: 32)
drop_originals (bool) – drop the original input columns after encoding. (default: True)
drop_last (bool) – drop the last column containing redundant information. (default: True)
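The encoding rule can be sketched in plain Python (illustrative only; one_hot is a hypothetical helper, not part of the leapyear API):

```python
def one_hot(values, drop_last=True):
    """Turn each distinct value into a boolean indicator column."""
    categories = sorted(set(values))      # lexicographic (ASCII) order
    if drop_last:
        categories = categories[:-1]      # last category is implied by the others
    return {c: [v == c for v in values] for c in categories}

encoded = one_hot(["red", "blue", "red", "green"])
# indicator columns for 'blue' and 'green'; 'red' (last in ASCII order) is dropped
```

Dropping the last category avoids a redundant column: a row with all indicators False must belong to the dropped category.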
BoundsScaler class¶
class leapyear.feature.BoundsScaler(input_cols, lower=0.0, upper=1.0)¶
Scale the attributes by the bounds of the type.
BOOL, INT and REAL columns are scaled so all values fall between lower and upper (inclusive). In contrast to MinMaxScaler and StandardScaler, there is no privacy leakage when using this class, because the scaling depends only on the declared type bounds, not on the data.
Examples
Using BoundsScaler on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:
>>> bs = BoundsScaler(['col1','col2'])
>>> ds2 = bs.fit_transform(ds1)
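The underlying arithmetic can be sketched as a plain affine map (illustrative only; bounds_scale and the type_min/type_max arguments are hypothetical, standing in for the column type’s declared bounds):

```python
def bounds_scale(x, type_min, type_max, lower=0.0, upper=1.0):
    # Affine map of the declared type bounds [type_min, type_max] onto
    # [lower, upper]. The data values themselves are never inspected,
    # which is why no privacy leakage occurs.
    return lower + (x - type_min) * (upper - lower) / (type_max - type_min)

bounds_scale(5, type_min=0, type_max=10)  # -> 0.5
```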
BoundsAbsScaler class¶
class leapyear.feature.BoundsAbsScaler(input_cols)¶
Scale the attributes by the max absolute value of the type bounds.
INT and REAL columns are scaled so all values fall between -1 and 1, with no shifting of the data.
Examples
Using BoundsAbsScaler on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:
>>> bs = BoundsAbsScaler(['col1','col2'])
>>> ds2 = bs.fit_transform(ds1)
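A minimal sketch of the rule, assuming the scaler simply divides by the largest absolute type bound (bounds_abs_scale is a hypothetical helper, not part of the leapyear API):

```python
def bounds_abs_scale(x, type_min, type_max):
    # Divide by the largest absolute declared bound; there is no shift,
    # so zero stays at zero and the sign of each value is preserved.
    return x / max(abs(type_min), abs(type_max))

bounds_abs_scale(-5, type_min=-10, type_max=20)  # -> -0.25
```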
- Parameters
input_cols (Sequence[str]) – the names of the input columns.
MinMaxScaler class¶
class leapyear.feature.MinMaxScaler(input_cols, min_=0.0, max_=1.0)¶
Scale the attributes by the min and max of the attribute.
BOOL, INT and REAL columns are scaled so all values fall between min_ and max_ (inclusive).
Examples
Using MinMaxScaler on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:
>>> ms = MinMaxScaler(['col1','col2'], min_=0.0, max_=1.0)
>>> ds2 = ms.fit_transform(ds1)
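The transformation can be sketched in plain Python (illustrative only; min_max_scale is hypothetical). Note that, unlike BoundsScaler, the observed minimum and maximum of the data are used, which is a data-dependent statistic:

```python
def min_max_scale(xs, min_=0.0, max_=1.0):
    # Affine map of the attribute's observed [min, max] onto [min_, max_].
    lo, hi = min(xs), max(xs)
    return [min_ + (x - lo) * (max_ - min_) / (hi - lo) for x in xs]

min_max_scale([2, 4, 6])  # -> [0.0, 0.5, 1.0]
```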
MaxAbsScaler class¶
class leapyear.feature.MaxAbsScaler(input_cols)¶
Scale the attributes by the max absolute value of the min and the max.
INT and REAL columns are divided by the larger of |min| and |max| of the attribute, so all values fall between -1 and 1, with no shifting of the data.
Examples
Using MaxAbsScaler on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:
>>> mas = MaxAbsScaler(['col1','col2'])
>>> ds2 = mas.fit_transform(ds1)
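As a rough sketch of the rule (illustrative only; max_abs_scale is hypothetical). In contrast to BoundsAbsScaler, the divisor comes from the attribute’s observed min and max rather than the type bounds:

```python
def max_abs_scale(xs):
    # Divide by the largest observed absolute value; no shift is applied.
    m = max(abs(min(xs)), abs(max(xs)))
    return [x / m for x in xs]

max_abs_scale([-2, 1, 4])  # -> [-0.5, 0.25, 1.0]
```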
- Parameters
input_cols (Sequence[str]) – the names of the input columns.
StandardScaler class¶
class leapyear.feature.StandardScaler(input_cols, with_mean=True, with_stdev=True)¶
Scale the attributes to be centered at zero with unit variance.
INT and REAL columns have their mean removed and are scaled to unit variance.
Examples
Using StandardScaler on columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:
>>> ss = StandardScaler(['col1','col2'], with_mean=True, with_stdev=False)
>>> ds2 = ss.fit_transform(ds1)
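A minimal sketch of the computation, assuming population (not sample) variance (standard_scale is a hypothetical helper, not the leapyear implementation):

```python
def standard_scale(xs, with_mean=True, with_stdev=True):
    n = len(xs)
    mean = sum(xs) / n
    stdev = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    out = list(xs)
    if with_mean:
        out = [x - mean for x in out]   # center at zero
    if with_stdev:
        out = [x / stdev for x in out]  # scale to unit variance
    return out

scaled = standard_scale([1, 2, 3])  # mean 0, unit variance
```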
ScaleTransformModel class¶
class leapyear.feature.ScaleTransformModel(attr_lower_upper, lower, upper, scale_bool)¶
Scale attributes to new values.
Shift and scale each attribute so that the two reference values associated with each attribute (in attr_lower_upper) are mapped to lower and upper.
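The mapping described above is an affine transform. A minimal sketch (illustrative only; affine_map is a hypothetical helper):

```python
def affine_map(x, old_pair, new_pair):
    # Linearly map the two reference values (a0, a1) onto (b0, b1);
    # any other value of x is shifted and scaled by the same transform.
    a0, a1 = old_pair
    b0, b1 = new_pair
    return b0 + (x - a0) * (b1 - b0) / (a1 - a0)

affine_map(0.5, (0.0, 1.0), (-1.0, 1.0))  # -> 0.0
```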
Normalizer class¶
class leapyear.feature.Normalizer(input_cols, p, suffix=<factory>)¶
Compute the p-norm of attributes and normalize by the norm value.
Works only on INT and REAL columns.
Examples
Using Normalizer on columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:
>>> norm_trn = Normalizer(['col1','col2'], p=2)
>>> ds2 = norm_trn.fit_transform(ds1)
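The p-norm arithmetic can be sketched as follows (illustrative only; normalize is hypothetical, and this assumes the components are divided by their joint p-norm, which should be confirmed against the class behavior):

```python
def normalize(row, p=2):
    # Divide each component by the row's p-norm, so the result has norm 1.
    norm = sum(abs(x) ** p for x in row) ** (1.0 / p)
    return [x / norm for x in row]

normalize([3.0, 4.0])  # -> [0.6, 0.8]
```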
- Parameters
input_cols (Sequence[str]) – the names of the input columns.
p (int) – the order of the norm.
suffix (str) – suffix appended to the name of the transformed attribute.
fit_transform(dataset, **kwargs)¶
Normalize a set of attributes.
Computes the p-norm.
- Return type
DataSet
Winsorizer class¶
class leapyear.feature.Winsorizer(input_col, lo_val, hi_val, suffix=<factory>)¶
Bound the non-null values of the attribute to be within lo_val and hi_val.
Works only on non-nullable INT and REAL columns; nullable columns can be converted to non-nullable with leapyear.dataset.Attribute.coalesce().
When transformed, returns the DataSet with an additional attribute that is winsorized between the given low and high values for the specified attribute.
Examples
Using Winsorizer on column ‘col1’ in Dataset ‘ds1’:
>>> wins = Winsorizer('col1', lo_val=0, hi_val=1)
>>> ds2 = wins.fit_transform(ds1)
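Winsorizing is a clamp: values below lo_val become lo_val, values above hi_val become hi_val. A minimal sketch (illustrative only; winsorize is hypothetical):

```python
def winsorize(x, lo_val, hi_val):
    # Clamp x into the closed interval [lo_val, hi_val].
    return max(lo_val, min(x, hi_val))

[winsorize(v, 0, 1) for v in [-2, 0.5, 3]]  # -> [0, 0.5, 1]
```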
- Parameters
input_col (str) – the name of the input column.
lo_val (float) – after transformation the attribute will be greater than or equal to lo_val.
hi_val (float) – after transformation the attribute will be less than or equal to hi_val.
suffix (str) – suffix appended to the name of the transformed attribute. Defaults to ‘_WIN’ for Snowflake and ‘_win’ for Spark.
fit_transform(dataset, **kwargs)¶
Winsorize the specified attribute.
Called after providing lo_val and hi_val.
- Return type
DataSet
Bucketizer class¶
class leapyear.feature.Bucketizer(input_col, split_vals)¶
Quantize the attribute according to the thresholds specified in split_vals.
attr < split_vals[0] -> bin 0
split_vals[0] <= attr < split_vals[1] -> bin 1
...
attr >= split_vals[-1] -> bin len(split_vals)
Works with INT and REAL columns.
Examples
Using Bucketizer on column ‘col1’ in Dataset ‘ds1’:
>>> split_vals = [0, 0.25, 0.75]
>>> buck = Bucketizer('col1', split_vals=split_vals)
>>> ds2 = buck.fit_transform(ds1)
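The binning rule is equivalent to counting how many thresholds the value meets or exceeds, which a plain-Python sketch makes concrete (illustrative only; bucketize is hypothetical):

```python
import bisect

def bucketize(x, split_vals):
    # bin index = number of thresholds t with x >= t;
    # bisect_right on sorted split_vals computes exactly this.
    return bisect.bisect_right(split_vals, x)

split_vals = [0, 0.25, 0.75]
[bucketize(v, split_vals) for v in [-1, 0, 0.25, 1]]  # -> [0, 1, 2, 3]
```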
- Parameters
fit_transform(dataset, **kwargs)¶
For testing use only.
- Return type
DataSet