Module leapyear.analytics
Statistics and machine learning algorithms.
LeapYear analyses are functions that are executed by the server to compute statistics or to perform machine learning tasks on DataSets. These functions return an Analysis type, which is executed on the server by calling the run() method.
For simple statistics, such as count() or mean(), the values can be extracted using the following pattern:
>>> from leapyear import Client, DataSet
>>> from leapyear.analytics import count_rows, mean
>>> client = Client(url='http://ly-server:4401', username='admin', password='password')
>>> dataset = DataSet.from_table('db.table')
>>> dataset_rows_analysis = count_rows(dataset)
>>> n_rows = dataset_rows_analysis.run()
>>> print(n_rows)
10473
>>> dataset_mean_x_analysis = mean('x0', dataset)
>>> mean_x = dataset_mean_x_analysis.run()
>>> print(mean_x)
5.234212346345
The computation of all univariate statistics follows the pattern for mean(). For more complicated machine learning tasks, multiple columns must be specified, depending on the task.
Unsupervised learning tasks (like clustering) generally require specifying which features in the DataSet to use. Supervised learning tasks (like regression) additionally require specifying a target variable.
For example, we can train a linear regression model as follows:
>>> from leapyear.analytics import generalized_linreg
>>> regression = generalized_linreg(['x0', 'x1'], 'y', dataset, affine=True, l2reg=1.0)
>>> model = regression.run()
Helper routines are available for performing cross-validation (see cross_val_score_linreg()). Note that, unlike other analyses, they are executed immediately (without calling run()):
>>> from leapyear.analytics import cross_val_score_linreg
>>> cross_val_score = cross_val_score_linreg(
...     ['x0', 'x1'], 'y', dataset, cv=3,
...     affine=True, l1reg=0.1, l2reg=1.0, scorer='mse'
... )
Data Analysis
- leapyear.analytics.count(attr, dataset=None, drop_nulls=False, target_relative_error=None, max_budget=None)
Analysis: Count the elements of an attribute.
This analysis can be executed using the run method to compute the approximate count of elements, including NULL values.
The user can request additional information about the computation with run(rich_result=True). In this case, a RandomizationInterval object will be generated, likely including the precise value of the computation on the data sample.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the count of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to ignore NULL values. Default: False.
target_relative_error (Optional[float]) – A float value between 0 and 1 indicating the level of relative error that should be targeted for this computation. If specified, the system will attempt to ensure that the absolute value of the relative error between the randomized result and the true count is roughly target_relative_error. If this is not possible due to budget constraints set by the admin, the system will return a randomized result with the smallest randomization effect allowed. If not specified, a default value specified by the admin is used. This can only be used if the admin has turned on the adaptive count feature.
max_budget (Optional[float]) – The maximum amount of budget that the system should spend while trying to achieve target_relative_error. If specified, the absolute amount of budget spent will not exceed max_budget. If the user specifies a value greater than the maximum budget for the computation set by the admin, the system will use the admin-set maximum. If the system can achieve target_relative_error while spending less than max_budget, it will do so.
- Returns
Analysis object that can be executed using the run method.
- Return type
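A minimal usage sketch, following the example at the top of this page (the column name 'x0' is hypothetical, and target_relative_error assumes the admin has enabled the adaptive count feature):
>>> from leapyear.analytics import count
>>> n_x0 = count('x0', dataset, drop_nulls=True).run()
>>> n_x0_adaptive = count('x0', dataset, target_relative_error=0.05, max_budget=1.0).run()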
- leapyear.analytics.count_rows(dataset, target_relative_error=None, max_budget=None)
Analysis: Count the number of rows in a dataset.
This analysis can be executed using the run method to compute the approximate number of rows in the dataset.
The user can request additional information about the computation with run(rich_result=True). In this case, a RandomizationInterval object will be generated, likely including the precise value of the computation on the data sample.
- Parameters
dataset (DataSet) – The input dataset.
target_relative_error (Optional[float]) – A float value between 0 and 1 indicating the level of relative error that should be targeted for this computation. If specified, the system will attempt to ensure that the absolute value of the relative error between the randomized result and the true count is roughly target_relative_error. If this is not possible due to budget constraints set by the admin, the system will return a randomized result with the smallest randomization effect allowed. If not specified, a default value specified by the admin is used. This can only be used if the admin has turned on the adaptive count feature.
max_budget (Optional[float]) – The maximum amount of budget that the system should spend while trying to achieve target_relative_error. If specified, the absolute amount of budget spent will not exceed max_budget. If the user specifies a value greater than the maximum budget for the computation set by the admin, the system will use the admin-set maximum. If the system can achieve target_relative_error while spending less than max_budget, it will do so.
- Returns
Analysis object that can be executed using the run method.
- Return type
- leapyear.analytics.count_distinct(attr, dataset=None, drop_nulls=False, target_relative_error=None, max_budget=None)
Analysis: Count the unique elements of an attribute.
- Parameters
attr (Union[Attribute, str, Sequence[Union[Attribute, str]]]) – The attribute or attributes to compute the distinct count of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Remove any records with null. Unique values associated with records containing nulls are not included in the count.
target_relative_error (Optional[float]) – A float value between 0 and 1 indicating the level of relative error that should be targeted for this computation. If specified, the system will attempt to ensure that the absolute value of the relative error between the randomized result and the true count is roughly target_relative_error. If this is not possible due to budget constraints set by the admin, the system will return a randomized result with the smallest randomization effect allowed. If not specified, a default value specified by the admin is used. This can only be used if the admin has turned on the adaptive count feature.
max_budget (Optional[float]) – The maximum amount of budget that the system should spend while trying to achieve target_relative_error. If specified, the absolute amount of budget spent will not exceed max_budget. If the user specifies a value greater than the maximum budget for the computation set by the admin, the system will use the admin-set maximum. If the system can achieve target_relative_error while spending less than max_budget, it will do so.
- Returns
Prepared analysis of the count.
- Return type
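A sketch of counting distinct values of a single attribute and of a combination of attributes (column names are hypothetical):
>>> from leapyear.analytics import count_distinct
>>> n_cities = count_distinct('city', dataset).run()
>>> n_city_state = count_distinct(['city', 'state'], dataset).run()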
- leapyear.analytics.count_distinct_rows(dataset, target_relative_error=None, max_budget=None)
Analysis: Count the number of distinct rows in a dataset.
- Parameters
dataset (DataSet) – The input dataset.
target_relative_error (Optional[float]) – A float value between 0 and 1 indicating the level of relative error that should be targeted for this computation. If specified, the system will attempt to ensure that the absolute value of the relative error between the randomized result and the true count is roughly target_relative_error. If this is not possible due to budget constraints set by the admin, the system will return a randomized result with the smallest randomization effect allowed. If not specified, a default value specified by the admin is used. This can only be used if the admin has turned on the adaptive count feature.
max_budget (Optional[float]) – The maximum amount of budget that the system should spend while trying to achieve target_relative_error. If specified, the absolute amount of budget spent will not exceed max_budget. If the user specifies a value greater than the maximum budget for the computation set by the admin, the system will use the admin-set maximum. If the system can achieve target_relative_error while spending less than max_budget, it will do so.
- Returns
Analysis for counting the number of distinct rows.
- Return type
- leapyear.analytics.mean(attr, dataset=None, drop_nulls=False)
Analysis: Compute the mean of an attribute.
This analysis can be executed using the run method to compute the approximate mean of the attribute.
The user can request additional information about the computation with run(rich_result=True). In this case, the RandomizationInterval object will be generated, likely including the precise value of the computation on the data sample.
Note: If the attribute is nullable, setting drop_nulls=True is necessary for the computation to go through.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the mean of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring NULL values.
- Returns
Analysis object that can be executed using the run method.
- Return type
ScalarAnalysisWithCI
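A sketch of the rich-result pattern described above, assuming 'x0' is a nullable numeric column; the metadata attribute name follows the groupby_agg_view example later on this page and is an assumption for scalar analyses:
>>> from leapyear.analytics import mean
>>> rr = mean('x0', dataset, drop_nulls=True).run(rich_result=True)
>>> ri = rr.metadata  # assumed: RandomizationInterval for the computed mean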
- leapyear.analytics.sum(attr, dataset=None, drop_nulls=False)
Analysis: Compute the sum of a numeric attribute.
This analysis can be executed using the run method to compute the approximate sum of the attribute.
The user can request additional information about the computation with run(rich_result=True). In this case, the RandomizationInterval object will be generated, likely including the precise value of the computation on the data sample.
Note: If the attribute is nullable, setting drop_nulls=True is necessary for the computation to go through.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the sum of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring NULL values.
- Returns
Analysis object that can be executed using the run method.
- Return type
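A minimal sketch (the column name is hypothetical); aliasing the import avoids shadowing Python's built-in sum:
>>> from leapyear.analytics import sum as ly_sum
>>> total_amount = ly_sum('amount', dataset, drop_nulls=True).run()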
- leapyear.analytics.variance(attr, dataset=None, drop_nulls=False)
Analysis: Compute the variance of an attribute.
This analysis can be executed using the run method to compute the approximate variance of the attribute.
The user can request additional information about the computation with run(rich_result=True). In this case, the RandomizationInterval object will be generated, likely including the precise value of the computation on the data sample.
Note: If the attribute is nullable, setting drop_nulls=True is necessary for the computation to go through.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the variance of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring NULL values.
- Returns
Analysis object that can be executed using the run method.
- Return type
ScalarAnalysisWithCI
- leapyear.analytics.min(attr, dataset=None, drop_nulls=False)
Analysis: Compute the minimum value of an attribute.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the min of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.
- Returns
Prepared analysis of the min.
- Return type
Note
The minimum reported is the 1/1000 quantile of the attribute.
When the attribute being analyzed has a very narrow range of possible values, the minimum returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than 0.01, the minimum returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by 10/width, compute the minimum, and rescale the returned value by width/10 (see the sketch below).
The result of this analysis may be very different from the true minimum of the data sample in the following two scenarios:
1. When the underlying attribute distribution has significant outliers (e.g. a very long tail) - this is because the minimum computed is the 1/1000 quantile of the attribute, and
2. When the public lower bound is very different from the true minimum of the data sample - this is because differential privacy aims to minimize the effect of individual records on the output.
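A sketch of the rescaling workaround from the note above. The width value is hypothetical, and the column arithmetic (dataset['x0'] * factor) assumes Attribute objects support indexing and scalar multiplication:
>>> from leapyear.analytics import min as ly_min
>>> width = 0.005  # hypothetical width of the attribute's interval of possible values
>>> scaled_attr = dataset['x0'] * (10 / width)  # assumed Attribute arithmetic
>>> approx_min = ly_min(scaled_attr).run() * (width / 10)  # rescale back to original units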
- leapyear.analytics.max(attr, dataset=None, drop_nulls=False)
Analysis: Compute the maximum value of an attribute.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the max of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.
- Returns
Prepared analysis of the max.
- Return type
Note
The maximum reported is the 999/1000 quantile of the attribute.
When the attribute being analyzed has a very narrow range of possible values, the maximum returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than 0.01, the maximum returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by 10/width, compute the maximum, and rescale the returned value by width/10.
The result of this analysis may be very different from the true maximum of the data sample in the following two scenarios:
1. When the underlying attribute distribution has significant outliers (e.g. a very long tail) - this is because the maximum computed is the 999/1000 quantile of the attribute, and
2. When the public upper bound is very different from the true maximum of the data sample - this is because differential privacy aims to minimize the effect of individual records on the output.
- leapyear.analytics.median(attr, dataset=None, drop_nulls=False)
Analysis: Compute the median value of an attribute.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the median of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.
- Returns
Prepared analysis of the median.
- Return type
Note
When the attribute being analyzed has a very narrow range of possible values, the median returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than 0.01, the median returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by 10/width, compute the median, and rescale the returned value by width/10.
- leapyear.analytics.quantile(q, attr, dataset=None, drop_nulls=False)
Analysis: Compute a certain quantile q of an attribute.
- Parameters
q (float) – Quantile to compute, which must be between 0 and 1 inclusive.
attr (Union[Attribute, str]) – The attribute to compute the quantile of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.
- Returns
Prepared analysis of the quantile.
- Return type
Note
When the attribute being analyzed has a very narrow range of possible values, the quantile returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than 0.01, the quantile returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by 10/width, compute the quantile, and rescale the returned value by width/10.
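A minimal sketch computing the first and third quartiles of a column:
>>> from leapyear.analytics import quantile
>>> q1 = quantile(0.25, 'x0', dataset).run()
>>> q3 = quantile(0.75, 'x0', dataset).run()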
- leapyear.analytics.skewness(attr, dataset=None, drop_nulls=False)
Analysis: Compute the skewness of an attribute.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the skewness of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.
- Returns
Prepared analysis of the skewness.
- Return type
- leapyear.analytics.kurtosis(attr, dataset=None, drop_nulls=False)
Analysis: Compute the excess kurtosis of an attribute.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the kurtosis of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.
- Returns
Prepared analysis of the kurtosis.
- Return type
- leapyear.analytics.iqr(attr, dataset=None, drop_nulls=False)
Analysis: Compute the interquartile range of an attribute.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the interquartile range of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.
- Returns
Prepared analysis of the iqr.
- Return type
- leapyear.analytics.histogram(attr, dataset=None, bins=10, interval=None)
Analysis: Compute the histogram of the attribute in the dataset.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the histogram of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string.
bins (int) – Number of bins between the bounds. (default=10)
interval (Optional[Tuple[float, float]]) – The lower and upper bound of the histogram. Defaults to attribute bounds if None.
- Returns
Prepared analysis of the histogram.
- Return type
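A minimal sketch, assuming the attribute 'x0' takes values roughly in [0, 10]:
>>> from leapyear.analytics import histogram
>>> hist = histogram('x0', dataset, bins=20, interval=(0.0, 10.0)).run()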
- leapyear.analytics.histogram2d(x_attr, y_attr, dataset=None, x_bins=10, y_bins=10, x_range=None, y_range=None)
Analysis: Compute the 2D histogram of two attributes in the dataset.
- Parameters
x_attr (Union[Attribute, str]) – The attribute used to compute the first dimension of the histogram. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
y_attr (Union[Attribute, str]) – The attribute used to compute the second dimension of the histogram. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when x_attr or y_attr are strings.
x_bins (int) – Number of bins between the bounds in the first attribute.
y_bins (int) – Number of bins between the bounds in the second attribute.
x_range (Optional[Tuple[float, float]]) – The lower and upper bound of the first attribute for the histogram.
y_range (Optional[Tuple[float, float]]) – The lower and upper bound of the second attribute for the histogram.
- Returns
Prepared analysis of the histogram.
- Return type
- leapyear.analytics.correlation_matrix(xs, dataset, *, center=True, scale=True, **kwargs)
Analysis: Compute the correlation matrix of the set of attributes.
NOTE: This analysis does not require run().
- Parameters
xs (Sequence[str]) – A list of attribute names to compute the correlation matrix for.
dataset (DataSet) – The DataSet containing these attributes.
center (bool) – Whether to center the columns before computing the correlation matrix. If False, proceed assuming the columns are already centered.
scale (bool) – Whether to divide the covariance matrix by the number of rows. If False, do not divide.
max_timeout_sec – Specifies the maximum amount of time (in seconds) the user is willing to wait for a response. If set to None, this function will poll the server indefinitely. If it is run with scale or center set to True, the timeout will be multiplied. Defaults to waiting forever.
- Returns
The correlation matrix.
- Return type
np.ndarray
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
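Since this analysis does not require run(), the call returns the matrix directly - a minimal sketch with hypothetical column names:
>>> from leapyear.analytics import correlation_matrix
>>> corr = correlation_matrix(['x0', 'x1', 'x2'], dataset)  # np.ndarray; no run() needed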
- leapyear.analytics.covariance_matrix(xs, dataset, *, center=True, scale=True, **kwargs)
Analysis: Compute the covariance matrix of the set of attributes.
NOTE: This analysis does not require run().
- Parameters
xs (Sequence[str]) – A list of attribute names that are the features.
dataset (DataSet) – The DataSet of the attributes.
center (bool) – Whether to center the columns before computing the covariance matrix. If False, assume the columns are centered.
scale (bool) – Whether to divide the matrix by the number of rows. If False, do not divide.
max_timeout_sec – Specifies the maximum amount of time (in seconds) the user is willing to wait for a response. If set to None, this function will poll the server indefinitely. If it is run with scale or center set to True, the timeout will be multiplied. Defaults to waiting forever.
- Returns
The covariance matrix.
- Return type
np.ndarray
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- leapyear.analytics.describe(dataset, attributes=None)
Describe the columns of the dataset for use in data exploration.
The describe function provides a way for an analyst to perform initial rough data exploration on a dataset. To get more accurate statistics, the individual functions mean(), count(), et cetera, are recommended. This function does not use the analysis cache of the other statistics functions.
Numeric columns are described by their count, mean, standard deviation, minimum, maximum and the quartiles. Categorical columns (factors and booleans) are described by their count, distinct count and frequency of the most frequent element.
- Parameters
dataset – The input dataset.
attributes – The attributes to describe; defaults to all attributes of the dataset.
- Returns
Prepared analysis for describing the dataset. Execute the analysis using the run() method.
- Return type
- leapyear.analytics.groupby_agg_view(dataset, attrs, agg_attr=None, agg_type=<GroupByAggType.COUNT: 1>, *, max_groupby_agg_keys=100000000, size_threshold=None, agg_attr_and_type=None)
Compute aggregate statistics within each group and output aggregate results.
Only groups with estimated size larger than minimum_dataset_size will be returned. This parameter can be set in run.
- Parameters
dataset (DataSet) – The DataSet to perform groupby and aggregation on.
attrs (Sequence[Union[Attribute, str]]) – List of attributes to group by.
agg_attr (Optional[str]) – Compute aggregate statistics on this column within each group.
agg_type (Union[GroupByAggType, str]) – Aggregate type: 'count', 'mean' or 'sum'.
max_groupby_agg_keys (int) – This value prevents submitting computations that have a very large number of groupby keys. By default, it raises GroupbyAggTooManyKeysError if the number of groups exceeds 100000000.
size_threshold (Optional[int]) – Deprecated: see minimum_dataset_size in the run method.
agg_attr_and_type (Union[Tuple[Union[GroupByAggType, str], Optional[str]], Sequence[Tuple[Union[GroupByAggType, str], Optional[str]]], None]) – List of tuples (agg_type, agg_attr). Compute aggregate statistics defined by the agg_type on the column within each group. agg_type can be 'count', 'mean' or 'sum'.
- Returns
Analysis object that can be executed using the run method to return aggregation results. The results can be accessed as a pandas dataframe using .to_dataframe().
- Return type
Note: privacy exposure estimate for this analysis is not supported.
Example
For each age group and gender, compute the mean income.
>>> groupby_agg_view(ds, ["AGE", "GENDER"], "INCOME", "mean").run(minimum_dataset_size=1000)
For each week, compute the mean and total transaction amount.
>>> groupby_agg_view(ds, ["WEEK"], agg_attr_and_type=[("mean", "AMOUNT"), ("sum", "AMOUNT")]).run()
Look at Randomization Intervals for each group (only for ‘count’ and ‘sum’).
>>> rr = groupby_agg_view(ds, ["WEEK"], "AMOUNT", "mean").run(rich_result=True)
>>> ri_dict = rr.metadata
>>> ri_dict
{(1,): RandomizationInterval(...), (2,): RandomizationInterval(...), ...}
>>> ri_dict[(1,)]
RandomizationInterval(...)
Look at Randomization Interval for multiple aggregate results.
>>> rr = groupby_agg_view(ds, ["YEAR", "WEEK"], agg_attr_and_type=[("mean", "AMOUNT"), ("sum", "AMOUNT")]).run(rich_result=True)
>>> ri_dict = rr.metadata
>>> ri_dict[(2020, 1)][0]
RandomizationInterval(...)
Machine Learning
Unsupervised learning
- leapyear.analytics.kmeans(xs, dataset, n_iters=10, n_clusters=3)
Analysis: K-means clustering.
Identifies centers of clusters for a set of data points by:
1. Randomly initializing a chosen number of cluster centers (centroids) in the feature space
2. Associating each data point with the nearest centroid
3. Iteratively adjusting centroids to locations based on differentially private computation of the mean for each feature
- Parameters
xs – A list of attribute names that are the features.
dataset – The DataSet of the attributes.
n_iters – The number of iterations of the algorithm. Default: 10
n_clusters – The number of clusters to identify. Default: 3
- Returns
Analysis object that can be executed using the run() method. Once executed, it outputs clustering analysis results, such as centroids.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- leapyear.analytics.eval_kmeans(centroids, xs, dataset)
Analysis: Evaluate the K-means model.
Evaluate the clustering model by computing the Normalized Intra Cluster Variance (NICV).
- Parameters
centroids (ClusterModel) – The model (generated using kmeans) to evaluate.
xs (List[str]) – A list of attribute names that are the features.
dataset (DataSet) – The DataSet of the attributes.
- Returns
Analysis representing evaluation of a clustering model. It can be executed using the run() method to output the evaluation metric value.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
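A sketch of training a clustering model and evaluating its NICV on the same features (feature names are hypothetical):
>>> from leapyear.analytics import kmeans, eval_kmeans
>>> model = kmeans(['x0', 'x1'], dataset, n_iters=10, n_clusters=3).run()
>>> nicv = eval_kmeans(model, ['x0', 'x1'], dataset).run()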
- leapyear.analytics.pca(xs, dataset, **kwargs)
Principal Component Analysis.
Compute the Principal Component Analysis (PCA) of the set of attributes using a differentially private algorithm.
NOTE: This analysis does not require run().
- Parameters
xs (List[str]) – A list of attribute names representing features to be considered for this analysis.
dataset (DataSet) – DataSet that includes these attributes.
max_timeout_sec – Specifies the maximum amount of time (in seconds) the user is willing to wait for a response. If set to None, this function will poll the server indefinitely. If it is run with scale or center set to True, the timeout will be multiplied. Defaults to waiting forever.
- Return type
Tuple[ndarray, ndarray]
- Returns
explained_variances – Variance explained by each of the principal components - in other words, the variance of each principal component coordinate when considered as a feature on the input dataset.
pca_matrix – Transformation matrix that can be used to translate original features to principal component coordinates. If all principal components are included, this becomes a square matrix corresponding to an orthogonal transformation (e.g. reflection).
This matrix can be used to generate principal component features using the leapyear.dataset.DataSet.transform() operation, as in: tfds = ds.transform(x_vars, pca_matrix, 'pca')
NOTE: Signs may not match the PCA transformation matrix computed by scikit-learn.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
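A sketch combining pca with the transform operation quoted above:
>>> from leapyear.analytics import pca
>>> x_vars = ['x0', 'x1', 'x2']
>>> explained_variances, pca_matrix = pca(x_vars, dataset)  # no run() needed
>>> tfds = dataset.transform(x_vars, pca_matrix, 'pca')  # principal-component features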
Supervised learning
- leapyear.analytics.basic_linreg(xs, y, dataset, *, affine=True, l1reg=0.0, l2reg=1.0, parameter_bounds=None)
Analysis: Linear regression.
Implements a differentially private algorithm to represent the outcome (target) variable as a linear combination of selected features.
Note
To help ensure that the differentially private training process can effectively optimize regression coefficients, it's important to re-scale features (both dependent/target and independent/explanatory) to a similar domain (e.g. [0,1]). This can be done using leapyear.feature.BoundsScaler and will help ensure that the domain being searched for the coefficients will include the optimal model. See LeapYear guides for this and other recommendations on training accurate regressions.
Note
Differentially private regressions aim to optimize models for predictive tasks, while protecting sensitive information from being learned from the trained model. They are not guaranteed to result in coefficients close to those that a non-differentially private algorithm would learn. In other words, while regressions trained with differential privacy may excel at predictive tasks, keep in mind that they were not designed for inference.
Please see the guides for more details on using regressions in LeapYear.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
affine (bool) – If True, fit an intercept term.
l1reg (float) – The L1 regularization. Default value: 0.0.
l2reg (float) – The L2 regularization. Default value: 1.0. Must be at least 0.0001 to limit the randomization effect for models optimized via objective perturbation.
parameter_bounds (Optional[List[Tuple[float, float]]]) – Restriction on the model parameters, including the intercept. Required for differential privacy. Default value: -10.0 to 10.0 for each parameter.
- Returns
Analysis representing the regression problem. It can be executed using the run() method to output the calibrated model.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
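A sketch with explicit parameter bounds; with affine=True and two features there are three parameters including the intercept, so three bounds are supplied (this reading of parameter_bounds is an assumption):
>>> from leapyear.analytics import basic_linreg
>>> analysis = basic_linreg(
...     ['x0', 'x1'], 'y', dataset,
...     affine=True, l2reg=1.0,
...     parameter_bounds=[(-10.0, 10.0)] * 3,  # assumed: one (low, high) pair per parameter
... )
>>> model = analysis.run()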
- leapyear.analytics.generalized_linreg(xs, y, dataset, *, affine=True, l2reg=1.0, weight=None, offset=None, max_iters=25, family='gaussian', link='identity', link_power=0, variance_power=1)
Analysis: Generalized linear regression.
Implements a differentially private algorithm to represent the outcome (target) variable as a linear combination of selected features. This computation is based on the iterative weighted least squares algorithm. Trains using the "glm" algorithm.
Available generalizations include:
- offset of outputs based on a pre-existing model - this enables modeling of residuals
- use of alternative link functions applied to the linear combination of features
- application of regularization and weights during model optimization.
Note
Differentially private regressions aim to optimize models for predictive tasks, while protecting sensitive information from being learned from the trained model. They are not guaranteed to result in coefficients close to those that a non-differentially private algorithm would learn. In other words, while regressions trained with differential privacy may excel at predictive tasks, keep in mind that they were not designed for inference.
Please see the guides for more details on using regressions in LeapYear.
- Parameters
xs (List[Union[Attribute, str]]) – A list of Attributes or attribute names to be used as features.
y (Union[Attribute, str]) – The Attribute or attribute name of the target.
dataset (DataSet) – The DataSet of the attributes.
affine (bool) – True if the algorithm should fit an intercept term.
l2reg (float) – The L2 regularization parameter. Must be non-negative.
weight (Union[Attribute, str, None]) – Optional column used to weight each sample. Implies generalized regression.
offset (Union[Attribute, str, None]) – Optional column for offset in offset regression. Implies generalized regression.
max_iters (Optional[int]) – Optional maximum number of iterations for fitting the regression. Note that regardless of this setting, the system will often stop before reaching max_iters - e.g. after a single iteration. In such cases, a higher value for max_iters may lead to less privacy allocated to each iteration, and ultimately a higher randomization effect.
family (Optional[str]) – Optional distribution of the label. Implies generalized regression. Possible values are 'gaussian' (the default), 'poisson', 'gamma' and 'tweedie'.
link (Optional[str]) – Optional link function between the mean of the label distribution and the prediction. Implies generalized regression. Possible values depend on family: 'gaussian' supports only 'identity' (default), 'log' and 'inverse'; 'poisson' supports only 'log' (default), 'identity' and 'sqrt'; 'gamma' supports only 'inverse' (default), 'identity' and 'log'. There is no link function for the 'tweedie' family; use the variance_power and link_power parameters instead.
link_power (Optional[int]) – For the 'tweedie' distribution only, the exponent of the link function. Default value is 0, which is equivalent to the 'identity' link.
variance_power (Optional[int]) – For the 'tweedie' distribution only, the exponent of the variance. Default value is 1, which is equivalent to the 'gaussian' family.
- Returns
Analysis of the regression problem, which can be executed using the run() function to output the calibrated model.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
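A sketch of a Poisson regression with the log link, for a count-valued target (column names are hypothetical):
>>> from leapyear.analytics import generalized_linreg
>>> reg = generalized_linreg(['x0', 'x1'], 'n_events', dataset, family='poisson', link='log')
>>> model = reg.run()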
- leapyear.analytics.logreg(xs, y, dataset, affine=True, l1reg=0.0, l2reg=1.0)
Analysis: Logistic regression.
Implements a differentially private algorithm to represent the outcome (target) variable as a logit-transformation of a linear combination of selected features. Trains using the "basic" algorithm.
Available generalizations include:
- regularization applied during model optimization
Note
To help ensure that the differentially private training process can effectively optimize regression coefficients, it's important to re-scale features (both dependent/target and independent/explanatory) to a similar domain (e.g. [0,1]). This can be done using leapyear.feature.BoundsScaler and will help ensure that the domain being searched for the coefficients will include the optimal model. See LeapYear guides for this and other recommendations on training accurate regressions.
Note
Differentially private regressions aim to optimize models for predictive tasks, while protecting sensitive information from being learned from the trained model. They are not guaranteed to result in coefficients close to those that a non-differentially private algorithm would learn. In other words, while regressions trained with differential privacy may excel at predictive tasks, keep in mind that they were not designed for inference.
Please see the guides for more details on using regressions in LeapYear.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
affine (bool) – If True, fit an intercept term.
l1reg (float) – The L1 regularization. Default value: 0.0.
l2reg (float) – The L2 regularization. Default value: 1.0.
- Returns
Analysis training the logistic regression model. It can be executed using the run() method to output the calibrated model.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
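A minimal sketch, assuming 'label' is a BOOL attribute and the features are rescaled as the note above recommends:
>>> from leapyear.analytics import logreg
>>> model = logreg(['x0', 'x1'], 'label', dataset, affine=True, l1reg=0.0, l2reg=1.0).run()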
- leapyear.analytics.generalized_logreg(xs, y, dataset, *, affine=True, l1reg=1.0, l2reg=1.0, weight=None, offset=None, max_iters=25, link='logit')
Analysis: Generalized logistic regression.
Implements a differentially private algorithm to represent the outcome (target) variable as a logit-transformation of a linear combination of selected features. This computation is based on the iterative weighted least squares algorithm. Trains using the "glm" algorithm.
Available generalizations include:
- offset of outputs based on a pre-existing model - this enables modeling of residuals
- use of alternative link functions applied to the linear combination of features
- application of regularization and weights during model optimization.
Note
Differentially private regressions aim to optimize models for predictive tasks, while protecting sensitive information from being learned from the trained model. They are not guaranteed to result in coefficients close to those that a non-differentially private algorithm would learn. In other words, while regressions trained with differential privacy may excel at predictive tasks, keep in mind that they were not designed for inference.
Please see the guides for more details on using regressions in LeapYear.
- Parameters
xs (List[Union[Attribute, str]]) – A list of Attributes or attribute names to be used as features.
y (Union[Attribute, str]) – The Attribute or attribute name of the target.
dataset (DataSet) – The DataSet of the attributes.
affine (bool) – True if the algorithm should fit an intercept term.
l2reg (float) – The L2 regularization parameter. Must be non-negative.
weight (Union[Attribute, str, None]) – Optional column used to weight each sample. Implies generalized regression.
offset (Union[Attribute, str, None]) – Optional column for offset in offset regression. Implies generalized regression.
max_iters (Optional[int]) – Optional maximum number of iterations for fitting the regression. Note that regardless of this setting, the system will often stop before reaching max_iters - e.g. after a single iteration. In such cases, a higher value for max_iters may lead to less privacy allocated to each iteration, and ultimately a higher randomization effect.
link (Optional[str]) – Optional link function between the mean of the label distribution and the prediction. Implies generalized regression. Possible values are 'logit' (default), 'probit' and 'cloglog'.
- Returns
Analysis of the regression problem, which can be executed using the run() function to output the calibrated model.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- leapyear.analytics.gradient_boosted_tree_classifier(xs, y, dataset, max_depth=3, max_iters=5, max_bins=32)
Analysis: Gradient boosted tree classifier.
This analysis trains a randomized variant of a gradient boosted tree classifier to predict a BOOLEAN outcome (target).
The algorithm works by iteratively training individual decision trees to predict a "residual" of the model built so far, and then integrating each newly built decision tree into the ensemble model to better predict the probability of the positive label.
Weights are used at different stages:
- during training of individual decision trees, to focus attention on the areas where the model consistently underperforms, and
- when combining individual decision trees to predict the probability of the positive label.
A calibrated level of randomization is applied to individual leaves of the decision trees to help protect the privacy of the individual records used for model training.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attributes or attribute names that are used as explanatory features for the analysis. Each attribute must be either BOOL, INT, REAL or FACTOR. Nullable types are not supported and must be converted to non-nullable - e.g. via coalesce.
y (Union[Attribute, str]) – The attribute or attribute name that is used as an outcome (target) of the classification model. Must be BOOLEAN type, as only binary classification models are supported. Nullable types are not supported and must be converted to non-nullable - e.g. via coalesce.
dataset (DataSet) – The DataSet containing both explanatory features and outcome attributes.
max_depth (int) – The maximum depth (or height) of any tree in the ensemble produced by the algorithm. Default: 3
max_iters (int) – The maximum number of iterations of the algorithm. This corresponds to the maximum number of individual decision trees in the ensemble. Default: 5
max_bins (int) – The maximum number of bins for features used in constructing trees. Default: 32
Note
The maximum number of bins should be set to no less than the number of distinct possible values of the FACTOR attributes used as explanatory features.
- Returns
Analysis that will train the gradient boosted tree classifier. It can be executed using the run() method.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
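A minimal sketch, assuming 'label' is a non-nullable BOOL attribute and the features are non-nullable:
>>> from leapyear.analytics import gradient_boosted_tree_classifier
>>> gbt_model = gradient_boosted_tree_classifier(
...     ['x0', 'x1'], 'label', dataset, max_depth=3, max_iters=5, max_bins=32,
... ).run()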
- leapyear.analytics.random_forest(xs, y, dataset, n_trees=100, height=3)
Analysis: Random Forest Classifier.
Generate a random forest model to predict the probability associated with each target class.
Random forests combine many decision trees in order to reduce the risk of overfitting.
Each decision tree is developed on a random subset of observations - and is limited to a prescribed height.
Individual node split decisions are made to maximize split value (or gain) - with a variation that a differentially private algorithm is used to count the number of observations belonging to each target class on both sides of the split.
Specifically, split value (or gain) is defined as the reduction in combined Gini impurity associated with introducing the split for a given parent node. Here:
- Gini impurity for any given node (parent or child) is calculated based on the distribution of observations within the node across different outcome (target) classes
- To compute the combined impurity of a pair of nodes, the individual impurities of the two child nodes are averaged proportionately to their share of observations
Categorical features are typically handled by evaluating various splits corresponding to random subsets of the available categories.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attribute names that are used as features for explanatory analysis.
y (Union[Attribute, str]) – The attribute name that is the outcome (target).
dataset (DataSet) – The DataSet containing both explanatory features and outcome attributes.
n_trees (int) – The number of trees to use in the random forest. Default: 100
height (int) – The maximum height of the trees. Default: 3
- Returns
Analysis training the random forest model. It can be executed using the run() method to output the analysis results, which include the calibrated random forest model, feature importance statistics, etc.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- leapyear.analytics.regression_trees(xs, y, dataset, n_trees=100, height=3)
Analysis: Random Forest Regressor (regression trees).
Generate a regression trees model to predict the value of the target variable.
Regression trees are built similarly to random forests, but instead of predicting the probability that the target variable takes a certain categorical value (i.e., classification), they predict a real value of the target variable (i.e., regression).
The impurity metric in this case is the variance of the target variable for the datapoints that fall into the current node's partition.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attribute names that are used as features for explanatory analysis.
y (Union[Attribute, str]) – The attribute name that is the outcome (target).
dataset (DataSet) – The DataSet containing both explanatory features and the target attribute.
n_trees (int) – The number of trees to use in the random forest. Default: 100.
height (int) – The maximum height of the trees. Default: 3.
- Returns
Analysis training the regression trees model. It can be executed using the run() method to output the analysis results, which include the calibrated random forest model, feature importance statistics, etc.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- leapyear.analytics.eval_linreg(glm, xs, y, dataset, metric='mse')
Analysis: Evaluate a linear regression model.
- Parameters
glm (GLM) – The model (generated using linreg) to evaluate.
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
metric – Linear regression evaluation metric: 'mse'/'mean_squared_error' or 'mae'/'mean_absolute_error'.
Note
During the calculation of mse and mae metrics, the individual values of absolute error are restricted to be no greater than the length of the interval of possible values of the target attribute, as seen in the dataset schema. For example, if the target attribute contains values in the interval [-50, 50], then the absolute error of any individual prediction used in computing the mean will be no greater than 100.
- Returns
Analysis representing evaluation of a regression model. It can be executed using the run() method to output the evaluation metric value.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
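A sketch chaining training and evaluation. Evaluating on the training data is shown for brevity; a held-out dataset would be the more careful choice:
>>> from leapyear.analytics import basic_linreg, eval_linreg
>>> model = basic_linreg(['x0', 'x1'], 'y', dataset).run()
>>> mse = eval_linreg(model, ['x0', 'x1'], 'y', dataset, metric='mse').run()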
- leapyear.analytics.eval_logreg(glm, xs, y, dataset, metric='accuracy')
Analysis: Evaluate a logistic regression model.
- Parameters
glm (GLM) – The model (generated using logreg) to evaluate.
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
metric (Union[str, Metric]) – Logistic regression evaluation metric. Examples: 'accuracy', 'logloss', 'auroc', 'aupr'.
- Returns
Analysis representing evaluation of a logistic regression model. It can be executed using the run() method to output the evaluation metric value.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- leapyear.analytics.eval_gbt_classifier(gbt, xs, y, dataset, metric='accuracy')
Analysis: Evaluate a gradient boosted tree (GBT) classifier model.
- Parameters
gbt (GradientBoostedTreeClassifier) – The model to evaluate.
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
metric (Union[str, Metric]) – GBT evaluation metric. Currently only supports 'accuracy'.
- Returns
Analysis representing evaluation of a GBT classifier model. It can be executed using the run() method to output the value of the evaluation metric.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- leapyear.analytics.eval_random_forest(rf, xs, y, dataset, metric='accuracy')
Analysis: Evaluate a random forest model.
- Parameters
rf (RandomForestClassifier) – The model (generated using random_forest) to evaluate.
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
metric (Union[str, Metric]) – Forest evaluation metric. Examples: 'logloss', 'accuracy', 'auroc', 'aupr'.
- Returns
Analysis representing evaluation of a random forest model. It can be executed using the run() method to output the evaluation metric value.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
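A sketch of training and evaluating a classifier; it is assumed here that the object returned by run() can be passed back as rf (the run() output is documented above as including the calibrated model):
>>> from leapyear.analytics import random_forest, eval_random_forest
>>> rf_model = random_forest(['x0', 'x1'], 'label', dataset, n_trees=100, height=3).run()
>>> acc = eval_random_forest(rf_model, ['x0', 'x1'], 'label', dataset, metric='accuracy').run()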
- leapyear.analytics.eval_regression_trees(rf, xs, y, dataset, metric='mse')
Analysis: Evaluate a regression trees model.
- Parameters
rf (RandomForestClassifier) – The model (generated using regression_trees) to evaluate.
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
metric – Model evaluation metric. Examples: 'mse'/'mean_squared_error' or 'mae'/'mean_absolute_error'.
Note
During the calculation of mse and mae metrics, the individual values of absolute error are restricted to be no greater than the length of the interval of possible values of the target attribute, as seen in the dataset schema. For example, if the target attribute contains values in the interval [-50, 50], then the absolute error of any individual prediction used in computing the mean will be no greater than 100.
- Returns
Analysis representing evaluation of a regression trees model. It can be executed using the run() method to output the evaluation metric value.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- leapyear.analytics.roc(model, xs, y, dataset, thresholds=5)
Compute the ConfusionCurves.
For each threshold value, compute the normalized confusion matrix using the model. The confusion matrix contains the true positive rate, the true negative rate, the false positive rate and the false negative rate.
- Parameters
model (Union[GLM, RandomForestClassifier]) – The model to evaluate the confusion curves on.
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
thresholds (Union[int, Sequence[float]]) – If an int, approximately that many thresholds (rounded to the closest power of 2) are generated using recursive medians. If a sequence of floats, the list is used as the thresholds.
- Returns
Analysis of the confusion curve, which can be executed using the run() method to output various evaluation metrics.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
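A sketch computing confusion curves for a previously trained classification model (model as in the examples above):
>>> from leapyear.analytics import roc
>>> curves = roc(model, ['x0', 'x1'], 'label', dataset, thresholds=5).run()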
- leapyear.analytics.cross_val_score_linreg(xs, y, dataset, *, affine=True, l1reg=1.0, l2reg=1.0, cv=3, metric='mean_squared_error', parameter_bounds=None)
Analysis: Compute the linear regression cross-validation score of the set of attributes.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
affine (bool) – If True, fit an intercept term.
l1reg (float) – The L1 regularization. Default value: 1.0.
l2reg (float) – The L2 regularization. Default value: 1.0. Must be at least 0.0001 to limit the randomization effect.
cv (int) – Number of folds in k-fold cross validation.
metric (Union[str, Metric]) – The metric for evaluating the regression. Examples: 'mae', 'mse', 'r2'.
parameter_bounds (Optional[List[Tuple[float, float]]]) – Restriction on the model parameters, including the intercept. Required for differential privacy. Default value: -10.0 to 10.0 for each parameter.
- Returns
Analysis of the cross-validation scores for the regression model. It can be executed using the run() method to generate cross-validation results.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- leapyear.analytics.cross_val_score_logreg(xs, y, dataset, cv=3, affine=True, l1reg=0.1, l2reg=0.1, metric='accuracy')
Analysis: Compute the logistic regression cross-validation score of the set of attributes.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
affine (bool) – If True, fit an intercept term.
l1reg (float) – The L1 regularization. Default value: 0.1.
l2reg (float) – The L2 regularization. Default value: 0.1. Must be at least 0.0001 to limit the randomization effect.
cv (int) – Number of folds in k-fold cross validation.
metric (Union[str, Metric]) – The metric for evaluating the logistic regression. Examples: 'accuracy', 'logloss', 'auroc', 'aupr'.
- Returns
Analysis of the cross-validation scores for the regression model. It can be executed using the run() method to generate cross-validation results.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- leapyear.analytics.cross_val_score_random_forest(xs, y, dataset, n_trees=100, height=3, cv=3, metric='mean_squared_error')
Analysis: Compute the random forest cross-validation score of the set of attributes.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
n_trees (int) – Number of trees.
height (int) – Maximum height of trees.
cv (int) – Number of folds in k-fold cross validation.
metric (Union[str, Metric]) – The metric for evaluating the regression.
- Returns
Analysis of the cross-validation scores for the random forest model. It can be executed using the run() method to generate cross-validation results.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- leapyear.analytics.cross_val_score_regression_trees(xs, y, dataset, n_trees=100, height=3, cv=3, metric='mean_squared_error')
Analysis: Compute the regression trees cross-validation score of the set of attributes.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
n_trees (int) – Number of trees.
height (int) – Maximum height of trees.
cv (int) – Number of folds in k-fold cross validation.
metric (Union[str, Metric]) – The metric for evaluating the regression.
- Returns
Analysis of the cross-validation scores for the regression trees model. It can be executed using the run() method to generate cross-validation results.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- leapyear.analytics.hyperopt_linreg(xs, y, dataset, *, cv, train_fraction, metric, n_iter=100, l1_bounds=(1e-10, 10000000000.0), l2_bounds=(1e-10, 10000000000.0), fit_intercept=None, parameter_bounds=None)
Analysis: Hyperparameter optimization for linear regression.
Calibrate a linear regression model by optimizing its cross-validation score with respect to the model hyperparameters - the L1 and L2 regularization parameters and the presence of an intercept.
See below for pseudo-code of the algorithm:
Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = [].
For 1..n_iter:
    pick a set of hyperparameters (hp) to test based on cv_history
    use hp to calibrate a model on each cross-validation set
    evaluate it on the corresponding sample set aside for cross-validation
    compute an average cv score and append it to cv_history
Pick the hyperparameters with the best cv score.
Train a model using the complete ds_train_val data set.
Evaluate the model on the holdout data set.
Return the resulting model and its performance on the holdout set.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attributes that are the features.
dataset (DataSet) – The dataset containing the attributes.
cv (int) – The number of cross-validation steps to perform for each candidate set of hyperparameters.
train_fraction (float) – The fraction of the dataset to set aside for model training and cross-validation - to be split further according to the k-fold cross-validation strategy.
metric (Union[str, Metric]) – Model performance metric to optimize. Examples: 'mean_squared_error', 'mean_absolute_error', 'r2'.
n_iter (int) – The number of optimization steps. Default: 100
l1_bounds (Tuple[float, float]) – Lower and upper bounds for L1 regularization. Default: (1E-10, 1E10)
l2_bounds (Tuple[float, float]) – Lower and upper bounds for L2 regularization. Default: (1E-10, 1E10)
fit_intercept (Optional[bool]) – If None, the search will consider both options.
parameter_bounds (Optional[List[Tuple[float, float]]]) – Restriction on the model parameters, including the intercept. Required for differential privacy. Default value: -10.0 to 10.0 for each parameter.
- Returns
Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the
run()
method to output the analysis results, includingmodel calibrated with recommended hyperparameters and
its performance on the holdout dataset.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
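For example, a minimal sketch of a tuning run, assuming the feature and outcome columns from the earlier examples; the three parameter_bounds entries (two coefficients plus the intercept) and the other settings are illustrative:
>>> from leapyear.analytics import hyperopt_linreg
>>> tuning = hyperopt_linreg(
>>>     ['x0', 'x1'], 'y', dataset,
>>>     cv=3, train_fraction=0.8, metric='mean_squared_error', n_iter=50,
>>>     parameter_bounds=[(-10.0, 10.0)] * 3,
>>> )
>>> result = tuning.run()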
-
leapyear.analytics.hyperopt_logreg(xs, y, dataset, cv, train_fraction, metric, n_iter=100, l1_bounds=(1e-10, 1e10), l2_bounds=(1e-10, 1e10), fit_intercept=None)¶ Analysis: Hyperparameter optimization for logistic regression.
Calibrate a logistic regression model by optimizing its cross-validation score with respect to the model hyperparameters: the L1 and L2 regularization parameters and the presence of an intercept.
See below for pseudo-code of the algorithm:
Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross-validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = [].
For 1..n_iter:
    Pick a set of hyperparameters (hp) to test based on cv_history.
    Use hp to calibrate a model on each cross-validation training set.
    Evaluate it on the corresponding sample set aside for cross-validation.
    Compute an average cv score and append it to cv_history.
Pick the hyperparameters with the best cv score.
Train a model using the complete ds_train_val dataset.
Evaluate the model on the holdout dataset.
Return the resulting model and its performance on the holdout set.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attributes that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The dataset containing the attributes.
cv (int) – The number of cross-validation steps to perform for each candidate set of hyperparameters.
train_fraction (float) – The fraction of the dataset to set aside for model training and cross-validation – to be split further according to the k-fold cross-validation strategy.
metric (Union[str, Metric]) – Model performance metric to optimize. Examples: 'accuracy', 'logloss', 'auroc', 'aupr'.
n_iter (int) – The number of optimization steps. Default: 100.
l1_bounds (Tuple[float, float]) – Lower and upper bounds for L1 regularization. Default: (1E-10, 1E10).
l2_bounds (Tuple[float, float]) – Lower and upper bounds for L2 regularization. Default: (1E-10, 1E10).
fit_intercept (Optional[bool]) – If None, the search will consider both options.
- Returns
Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including the model calibrated with the recommended hyperparameters and its performance on the holdout dataset.
- Return type
Analysis
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
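The call shape mirrors hyperopt_linreg(); a minimal sketch with a classification metric, assuming a hypothetical binary outcome column named 'label':
>>> from leapyear.analytics import hyperopt_logreg
>>> tuning = hyperopt_logreg(
>>>     ['x0', 'x1'], 'label', dataset,
>>>     cv=3, train_fraction=0.8, metric='auroc', n_iter=50,
>>> )
>>> result = tuning.run()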
-
leapyear.analytics.hyperopt_rf(xs, y, dataset, cv, train_fraction, metric, n_iter=100, max_trees=1000, max_depth=20)¶ Analysis: Hyperparameter optimization for a random forest model.
Calibrate a random forest model by optimizing its cross-validation score with respect to the model hyperparameters: the number of trees and the individual tree depth (or height) limit.
See below for pseudo-code of the algorithm:
Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross-validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = [].
For 1..n_iter:
    Pick a set of hyperparameters (hp) to test based on cv_history.
    Use hp to calibrate a model on each cross-validation training set.
    Evaluate it on the corresponding sample set aside for cross-validation.
    Compute an average cv score and append it to cv_history.
Pick the hyperparameters with the best cv score.
Train a model using the complete ds_train_val dataset.
Evaluate the model on the holdout dataset.
Return the resulting model and its performance on the holdout set.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attributes that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The dataset containing the attributes.
cv (int) – The number of cross-validation steps to perform for each candidate set of hyperparameters.
train_fraction (float) – The fraction of the dataset to set aside for model training and cross-validation – to be split further according to the k-fold cross-validation strategy.
metric (Union[str, Metric]) – The metric to optimize. Examples: 'accuracy', 'logloss', 'auroc', 'aupr'.
n_iter (int) – The number of optimization steps. Default: 100.
max_trees (int) – Maximum number of trees. Default: 1000.
max_depth (int) – Maximum tree depth. Default: 20.
- Returns
Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including the model calibrated with the recommended hyperparameters and its performance on the holdout dataset.
- Return type
Analysis
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
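A minimal sketch, again with illustrative column names, that tightens the search space via max_trees and max_depth:
>>> from leapyear.analytics import hyperopt_rf
>>> tuning = hyperopt_rf(
>>>     ['x0', 'x1'], 'label', dataset,
>>>     cv=3, train_fraction=0.8, metric='accuracy',
>>>     n_iter=25, max_trees=200, max_depth=10,
>>> )
>>> result = tuning.run()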
-
leapyear.analytics.hyperopt_regression_trees(xs, y, dataset, cv, train_fraction, metric, n_iter=100, max_trees=1000, max_depth=20)¶ Analysis: Hyperparameter optimization for a regression trees model.
Calibrate a regression trees model by optimizing its cross-validation score with respect to the model hyperparameters: the number of trees and the individual tree depth (or height) limit.
See below for pseudo-code of the algorithm:
Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross-validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = [].
For 1..n_iter:
    Pick a set of hyperparameters (hp) to test based on cv_history.
    Use hp to calibrate a model on each cross-validation training set.
    Evaluate it on the corresponding sample set aside for cross-validation.
    Compute an average cv score and append it to cv_history.
Pick the hyperparameters with the best cv score.
Train a model using the complete ds_train_val dataset.
Evaluate the model on the holdout dataset.
Return the resulting model and its performance on the holdout set.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attributes that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The dataset containing the attributes.
cv (int) – The number of cross-validation steps to perform for each candidate set of hyperparameters.
train_fraction (float) – The fraction of the dataset to set aside for model training and cross-validation – to be split further according to the k-fold cross-validation strategy.
metric (Union[str, Metric]) – The metric to optimize. Examples: 'mae', 'mse', 'r2'.
n_iter (int) – The number of optimization steps. Default: 100.
max_trees (int) – Maximum number of trees. Default: 1000.
max_depth (int) – Maximum tree depth. Default: 20.
- Returns
Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including the model calibrated with the recommended hyperparameters and its performance on the holdout dataset.
- Return type
Analysis
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
Context Managers¶
-
leapyear.analytics.ignore_computation_cache()¶ Temporary context where computations do not utilize the computation cache.
The computation cache is intended to prevent wasting privacy exposure on queries that were previously computed. Entering this context manager will disable the use of the cache and allow repeated computations to return different differentially private answers.
Example
An administrator wants to run a count multiple times to estimate the random distribution of responses around the precise value.
>>> with ignore_computation_cache():
>>>     results = [la.count_rows(table).run() for _ in range(10)]
Note
Additional permissions may be required to disable the computation cache.
- Return type
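Continuing the example above, the spread of the randomized responses can then be summarized with the standard library; la and table are as in that example:
>>> import statistics
>>> with ignore_computation_cache():
>>>     results = [la.count_rows(table).run() for _ in range(10)]
>>> print(statistics.mean(results), statistics.stdev(results))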
-
leapyear.analytics.precise_computations(precise=True)¶ Temporary context specifying whether the computations are precise or not.
Computations requested within this context are executed in precise mode, where differential privacy is not applied.
- Parameters
precise (bool) – True to enable precise computations within the context, False to disable them.
Example
An administrator wants to compare the responses of a number of computations with and without differential privacy applied. Precise mode may not be available for all computations.
>>> def my_computation():
>>>     symbols = ("APPL", "GOOG", "MSFT")
>>>     return [la.count_rows(table.where(col("SYM") == lit(val))).run() for val in symbols]
>>>
>>> res_dp = my_computation()
>>> with precise_computations():
>>>     res_no_dp = my_computation()
Note
Additional permissions may be necessary to enable precise computations.
- Return type
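Continuing that example, the per-symbol randomization effect can be estimated by comparing the two result lists:
>>> for dp, precise in zip(res_dp, res_no_dp):
>>>     print(dp - precise)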
Save/Load Models¶
Utilities for saving and loading LeapYear machine learning models.
-
leapyear.ml_import_export.save(model, path_or_fd)¶ Save a machine learning model as JSON to either a file or a file-like object.
- Parameters
model (Union[ClusterModel, GLM, GradientBoostedTreeClassifier, RandomForestClassifier, RandomForestRegressor, RichResult[Union[ClusterModel, GLM, GradientBoostedTreeClassifier, RandomForestClassifier, RandomForestRegressor], Any]]) – Any machine learning model executed using the run() method.
path_or_fd – The path where to save the file in the file system, or a descriptor for an in-memory stream.
Example
>>> from leapyear.ml_import_export import save
>>> save(model, 'model.json')
- Return type
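Because path_or_fd may be a file-like object, a model can also be round-tripped through an in-memory buffer. A minimal sketch, assuming the JSON payload is written as text (hence io.StringIO):
>>> import io
>>> from leapyear.ml_import_export import save, load
>>> buffer = io.StringIO()
>>> save(model, buffer)
>>> buffer.seek(0)
>>> restored = load(buffer)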
-
leapyear.ml_import_export.load(path_or_fd, expected_model_type=None, **kwargs)¶ Load machine learning models from a file or a file-like object.
- Parameters
path_or_fd – The path in the file system or an in-memory stream from which to load the model.
expected_model_type – If None, the type of the loaded model is not checked. Otherwise, checks that the loaded model is of the expected type.
rf_type – When loading RandomForest models with serialization number 0, setting this to "classification" or "regression" will load the model as a RandomForestClassifier or RandomForestRegressor object, respectively. If not specified, loading a RandomForest model will raise an error. The value is ignored for all other model types.
Examples
Loading a previously saved model of unspecified type:
>>> from leapyear.ml_import_export import load
>>> model = load('model.json')
Loading a previously saved RandomForestClassifier model:
>>> from leapyear.ml_import_export import load
>>> model = load('random_forest_classifier.json', RandomForestClassifier)
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Union[ClusterModel, GLM, GradientBoostedTreeClassifier, RandomForestClassifier, RandomForestRegressor]