Module leapyear.analytics<a class="headerlink" href="#module-leapyear.analytics" title="Permalink to this headline">¶

leapyear.analytics.count_distinct_rows(dataset, target_relative_error=None, max_budget=None)¶

Analysis: Count the number of distinct rows in a dataset.

Parameters

dataset (DataSet) – The input dataset.
target_relative_error (Optional[float]) – A float value between 0 and 1 indicating the level of relative error that should be targeted for this computation. If specified, the system will attempt to ensure that the absolute value of the relative error between the randomized result and the true count is roughly target_relative_error. If this is not possible due to budget constraints set by the admin, the system will return a randomized result with the smallest randomization effect allowed. If not specified, a default value specified by the admin is used. This can only be used if the admin has turned on the adaptive count feature.
max_budget (Optional[float]) – The maximum amount of budget that the system should spend while trying to achieve target_relative_error. If specified, the absolute amount of budget spent will not exceed max_budget. If the user specifies a value greater than the maximum budget for the computation set by the admin, system will use the admin-set maximum. If the system can achieve target_relative_error while spending less than max_budget, it will do so.

Returns

Analysis for counting the number of distinct rows.

Return type

leapyear.analytics.mean(attr, dataset=None, drop_nulls=False)¶

Analysis: Compute the mean of an attribute.

This analysis can be executed using the run method to compute the approximate mean of the attribute.

The user can request additional information about the computation with run(rich_result=True). In this case, the RandomizationInterval object will be generated, likely including the precise value of the computation on the data sample.

Note: If the attribute is nullable, setting drop_nulls=True is necessary for the computation to go through.

Parameters

attr (Union[Attribute, str]) – The attribute to compute the mean of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring NULL values.

Returns

Analysis object that can be executed using the run method.

Return type

ScalarAnalysisWithCI

leapyear.analytics.sum(attr, dataset=None, drop_nulls=False)¶

Analysis: Compute the sum of a numeric attribute.

This analysis can be executed using the run method to compute the approximate mean of the attribute.

The user can request additional information about the computation with run(rich_result=True). In this case, the RandomizationInterval object will be generated, likely including the precise value of the computation on the data sample.

Note: If the attribute is nullable, setting drop_nulls=True is necessary for the computation to go through.

Parameters

attr (Union[Attribute, str]) – The attribute to compute the sum of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring NULL values.

Returns

Analysis object that can be executed using the run method.

Return type

ScalarAnalysisWithRI

leapyear.analytics.variance(attr, dataset=None, drop_nulls=False)¶

Analysis: Compute the variance of an attribute.

This analysis can be executed using the run method to compute the approximate mean of the attribute.

The user can request additional information about the computation with run(rich_result=True). In this case, the RandomizationInterval object will be generated, likely including the precise value of the computation on the data sample.

Note: If the attribute is nullable, setting drop_nulls=True is necessary for the computation to go through.

Parameters

attr (Union[Attribute, str]) – The attribute to compute the variance of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring NULL values.

Returns

Analysis object that can be executed using the run method.

Return type

ScalarAnalysisWithCI

leapyear.analytics.min(attr, dataset=None, drop_nulls=False)¶

Analysis: Compute the minimum value of an attribute.

Parameters

attr (Union[Attribute, str]) – The attribute to compute the min of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the min.

Return type

Note

The minimum reported is the 1/1000 quantile of the attribute.

When the attribute being analyzed has a very narrow range of possible values, the minimum returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than 0.01, the minimum returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by 10/width, compute the minimum, and rescale the returned value by width/10.

The result of this analysis may be very different from the true minimum of the data sample in the following two scenarios:

1. When the underlying attribute distribution has significant outliers (e.g. a very long tail) - this is because the minimum computed is the 1/1000 quantile of the attribute, and

2. When the public lower bound is very different from the true minimum of the data sample - this is because differential privacy is aiming to minimize the effect of individual records on the output.

leapyear.analytics.max(attr, dataset=None, drop_nulls=False)¶

Analysis: Compute the maximum value of an attribute.

Parameters

attr (Union[Attribute, str]) – The attribute to compute the max of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the max.

Return type

Note

The maximum reported is the 999/1000 quantile of the attribute.

When the attribute being analyzed has a very narrow range of possible values, the maximum returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than 0.01, the maximum returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by 10/width, compute the maximum, and rescale the returned value by width/10.

The result of this analysis may be very different from the true maximum of the data sample in the following two scenarios:

1. When the underlying attribute distribution has significant outliers (e.g. a very long tail) - this is because the maximum computed is the 999/1000 quantile of the attribute, and

2. When the public upper bound is very different from the true maximum of the data sample - this is because differential privacy is aiming to minimize the effect of individual records on the output.

leapyear.analytics.median(attr, dataset=None, drop_nulls=False)¶

Analysis: Compute the median value of an attribute.

Parameters

attr (Union[Attribute, str]) – The attribute to compute the median of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the median.

Return type

Note

When the attribute being analyzed has a very narrow range of possible values, the median returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than 0.01, the median returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by 10/width, compute the median, and rescale the returned value by width/10.

leapyear.analytics.quantile(q, attr, dataset=None, drop_nulls=False)¶

Analysis: Compute a certain quantile q of an attribute.

Parameters

q (float) – Quantile to compute, which must be between 0 and 1 inclusive.
attr (Union[Attribute, str]) – The attribute to compute the quantile of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the quantile.

Return type

Note

When the attribute being analyzed has a very narrow range of possible values, the quantile returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than 0.01, the quantile returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by 10/width, compute the quantile, and rescale the returned value by width/10.

leapyear.analytics.skewness(attr, dataset=None, drop_nulls=False)¶

Analysis: Compute the skewness of an attribute.

Parameters

attr (Union[Attribute, str]) – The attribute to compute the skewness of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the skewness.

Return type

leapyear.analytics.kurtosis(attr, dataset=None, drop_nulls=False)¶

Analysis: Compute the excess kurtosis of an attribute.

Parameters

attr (Union[Attribute, str]) – The attribute to compute the kurtosis of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the kurtosis.

Return type

leapyear.analytics.iqr(attr, dataset=None, drop_nulls=False)¶

Analysis: Compute the interquartile range of an attribute.

Parameters

attr (Union[Attribute, str]) – The attribute to compute the interquartile range of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the iqr.

Return type

leapyear.analytics.histogram(attr, dataset=None, bins=10, interval=None)¶

Analysis: Compute the histogram of the attribute in the dataset.

Parameters

attr (Union[Attribute, str]) – The attribute to compute the histogram of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string.
bins (int) – Number of bins between the bounds. (default=10)
interval (Optional[Tuple[float, float]]) – The lower and upper bound of the histogram. Defaults to attribute bounds if None.

Returns

Prepared analysis of the histogram.

Return type

leapyear.analytics.histogram2d(x_attr, y_attr, dataset=None, x_bins=10, y_bins=10, x_range=None, y_range=None)¶

Analysis: Compute the 2D histogram of two attributes in the dataset.

Parameters

x_attr (Union[Attribute, str]) – The attribute to use to compute the first dimension of the histogram. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
y_attr (Union[Attribute, str]) – The attribute to use to compute the first dimension of the histogram. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when x_attr or y_attr are strings.
x_bins (int) – Number of bins between the bounds in the first attribute.
y_bins (int) – Number of bins between the bounds in the second attribute.
x_range (Optional[Tuple[float, float]]) – The lower and upper bound of the first attribute for the histogram.
y_range (Optional[Tuple[float, float]]) – The lower and upper bound of the second attribute for the histogram.

Returns

Prepared analysis of the histogram.

Return type

leapyear.analytics.correlation_matrix(xs, dataset, *, center=True, scale=True, **kwargs)¶

Analysis: Compute the correlation matrix of the set of attributes.

NOTE: This analysis does not require run().

Parameters

xs (Sequence[str]) – A list of attribute names to compute correlation matrix for.
dataset (DataSet) – The DataSet containing these attributes.
center (bool) – Whether to center the columns before computing correlation matrix. If False, proceed assuming the columns are already centered.
scale (bool) – Whether to divide covariance matrix by number of rows. If False, do not divide.
max_timeout_sec – Specifies the maximum amount of time (in seconds) the user is willing to wait for a response. If set to None, this function will poll the server indefinitely. If it is run with scale or center set to True, the timeout will be multiplied. Defaults to waiting forever.

Returns

The correlation matrix.

Return type

np.ndarray

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.covariance_matrix(xs, dataset, *, center=True, scale=True, **kwargs)¶

Analysis: Compute the covariance matrix of the set of attributes.

NOTE: This analysis does not require run().

Parameters

xs (Sequence[str]) – A list of attribute names that are the features.
dataset (DataSet) – The DataSet of the attributes.
center (bool) – Whether to center the columns before compute the covariance matrix. If False, assume the columns are centered.
scale (bool) – Whether to divide the matrix by number of rows. If False, do not divide.
max_timeout_sec – Specifies the maximum amount of time (in seconds) the user is willing to wait for a response. If set to None, this function will poll the server indefinitely. If it is run with scale or center set to True, the timeout will be multiplied. Defaults to waiting forever.

Returns

The covariance matrix.

Return type

np.ndarray

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.describe(dataset, attributes=None)¶

Describe the columns of the dataset for use in data exploration.

The describe function provides a way for an analyst to perform initial rough data exploration on a dataset. To get more accurate statistics, the individual functions mean(), count(), et cetera, are recommended. This function does not use the analysis cache of the other statistics functions.

Numeric columns are described by their count, mean, standard deviation, minimum, maximum and the quartiles. Categorical columns (factors and booleans) are described by their count, distinct count and frequency of the most frequent element.

Parameters

dataset (DataSet) – The DataSet to be described
attributes (Union[None, Attribute, str, Sequence[Union[Attribute, str]]]) – The attributes to describe. If a value is not provided, or None, describe all attributes.

Returns

Prepared analysis for describing the dataset. Execute the analysis using the run() method.

Return type

DescribeAnalysis

leapyear.analytics.groupby_agg_view(dataset, attrs, agg_attr=None, agg_type=<GroupByAggType.COUNT: 1>, *, max_groupby_agg_keys=100000000, size_threshold=None, agg_attr_and_type=None)¶

Compute aggregate statistic within each group and output aggregate results.

Only groups with estimated size larger than minimum_dataset_size will be returned. This parameter can be set in run.

Parameters

dataset (DataSet) – The DataSet to perform groupby and aggregation on
attrs (Sequence[Union[Attribute, str]]) – List of attributes to perform groupby.
agg_attr (Optional[str]) – Compute aggregate statistics on this column within each group
agg_type (Union[GroupByAggType, str]) – Aggregate type. ‘count’, ‘mean’ or ‘sum’.
max_groupby_agg_keys (int) – This value prevents submitting computations that have a very large number of groupby keys. By default, it raises GroupbyAggTooManyKeysError if the number of groups exceeds 100000000.
size_threshold (Optional[int]) – Deprecated: see minimum_dataset_size in the run method.
agg_attr_and_type (Union[Tuple[Union[GroupByAggType, str], Optional[str]], Sequence[Tuple[Union[GroupByAggType, str], Optional[str]]], None]) – List of tuples (agg_type, agg_attr). Compute aggregate statistics defined by the agg_type on the column within each group. agg_type can be ‘count’, ‘mean’ or ‘sum’.

Returns

Analysis object that can be executed using run method to return aggregation results. The results can be accessed as a pandas dataframe using .to_dataframe().

Return type

GroupbyAggAnalysis

Note: privacy exposure estimate for this analysis is not supported.

Example

For each age group and gender, compute the mean income.

>>> groupby_agg_view(ds, ["AGE", "GENDER"], "INCOME", "mean").run(minimum_dataset_size=1000)

For each week, compute the mean and total transaction amount.

>>> groupby_agg_view(ds, ["WEEK"], agg_attr_and_type=[("mean", "AMOUNT"), ("sum", "AMOUNT")]).run()

Look at Randomization Intervals for each group (only for ‘count’ and ‘sum’).

>>> rr = groupby_agg_view(ds, ["WEEK"], "AMOUNT", "mean").run(rich_result=True)
>>> ri_dict = rr.metadata
>>> ri_dict
{
    (1, ): RandomizationInterval(...),
    (2, ): RandomizationInterval(...)
    ...
}
>>> ri_dict[(1, )]
RandomizationInterval(...)

Look at Randomization Interval for multiple aggregate results.

>>> rr = groupby_agg_view(ds, ["YEAR", "WEEK"], agg_attr_and_type=[("mean", "AMOUNT"), ("sum", "AMOUNT")]).run()
>>> ri_dict = rr.metadata
>>> ri_dict[(2020, 1)][0]
RandomizationInterval(...)

Machine Learning¶

Unsupervised learning¶

leapyear.analytics.kmeans(xs, dataset, n_iters=10, n_clusters=3)¶

Analysis: K-means clustering.

Identifies centers of clusters for a set of data points, by

Randomly initializing a chosen number of cluster centers (centroids) in the feature space

Associating each data point with the nearest centroid

Iteratively adjusting centroids to locations based on differentially private computation of the mean for each feature

Parameters

xs (List[str]) – A list of attribute names that are the features.
dataset (DataSet) – The DataSet of the attributes.
n_iters (int) – Number of iterations to run k-means for
n_clusters (int) – Number of clusters to generate

Returns

Analysis object that can be executed using the run() method. Once executed, it would output clustering analysis results, such as centroids.

Return type

ClusteringAnalysis

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.eval_kmeans(centroids, xs, dataset)¶

Analysis: Evaluate the K-means model.

Evaluate the clustering model by computing the Normalized Intra Cluster Variance (NICV).

Parameters

centroids (ClusterModel) – The model (generated using kmeans) to evaluate
xs (List[str]) – A list of attribute names that are the features.
dataset (DataSet) – The DataSet of the attributes.

Returns

Analysis representing evaluation of a clustering model. It can be executed using the run() method to output evaluation metric value.

Return type

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.pca(xs, dataset, **kwargs)¶

Principal Component Analysis.

Compute the Principal Component Analysis (PCA) of the set of attributes using a differentially private algorithm.

NOTE: This analysis does not require run().

Parameters

xs (List[str]) – A list of attribute names representing features to be considered for this analysis.
dataset (DataSet) – DataSet that includes these attributes.
max_timeout_sec – Specifies the maximum amount of time (in seconds) the user is willing to wait for a response. If set to None, this function will poll the server indefinitely. If it is run with scale or center set to True, the timeout will be multiplied. Defaults to waiting forever.

Return type

Tuple[ndarray, ndarray]

Returns

explained_variances – Variance explained by each of the principal components - in other words, variance of each principal component coordinate when considered as feature on the input dataset.
pca_matrix – Transformation matrix, that can be used to translate original features to principal component coordinates. If all principal components are included, this becomes a square matrix corresponding to orthogonal transformation (e.g. reflection).

This matrix can be used to generate principal component features using leapyear.dataset.DataSet.transform() operation, as in:

tfds = ds.transform(x_vars, pca_matrix, 'pca')

NOTE: Signs may not match PCA transformation matrix computed by scikit-learn.

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

Supervised learning¶

leapyear.analytics.basic_linreg(xs, y, dataset, *, affine=True, l1reg=0.0, l2reg=1.0, parameter_bounds=None)¶

Analysis: Linear regression.

Implements a differentially private algorithm to represent outcome (target) variable as a linear combination of selected features.

Note

To help ensure that the differentially private training process can effectively optimize regression coefficients, it’s important to re-scale features (both dependent/target and independent/explanatory) to a similar domain (e.g. [0,1]). This can be done using leapyear.feature.BoundsScaler and will help ensure that the domain being searched for the coefficient will include the optimal model. See LeapYear guides for this and other recommendations on training accurate regressions.

Note

Differentially private regressions aim to optimize models for predictive tasks, while protecting sensitive information from being learned from the trained model. They are not guaranteed to result in coefficients close to those that a non-differentially private algorithm would learn. In other words, while regressions trained with differential privacy may excel at predictive tasks, keep in mind that these were not designed for inference.

Please see the guides for more details on using regressions in LeapYear.

Parameters

xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
affine (bool) – If True, fit an intercept term.
l1reg (float) – The L1 regularization. Default value: 0.0.
l2reg (float) – The L2 regularization. Default value: 1.0. Must be at least 0.0001 to limit the randomization effect for models optimized via objective perturbation.
parameter_bounds (Optional[List[Tuple[float, float]]]) – Restriction on the model parameters, including the intercept. Required for differential privacy. Default value: -10.0 to 10.0 for each parameter

Returns

Analysis representing the regression problem. It can be executed using the run() method to output calibrated model.

Return type

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.generalized_linreg(xs, y, dataset, *, affine=True, l2reg=1.0, weight=None, offset=None, max_iters=25, family='gaussian', link='identity', link_power=0, variance_power=1)¶

Analysis: Generalized linear regression.

Implements a differentially private algorithm to represent outcome (target) variable as a linear combination of selected features. This computation is based on the iterative weighted least squares algorithm. Trains using the “glm” algorithm.

Available generalizations include:

offset of outputs based on pre-existing model - this enables modeling of residual

use of alternative link functions applied to the linear combination of features

application of regularization and weights during model optimization.

Note

Differentially private regressions aim to optimize models for predictive tasks, while protecting sensitive information from being learned from the trained model. They are not guaranteed to result in coefficients close to those that a non-differentially private algorithm would learn. In other words, while regressions trained with differential privacy may excel at predictive tasks, keep in mind that these were not designed for inference.

Please see the guides for more details on using regressions in LeapYear.

Parameters

xs (List[Union[Attribute, str]]) – A list of Attributes or attribute names to be used as features.
y (Union[Attribute, str]) – The Attribute or attribute name of the target.
dataset (DataSet) – The DataSet of the attributes.
affine (bool) – True if the algorithm should fit an intercept term.
l2reg (float) – The L2 regularization parameter. Must be non-negative.
weight (Union[Attribute, str, None]) – Optional column used to weight each sample. Implies generalized regression.
offset (Union[Attribute, str, None]) – Optional column for offset in offset regression. Implies generalized regression.
max_iters (Optional[int]) – Optional maximum number of iterations for fitting the regression. Note that regardless of this setting, the system would often stop before reaching max_iterations - e.g. after a single iteration. In such cases, higher value for max_iterations may lead to less privacy allocated to each iteration, and ultimately, higher randomization effect.
family (Optional[str]) – Optional distribution of the label. Implies generalized regression. Possible values here are ‘gaussian’ (the default), ‘poisson’, ‘gamma’ and ‘tweedie’.
link (Optional[str]) – Optional link function between mean of label distribution and prediction. Implies generalized regression. Possible values depend on family: ‘gaussian’ supports only ‘identity’ (default), ‘log’ and ‘inverse’; ‘poisson’ supports only ‘log’ (default), ‘identity’ and ‘sqrt’; ‘gamma’ supports only ‘inverse’ (default), ‘identity’ and ‘log’. There is no link function for the ‘tweedie’ family, use variance_power and link_power parameters instead.
link_power (Optional[int]) – For the ‘tweedie’ distribution only, the exponent of the link function. Default value is 0, which is equivalent to ‘identity’ link.
variance_power (Optional[int]) – For the ‘tweedie’ distribution only, the exponent of the variance. Default value is 1, which is equivalent to ‘gaussian’ family.

Returns

Analysis of the regression problem, which could be executed using run() function to output calibrated model.

Return type

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.logreg(xs, y, dataset, affine=True, l1reg=0.0, l2reg=1.0)¶

Analysis: Logistic regression.

Implements a differentially private algorithm to represent outcome (target) variable as a logit-transformation of a linear combination of selected features. Trains using the “basic” algorithm.

Available generalizations include

regularization applied during model optimization

Note

To help ensure that the differentially private training process can effectively optimize regression coefficients, it’s important to re-scale features (both dependent/target and independent/explanatory) to a similar domain (e.g. [0,1]). This can be done using leapyear.feature.BoundsScaler and will help ensure that the domain being searched for the coefficient will include the optimal model. See LeapYear guides for this and other recommendations on training accurate regressions.

Note

Differentially private regressions aim to optimize models for predictive tasks, while protecting sensitive information from being learned from the trained model. They are not guaranteed to result in coefficients close to those that a non-differentially private algorithm would learn. In other words, while regressions trained with differential privacy may excel at predictive tasks, keep in mind that these were not designed for inference.

Please see the guides for more details on using regressions in LeapYear.

Parameters

xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
affine (bool) – If True, fit an intercept term.
l1reg (float) – The L1 regularization. Default value: 0.0.
l2reg (float) – The L2 regularization. Default value: 1.0.

Returns

Analysis training the logistic regression model. It can be executed using the run() method to output the calibrated model.

Return type

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.generalized_logreg(xs, y, dataset, *, affine=True, l1reg=1.0, l2reg=1.0, weight=None, offset=None, max_iters=25, link='logit')¶

Analysis: Generalized logistic regression.

Implements a differentially private algorithm to represent the outcome (target) variable as a logit-transformation of a linear combination of selected features. This computation is based on the iterative weighted least squares algorithm. Trains using the “glm” algorithm.

Available generalizations include:

offset of outputs based on pre-existing model - this enables modeling of residual

use of alternative link functions applied to the linear combination of features

application of regularization and weights during model optimization.

Note

Differentially private regressions aim to optimize models for predictive tasks, while protecting sensitive information from being learned from the trained model. They are not guaranteed to result in coefficients close to those that a non-differentially private algorithm would learn. In other words, while regressions trained with differential privacy may excel at predictive tasks, keep in mind that these were not designed for inference.

Please see the guides for more details on using regressions in LeapYear.

Parameters

xs (List[Union[Attribute, str]]) – A list of Attributes or attribute names to be used as features.
y (Union[Attribute, str]) – The Attribute or attribute name of the target.
dataset (DataSet) – The DataSet of the attributes.
affine (bool) – True if the algorithm should fit an intercept term.
l2reg (float) – The L2 regularization parameter. Must be non-negative.
weight (Union[Attribute, str, None]) – Optional column used to weight each sample. Implies generalized regression.
offset (Union[Attribute, str, None]) – Optional column for offset in offset regression. Implies generalized regression.
max_iters (Optional[int]) – Optional maximum number of iterations for fitting the regression. Note that regardless of this setting, the system would often stop before reaching max_iterations - e.g. after a single iteration. In such cases, higher value for max_iterations may lead to less privacy allocated to each iteration, and ultimately, higher randomization effect.
link (Optional[str]) – Optional link function between mean of label distribution and prediction. Implies generalized regression. Possible values are ‘logit’ (default), ‘probit’ and ‘cloglog’.

Returns

Analysis of the regression problem, which could be executed using run() function to output calibrated model.

Return type

GradientBoostedTreeClassifierModelAnalysis

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.gradient_boosted_tree_classifier(xs, y, dataset, max_depth=3, max_iters=5, max_bins=32)¶

Analysis: Gradient boosted tree classifier.

This analysis trains a randomized variant of gradient boosted tree classifier to predict a BOOLEAN outcome (target).

The algorithm works by iteratively training individual decision trees to predict a “residual” of the model built so far, and then integrating each newly built decision tree into the ensemble model to better predict the probability of the positive label.

Weights are used at different stages:

during training of individual decision trees, to focus attention on the areas where the model consistently underperforms, and
when combining individual decision trees to predict probability of the positive label.

Calibrated level of randomization is applied to individual leaves of the decision trees to help protect privacy of the individual records used for model training.

Parameters

xs (List[Union[Attribute, str]]) – A list of attributes or attribute names that are used as explanatory features for the analysis. Each attribute must be either BOOL, INT, REAL or FACTOR. Nullable types are not supported and must be converted to non-nullable - e.g. via coalesce.
y (Union[Attribute, str]) – The attribute or attribute name that is used as an outcome (target) of the classification model. Must be BOOLEAN type, as only binary classification models are supported. Nullable types are not supported and must be converted to non-nullable - e.g. via coalesce.
dataset (DataSet) – The DataSet containing both explanatory features and outcome attributes.
max_depth (int) – The maximum depth (or height) of any tree in the ensemble produced by the algorithm. Default: 3
max_iters (int) – The maximum number of iterations of the algorithm. This corresponds to the maximum number of individual decision trees in the ensemble. Default: 5
max_bins (int) –
The maximum number of bins for features used in constructing trees. Default: 32

Note

Maximum number of bins should be set to no less than the number of distinct possible values of the FACTOR attributes used as explanatory features.

Returns

Analysis that will train the gradient boosted tree classifier. It can be executed using the run() method.

Return type

See also

Gradient tree boosting

Unsupported Backends: Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.random_forest(xs, y, dataset, n_trees=100, height=3)¶

Analysis: Random Forest Classifier.

Generate a random forest model to predict probability associated with each target class.

Random forests combine many decision trees in order to reduce the risk of overfitting.

Each decision tree is developed on a random subset of observations - and is limited to prescribed height.

Individual node split decisions are made to maximize split value (or gain) - with a variation that a differentially private algorithm is used to count the number of observations belonging to each target class on both sides of the split.

Specifically, split value (or gain) is defined as reduction in combined Gini impurity measure, associated with introducing the split for a given parent node. Here

Gini impurity for any given node (parent or child) is calculated based on distribution of observations within the node across different outcome (target) classes

To compute combined impurity of the pair of nodes, individual node impurities for the two children nodes are averaged proportionately to their share of observations

Categorical features are typically handled by evaluating various splits corresponding to random subsets of the available categories.

Parameters

xs (List[Union[Attribute, str]]) – A list of attribute names that are used as features for explanatory analysis.
y (Union[Attribute, str]) – The attribute name that is the outcome (target).
dataset (DataSet) – The DataSet containing both explanatory features and outcome attributes.
n_trees (int) – The number of trees to use in the random forest. Default: 100
height (int) – The maximum height of the trees. Default: 3

Returns

Analysis training the random forest model. It can be executed using the run() method to output the analysis results which include the calibrated random forest model, feature importance statistics, etc.

Return type

ForestModelClassifierAnalysis

See also

Gini impurity

Unsupported Backends: Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.regression_trees(xs, y, dataset, n_trees=100, height=3)¶

Analysis: Random Forest Regressor (regression trees).

Generate a regression trees model to predict value of target variable.

Regression trees are built similarly to random forests, but instead of predicting the probability that the target variable takes a certain categorical value (i.e., classification), they predict a real value of the target variable (i.e., regression).

The impurity metric in this case is the variance of the target variable for the datapoints that fall into the current node’s partition.

Parameters

xs (List[Union[Attribute, str]]) – A list of attribute names that are used as features for explanatory analysis.
y (Union[Attribute, str]) – The attribute name that is the outcome (target).
dataset (DataSet) – The DataSet containing both explanatory features and target attribute.
n_trees (int) – The number of trees to use in the random forest. Default: 100.
height (int) – The maximum height of the trees. Default: 3.

Returns

Analysis training the regression trees model. It can be executed using the run() method to output the analysis results which include the calibrated random forest model, feature importance statistics, etc.

Return type

ForestModelRegressionAnalysis

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.eval_linreg(glm, xs, y, dataset, metric='mse')¶

Analysis: Evaluate a linear regression model.

Parameters

glm (GLM) – The model (generated using linreg) to evaluate
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
metric (Union[str, Metric]) –
Linear regression evaluation metric: ‘mse’/’mean_squared_error’ or ‘mae’/’mean_absolute_error’.

Note

During the calculation of mse and mae metrics, the individual values of absolute error are restricted to be no greater than the length of the interval of possible values of the target attribute, as seen in the dataset schema. For example, if the target attribute contains values in the interval [-50, 50], then the absolute error of any individual prediction used in computing the mean will be no greater than 100.

Returns

Analysis representing evaluation of a regression model. It can be executed using the run() method to output evaluation metric value.

Return type

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.eval_logreg(glm, xs, y, dataset, metric='accuracy')¶

Analysis: Evaluate a logistic regression model.

Parameters

glm (GLM) – The model (generated using logreg) to evaluate
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
metric (Union[str, Metric]) – Logistic regression evaluation metric. Examples: ‘accuracy’, ‘logloss’, ‘auroc’, ‘aupr’.

Returns

Analysis representing evaluation of a logistic regression model. It can be executed using the run() method to output evaluation metric value.

Return type

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.eval_gbt_classifier(gbt, xs, y, dataset, metric='accuracy')¶

Analysis: Evaluate a gradient boosted tree (GBT) classifier model.

Parameters

gbt (GradientBoostedTreeClassifier) – The model to evaluate.
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
metric (Union[str, Metric]) – GBT evaluation metric. Currently only supports ‘accuracy’.

Returns

Analysis representing evaluation of a GBT classifier model. It can be executed using the run() method to output the value of the evaluation metric.

Return type

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.eval_random_forest(rf, xs, y, dataset, metric='accuracy')¶

Analysis: Evaluate a random forest model.

Parameters

rf (RandomForestClassifier) – The model (generated using random_forest) to evaluate
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
metric (Union[str, Metric]) – Forest evaluation metric. Examples: ‘logloss’, ‘accuracy’, ‘auroc’, ‘aupr’

Returns

Analysis representing evaluation of a random forest model. It can be executed using the run() method to output evaluation metric value.

Return type

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.eval_regression_trees(rf, xs, y, dataset, metric='mse')¶

Analysis: Evaluate a regression trees model.

Parameters

rf (RandomForestClassifier) – The model (generated using regression_trees) to evaluate
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
metric (Union[str, Metric]) –
Model evaluation metric. Examples: ‘mse’/’mean_squared_error’ or ‘mae’/’mean_absolute_error’.

Note

During the calculation of mse and mae metrics, the individual values of absolute error are restricted to be no greater than the length of the interval of possible values of the target attribute, as seen in the dataset schema. For example, if the target attribute contains values in the interval [-50, 50], then the absolute error of any individual prediction used in computing the mean will be no greater than 100.

Returns

Analysis representing evaluation of a regression trees model. It can be executed using the run() method to output evaluation metric value.

Return type

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.roc(model, xs, y, dataset, thresholds=5)¶

Compute the ConfusionCurves.

For each threshold value, compute the normalized confusion matrix using the model. The confusion matrix contains the true positive rate, the true negative rate, the false positive rate and the false negative rate.

Parameters

model (Union[GLM, RandomForestClassifier]) – The model to evaluate the confusion curves on.
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
thresholds (Union[int, Sequence[float]]) – If int, then generate approximately thresholds (rounded to the closest power of 2) number of thresholds using recursive medians. If a sequence of floats, then use the list as the thresholds.

Returns

Analysis of the confusion curve, which can be executed using the run() method to output various evaluation metrics.

Return type

ConfusionModelAnalysis

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.cross_val_score_linreg(xs, y, dataset, *, affine=True, l1reg=1.0, l2reg=1.0, cv=3, metric='mean_squared_error', parameter_bounds=None)¶

Analysis: Compute the linear regression cross validation score of the set of attributes.

Parameters

xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
affine (bool) – If True, fit an intercept term.
l1reg (float) – The L1 regularization. Default value: 1.0.
l2reg (float) – The L2 regularization. Default value: 1.0. Must be at least 0.0001 to limit the randomization effect.
cv (int) – Number of folds in k-fold cross validation.
metric (Union[str, Metric]) – The metric for evaluating the regression. Examples: ‘mae’, ‘mse’, ‘r2’.
parameter_bounds (Optional[List[Tuple[float, float]]]) – Restriction on the model parameters, including the intercept. Required for differential privacy. Default value: -10.0 to 10.0 for each parameter

Returns

Analysis of the cross-validation scores for the regression model. It can be executed using the run() method to generate cross-validation results.

Return type

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.cross_val_score_logreg(xs, y, dataset, cv=3, affine=True, l1reg=0.1, l2reg=0.1, metric='accuracy')¶

Analysis: Compute the logistic regression cross validation score of the set of attributes.

Parameters

xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
affine (bool) – If True, fit an intercept term.
l1reg (float) – The L1 regularization. Default value: 0.1.
l2reg (float) – The L2 regularization. Default value: 0.1. Must be at least 0.0001 to limit the randomization effect.
cv (int) – Number of folds in k-fold cross validation.
metric (Union[str, Metric]) – The metric for evaluating the logistic regression. Examples: ‘accuracy’, ‘logloss’, ‘auroc’, ‘aupr’.

Returns

Analysis of the cross-validation scores for the regression model. It can be executed using the run() method to generate cross-validation results.

Return type

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.cross_val_score_random_forest(xs, y, dataset, n_trees=100, height=3, cv=3, metric='mean_squared_error')¶

Analysis: Compute the random forest cross validation score of the set of attributes.

Parameters

xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
n_trees (int) – Number of trees.
height (int) – Maximum height of trees.
cv (int) – Number of folds in k-fold cross validation
metric (Union[str, Metric]) – The metric for evaluating the regression

Returns

Analysis of the cross-validation scores for the random forest model. It can be executed using the run() method to generate cross-validation results.

Return type

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.cross_val_score_regression_trees(xs, y, dataset, n_trees=100, height=3, cv=3, metric='mean_squared_error')¶

Analysis: Compute the regression trees cross validation score of the set of attributes.

Parameters

xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
n_trees (int) – Number of trees.
height (int) – Maximum height of trees.
cv (int) – Number of folds in k-fold cross validation
metric (Union[str, Metric]) – The metric for evaluating the regression

Returns

Analysis of the cross-validation scores for the regression trees model. It can be executed using the run() method to generate cross-validation results.

Return type

Unsupported Backends

Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.hyperopt_linreg(xs, y, dataset, *, cv, train_fraction, metric, n_iter=100, l1_bounds=(1e-10, 10000000000.0), l2_bounds=(1e-10, 10000000000.0), fit_intercept=None, parameter_bounds=None)¶

Analysis: Hyperparameter optimization for linear regression.

Calibrate a linear regression model by optimizing its cross-validation score with respect to model hyperparameters - L_1 and L_2 regularization parameters and presence of intercept.

See below for a pseudo-code of the algorithm:

Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = []
For 1..n_iter
    pick a set of hyperparameters (hp) to test based on cv_history.
    use hp to calibrate a model on each cross-validation set
    evaluate it on corresponding sample set-aside for cross-validation
    compute an average cv score and append it to cv_history.
Pick the hyper parameters with the best cv score.
Train a model using the complete ds_train_val data set.
Evaluate the model on the holdout data set.
Return resulting model and its performance on the holdout set.

Parameters

xs (List[Union[Attribute, str]]) – A list of attributes that are the features.
y (Union[Attribute, str]) – The target attribute.
dataset (DataSet) – The dataset containing the attributes.
cv (int) – The number of cross-validation steps to perform for each candidate set of hyperparameters.
train_fraction (float) – The fraction of the dataset to use set aside for model training and cross-validation – to be split further according to k-fold cross-validation strategy.
metric (Union[str, Metric]) – Model performance metric to optimize. Examples: ‘mean_squared_error’, ‘mean_absolute_error’, ‘r2’
n_iter (int) – The number of optimization steps. Default: 100
l1_bounds (Tuple[float, float]) – Lower and upper bounds for l1 regularization. Default: (1E-10, 1E10)
l2_bounds (Tuple[float, float]) – Lower and upper bounds for l2 regularization. Default: (1E-10, 1E10)
fit_intercept (Optional[bool]) – If None, search will consider both options.
parameter_bounds (Optional[List[Tuple[float, float]]]) – Restriction on the model parameters, including the intercept. Required for differential privacy. Default value: -10.0 to 10.0 for each parameter

Returns

Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including

model calibrated with recommended hyperparameters and

its performance on the holdout dataset.

Return type

Unsupported Backends: Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.hyperopt_logreg(xs, y, dataset, cv, train_fraction, metric, n_iter=100, l1_bounds=(1e-10, 10000000000.0), l2_bounds=(1e-10, 10000000000.0), fit_intercept=None)¶

Analysis: Hyperparameter optimization for logistic regression.

Calibrate a logistic regression model by optimizing its cross-validation score with respect to model hyperparameters - L_1 and L_2 regularization parameters and presence of intercept.

See below for a pseudo-code of the algorithm:

Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = []
For 1..n_iter
    pick a set of hyperparameters (hp) to test based on cv_history.
    use hp to calibrate a model on each cross-validation set
    evaluate it on corresponding sample set-aside for cross-validation
    compute an average cv score and append it to cv_history.
Pick the hyper parameters with the best cv score.
Train a model using the complete ds_train_val data set.
Evaluate the model on the holdout data set.
Return resulting model and its performance on the holdout set.

Parameters

xs (List[Union[Attribute, str]]) – A list of attributes that are the features.
y (Union[Attribute, str]) – The target attribute.
dataset (DataSet) – The dataset containing the attributes.
cv (int) – The number of cross-validation steps to perform for each candidate set of hyperparameters.
train_fraction (float) – The fraction of the dataset to set aside for model training and cross-validation – to be split further according to k-fold cross-validation strategy.
metric (Union[str, Metric]) – Model performance metric to optimize. Examples: ‘accuracy’, ‘logloss’, ‘auroc’, ‘aupr’.
n_iter (int) – The number of optimization steps. Default: 100
l1_bounds (Tuple[float, float]) – Lower and upper bounds for l1 regularization. Default: (1E-10, 1E10)
l2_bounds (Tuple[float, float]) – Lower and upper bounds for l2 regularization. Default: (1E-10, 1E10)
fit_intercept (Optional[bool]) – If None, search will consider both options.

Returns

Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including

model calibrated with recommended hyperparameters and

its performance on the holdout dataset.

Return type

Unsupported Backends: Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.hyperopt_rf(xs, y, dataset, cv, train_fraction, metric, n_iter=100, max_trees=1000, max_depth=20)¶

Analysis: Hyperparameter optimization for a random forest model.

Calibrate a random forest model by optimizing its cross-validation score with respect to model hyperparameters - number of trees and individual tree depth (or height) limit.

See below for a pseudo-code of the algorithm:

Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = []
For 1..n_iter
    pick a set of hyperparameters (hp) to test based on cv_history.
    use hp to calibrate a model on each cross-validation set
    evaluate it on corresponding sample set-aside for cross-validation
    compute an average cv score and append it to cv_history.
Pick the hyper parameters with the best cv score.
Train a model using the complete ds_train_val data set.
Evaluate the model on the holdout data set.
Return resulting model and its performance on the holdout set.

Parameters

xs (List[Union[Attribute, str]]) – A list of attributes that are the features.
y (Union[Attribute, str]) – The target attribute.
dataset (DataSet) – The dataset containing the attributes.
cv (int) – The number of cross-validation steps to perform for each candidate set of hyperparameters.
train_fraction (float) – The fraction of the dataset to set aside for model training and cross-validation – to be split further according to k-fold cross-validation strategy.
metric (Union[str, Metric]) – The metric to optimize. Examples: ‘accuracy’, ‘logloss’, ‘auroc’, ‘aupr’
n_iter (int) – The number of optimization steps. Default: 100
max_trees (int) – Maximum number of trees. Default: 1000
max_depth (int) – Maximum tree depth. Default: 20

Returns

Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including

model calibrated with recommended hyperparameters and

its performance on the holdout dataset.

Return type

Unsupported Backends: Not supported for the following LeapYear compute backend(s): snowflake.

leapyear.analytics.hyperopt_regression_trees(xs, y, dataset, cv, train_fraction, metric, n_iter=100, max_trees=1000, max_depth=20)¶

Analysis: Hyperparameter optimization for a regression trees model.

Calibrate a regression trees model by optimizing its cross-validation score with respect to model hyperparameters - number of trees and individual tree depth (or height) limit.

See below for a pseudo-code of the algorithm:

Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = []
For 1..n_iter
    pick a set of hyperparameters (hp) to test based on cv_history.
    use hp to calibrate a model on each cross-validation set
    evaluate it on corresponding sample set-aside for cross-validation
    compute an average cv score and append it to cv_history.
Pick the hyper parameters with the best cv score.
Train a model using the complete ds_train_val data set.
Evaluate the model on the holdout data set.
Return resulting model and its performance on the holdout set.

Parameters

xs (List[Union[Attribute, str]]) – A list of attributes that are the features.
y (Union[Attribute, str]) – The target attribute.
dataset (DataSet) – The dataset containing the attributes.
cv (int) – The number of cross-validation steps to perform for each candidate set of hyperparameters.
train_fraction (float) – The fraction of the dataset to set aside for model training and cross-validation – to be split further according to k-fold cross-validation strategy.
metric (Union[str, Metric]) – The metric to optimize. Examples: ‘mae’, ‘mse’, ‘r2’
n_iter (int) – The number of optimization steps. Default: 100
max_trees (int) – Maximum number of trees. Default: 1000
max_depth (int) – Maximum tree depth. Default: 20

Returns

Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including

model calibrated with recommended hyperparameters and

its performance on the holdout dataset.

Return type