LeapYear Python Client¶
Welcome to the Python API documentation for the LeapYear Secure ML module. This documentation is meant to provide a detailed reference for programmatically interacting with the LeapYear Core software.
Visit guides.leapyear.io to access higher-level material such as:
an executive-level introduction to LeapYear.
an architecture overview.
a step-by-step tutorial for building your first model in LeapYear.
a data science reference containing code snippets for data exploration, feature engineering, supervised and unsupervised learning.
best practices for differentially private analytics.
guides for admins and architects.
Table of Contents:¶
Getting Started¶
Connecting to LeapYear and Exploring¶
The first step to using LeapYear's data security platform for analysis is getting connected. To get started, we'll import the Client object from the leapyear Python library and connect to the LeapYear server using our user credentials.
Credentials used for this tutorial:
>>> import os
>>> url = 'http://localhost:{}'.format(os.environ.get('LY_PORT', 4401))
>>> username = 'tutorial_user'
>>> password = 'abcdefghiXYZ1!'
Import the Client object:
>>> from leapyear import Client
Create a connection:
>>> client = Client(url, username, password)
>>> client.connected
True
>>> client.close()
>>> client.connected
False
Alternatively, Client is also a context manager, so the connection is automatically closed at the end of a with block:
>>> with Client(url, username, password) as client:
... # carry out computations with connection to LeapYear
... client.connected
True
>>> client.connected
False
Databases, Tables and Columns¶
Once we’ve obtained a connection to LeapYear, we can look through the databases and tables that are available for data analysis:
>>> client = Client(url, username, password)
Examine databases available to the user:
>>> client.databases.keys()
dict_keys(['tutorial'])
>>> tutorial_db = client.databases['tutorial']
>>> tutorial_db
<Database tutorial>
Examine tables within the database tutorial:
>>> sorted(tutorial_db.tables.keys())
['classification',
'regression1',
'regression2',
'twoclass']
>>> example1 = tutorial_db.tables['regression1']
>>> example1
<Table tutorial.regression1>
Examine the columns on the table tutorial.regression1:
>>> example1.columns
{'x0': <TableColumn tutorial.regression1.x0: type='REAL' bounds=(-4.0, 4.0) nullable=False>,
'x1': <TableColumn tutorial.regression1.x1: type='REAL' bounds=(-4.0, 4.0) nullable=False>,
'x2': <TableColumn tutorial.regression1.x2: type='REAL' bounds=(-4.0, 4.0) nullable=False>,
'y': <TableColumn tutorial.regression1.y: type='REAL' bounds=(-400.0, 400.0) nullable=False>}
Column Types¶
TableColumn objects include their type, bounds, and nullability.
>>> col_x0 = example1.columns['x0']
>>> col_x0.type
<ColumnType.REAL: 'REAL'>
>>> col_x0.bounds
(-4.0, 4.0)
>>> col_x0.nullable
False
The possible types are: BOOL, INT, REAL, FACTOR, DATE, TEXT, and DATETIME.
INT, REAL, DATE, and DATETIME have publicly available bounds, representing the lower and upper limits of the data in the column. FACTOR also has bounds, representing the set of strings available in the column. BOOL and TEXT columns have no bounds.
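For instance, the metadata for all columns in a table can be reviewed in one pass. A minimal sketch based on the tutorial.regression1 table above (the exact printed form of ColumnType values may differ):
>>> for name, column in example1.columns.items():
...     print(name, column.type, column.bounds, column.nullable)
x0 ColumnType.REAL (-4.0, 4.0) False
x1 ColumnType.REAL (-4.0, 4.0) False
x2 ColumnType.REAL (-4.0, 4.0) False
y ColumnType.REAL (-400.0, 400.0) False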
The DataSet Class¶
Once we've established a connection to the LeapYear server using the Client class, we can import the DataSet class to access and analyze tables.
>>> from leapyear import DataSet
We can access tables either by using the client interface as above:
>>> ds_example1 = DataSet.from_table(example1)
or by directly referencing the table by name:
>>> ds_example1 = DataSet.from_table('tutorial.regression1')
The DataSet class is the primary way of interacting with data in the LeapYear system. A DataSet is associated with a collection of Attributes, which can be used to compute statistics. The DataSet class allows the user to manipulate and analyze the attributes of a data source using a variety of relational operations, such as column selection, row selection based on conditions, unions, joins, etc.
An instance of the Attribute class represents either an individual named column in the DataSet or a transformation of one or more such columns via supported operations.
Attributes also have types, which can be inspected in the same way as the types in a DataSet schema.
Attributes can be manipulated using most built-in Python operations, such as +, *, and abs.
>>> ds_example1.schema
Schema([('x0', AttributeType(name='REAL', nullable=False, domain=(-4, 4))),
('x1', AttributeType(name='REAL', nullable=False, domain=(-4, 4))),
('x2', AttributeType(name='REAL', nullable=False, domain=(-4, 4))),
('y', AttributeType(name='REAL', nullable=False, domain=(-400, 400)))])
>>> ds_example1.schema['x0']
AttributeType(name='REAL', nullable=False, domain=(-4, 4))
>>> ds_example1.schema['x0'].name
'REAL'
>>> ds_example1.schema['x0'].nullable
False
>>> ds_example1.schema['x0'].domain
(-4.0, 4.0)
>>> attr_x0 = ds_example1['x0']
>>> attr_x0
<Attribute: x0>
>>> attr_x0 + 4
<Attribute: x0 + 4>
>>> attr_x0.type
AttributeType(name='REAL', nullable=False, domain=(-4, 4))
>>> attr_x0.type.name
'REAL'
>>> attr_x0.type.nullable
False
>>> attr_x0.type.domain
(-4.0, 4.0)
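Since Attributes support standard Python arithmetic, derived attributes can be built by combining columns, with bounds propagated automatically. A small sketch (the derived bounds follow from the rules above; the exact repr may differ):
>>> attr_combo = abs(ds_example1['x1']) + 2 * ds_example1['x2']
>>> attr_combo.type.domain
(-8.0, 12.0)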
In the following example, we'll take a few attributes from the table tutorial.regression1, adding one to the x1 attribute and multiplying x2 by three. The bounds are altered to reflect the change.
>>> ds1 = ds_example1.map_attributes(
... {'x1': lambda att: att + 1.0, 'x2': lambda att: att * 3.0}
... )
>>> ds1.schema
Schema([('x0', AttributeType(name='REAL', nullable=False, domain=(-4, 4))),
('x1', AttributeType(name='REAL', nullable=False, domain=(-3, 5))),
('x2', AttributeType(name='REAL', nullable=False, domain=(-12, 12))),
('y', AttributeType(name='REAL', nullable=False, domain=(-400, 400)))])
We can filter a DataSet to examine subsets of the data, e.g. by applying predicates:
>>> ds2 = ds_example1.where(ds_example1['x1'] > 1)
>>> ds2.schema
Schema([('x0', AttributeType(name='REAL', nullable=False, domain=(-4, 4))),
('x1', AttributeType(name='REAL', nullable=False, domain=(1, 4))),
('x2', AttributeType(name='REAL', nullable=False, domain=(-4, 4))),
('y', AttributeType(name='REAL', nullable=False, domain=(-400, 400)))])
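Transformations compose: a filtered DataSet can be transformed again, and the schema tracks the combined effect. A sketch using the operations above:
>>> ds3 = ds_example1.where(ds_example1['x1'] > 1).map_attribute('x2', lambda att: att * 3.0)
>>> ds3.schema['x1']
AttributeType(name='REAL', nullable=False, domain=(1, 4))
>>> ds3.schema['x2']
AttributeType(name='REAL', nullable=False, domain=(-12, 12))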
Data Analysis¶
Statistics¶
The LeapYear system is designed to allow access to various statistical functions and to develop machine learning models based on the data in a DataSet. An analysis is not executed until its run() method is called. This allows inspection of the overall workflow and early reporting of errors. All analysis functions are located in the leapyear.analytics module.
>>> import leapyear.analytics as analytics
Many common statistical functions are available. Here is an example of obtaining simple statistics from the dataset:
>>> mean_analysis = analytics.mean('x0', ds_example1)
>>> mean_analysis.run()
0.039159280186637294
>>> variance_analysis = analytics.variance('x0', ds_example1)
>>> variance_analysis.run()
1.0477940098374177
>>> quantile_analysis = analytics.quantile(0.25, 'x0', ds_example1)
>>> quantile_analysis.run()
-0.6575000000000001
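Because an analysis is deferred until run() is called, the analysis object itself can be inspected first, e.g. to review the computation and its inputs before running it. A sketch (the exact textual representation may differ):
>>> mean_analysis = analytics.mean('x0', ds_example1)
>>> mean_analysis
computation: MEAN(x0)
attributes:
    x0: tutorial.regression1.x0 (-4 <= x <= 4)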
By combining statistics with the ability to transform and filter data, we can look at various statistics associated with subsets of the data:
>>> analytics.mean('x0', ds_example1).run()
0.039159280186637294
>>> ds2 = ds_example1.where(ds_example1['x1'] > 1)
>>> analytics.mean('x0', ds2).run()
0.14454229785771325
Machine Learning¶
The leapyear.analytics module also supports various machine learning (ML) models, including:
regression-based models (linear, logistic, generalized),
tree-based models (random forests for classification and regression tasks),
unsupervised models (e.g. K-means, PCA),
the ability to optimize model hyperparameters via search with cross-validation, and
the ability to evaluate model performance using a variety of common validation metrics.
In this section we will share some examples of the machine learning tools provided by the LeapYear system.
The Effect of L2 Regularization on Model Coefficients¶
The following example code demonstrates a well-known theoretical result from ML: as the L2 regularization parameter alpha increases, the coefficients of the model gradually approach zero. This is depicted in the graph generated below:
>>> import numpy as np
>>> n_alphas = 20
>>> alphas = np.logspace(-2, 2, n_alphas)
>>>
>>> # example3 has 0 and 1 in the y column. Here, we convert 1 to True and 0 to False
>>> ds_example3 = DataSet\
... .from_table('tutorial.classification')\
... .map_attribute('y', lambda att: att.decode({1: True}).coalesce(False))
>>>
>>> models = []
>>> for alpha in alphas:
... model = analytics.generalized_logreg(
... ['x0','x1','x2','x3','x4','x5','x6','x7','x8','x9'],
... 'y',
... ds_example3,
... affine=False,
... l1reg=0.001,
... l2reg=alpha
... ).run()
... models.append(model)
>>>
>>> coefs = np.array([np.append(m.coefficients, m.intercept) for m in models]).reshape((n_alphas,11))
Plotting the coefficients with respect to alpha values:
>>> import matplotlib.pyplot as plt
>>> plt.figure()
>>> plt.plot(alphas, coefs)
>>> plt.xscale('log')
>>> plt.xlabel('alpha')
>>> plt.ylabel('weights')
>>> plt.title('coefficients as a function of the regularization')
>>> plt.axis('tight')
>>> plt.show()

Training a Simple Logistic Regression Model¶
This example shows how to compute a logistic regression classifier and evaluate its performance using the receiver operating characteristic (ROC) curve.
>>> ds_train = ds_example3.split(0, [80, 20])
>>> ds_test = ds_example3.split(1, [80, 20])
>>> glm = analytics.generalized_logreg(['x1'], 'y', ds_train, affine=True, l1reg=0, l2reg=0.01).run()
>>> cc = analytics.roc(glm, ['x1'], 'y', ds_test, thresholds=32).run()
Plot the ROC and display the area under the ROC:
>>> plt.figure()
>>> plt.plot(cc.fpr, cc.tpr, label='ROC curve (area = %0.2f)' % cc.auc_roc)
>>> plt.plot([0, 1], [0, 1], 'k--')
>>> plt.xlabel('False Positive Rate')
>>> plt.ylabel('True Positive Rate')
>>> plt.title('Receiver operating characteristic example')
>>> plt.legend(loc="lower right")
>>> plt.show()
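The same pieces support a simple hyperparameter search over the training/test split above: train a candidate model per regularization value, evaluate each on the held-out split, and keep the best. A minimal sketch (the l2reg grid is illustrative):
>>> best_auc, best_l2 = None, None
>>> for l2 in [0.001, 0.01, 0.1, 1.0]:
...     candidate = analytics.generalized_logreg(['x1'], 'y', ds_train, affine=True, l1reg=0, l2reg=l2).run()
...     auc = analytics.roc(candidate, ['x1'], 'y', ds_test, thresholds=32).run().auc_roc
...     if best_auc is None or auc > best_auc:
...         best_auc, best_l2 = auc, l2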

Training a Random Forest¶
In this example we train a random forest classifier on a binary classification problem associated with two overlapping Gaussian distributions centered at (0,0) and (3,3). Points around (0,0) are labeled as the negative class, while points around (3,3) are labeled as the positive class.
>>> ds_example4 = DataSet.from_table('tutorial.twoclass')
>>> rf = analytics.random_forest(['x1', 'x2'], 'y', ds_example4, 100, 1).run()
>>> plot_colors = "br"
>>> plot_step = 0.1
>>>
>>> x_min, x_max = 1.5-8, 1.5+8
>>> y_min, y_max = 1.5-8, 1.5+8
>>> xx, yy = np.meshgrid(
... np.arange(x_min, x_max, plot_step),
... np.arange(y_min, y_max, plot_step)
... )
>>> Z = rf.predict(np.c_[xx.ravel(), yy.ravel()])
>>> Z = Z.reshape(xx.shape)
Plot the decision boundary:
>>> fig, ax = plt.subplots()
>>> plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
>>> # Draw circles centered at the gaussian distributions
>>> ax.add_artist(plt.Circle((0,0), 1.5, color='k', fill=False))
>>> ax.add_artist(plt.Circle((3,3), 1.5, color='k', fill=False))
>>> ax.text(3, 3, '+')
>>> ax.text(0, 0, '-')
>>> plt.xlabel('x')
>>> plt.ylabel('y')
>>> plt.title('Decision Boundary')

This concludes the user tutorial section, so we close the connection.
>>> client.close()
>>> client.connected
False
Management and Administration¶
Administration tasks use the Client class from the leapyear module and the admin classes from the leapyear.admin module. These classes provide APIs for various administrator tasks on the LeapYear system. All of the examples in this section require the appropriate permissions.
Managing the LeapYear Server¶
Management requires sufficient privileges. The examples below assume the lyadmin user is an administrator of the LeapYear deployment.
>>> client = Client(url, 'lyadmin', ROOT_PASSWORD)
>>> client.connected
True
User Management¶
User objects are the primary API for managing users. Below is an example of a user being created, their password being updated, and finally their account being disabled.
>>> # Create the user
>>> from leapyear.admin import User
>>> user = User('new_user', password)
>>> client.create(user)
>>> 'new_user' in client.users
True
>>>
>>> # Update the user's password
>>> new_password = '{}100'.format(password)
>>> user.update(password=new_password)
<User new_user>
>>>
>>> # Disable the user
>>> user.enabled
True
>>>
>>> user.enabled = False
>>> user.enabled
False
Database Management¶
Database objects are used to view and manipulate databases on the server.
>>> # create database
>>> from leapyear.admin import Database
>>> client.create(Database('sales'))
>>>
>>> # retrieve a reference to the database
>>> sales_database = client.databases['sales']
>>>
>>> # drop database
>>> client.drop(sales_database)
Table objects are used to view and manipulate tables in a database on the server. Below is an example of how to define a data source (table) object on the LeapYear server.
>>> credentials = 'hdfs:///path/to/data.parquet'
>>>
>>> # create a table
>>> from leapyear.admin import Database, Table
>>> accounts = Database('accounts')
>>> table = Table('users', credentials=credentials, database=accounts)
>>>
>>> client.create(accounts)
>>> client.create(table)
>>>
>>> # retrieve a reference to the table
>>> users_table = accounts.tables['users']
>>>
>>> # drop a table
>>> client.drop(users_table)
LeapYear Python API¶
Module leapyear¶
The LeapYear Client connects the Python API to the LeapYear server.
While connected to the server, the connection information is stored in a resource manager, so direct access to the client is not necessary for all operations. The main objects that can be directly accessed through the Client object are databases, users, and permissions resources. Many resources are cached to reduce network latency; however, the load() methods on resource objects will force a refresh of the metadata.
>>> from leapyear import Client
>>> from leapyear.admin import Database, Table, ColumnDefinition, User
Connecting to the server¶
Method 1 - manually open and close the connection.
>>> SERVER_URL = 'https://ly-server:4401'
>>> client = Client(url=SERVER_URL)
>>> print(client.databases)
{}
>>> client.close()
Method 2 - Use a context to automatically close the connection.
>>> with Client(url=SERVER_URL) as client:
... print(client.databases)
{}
Create a User¶
There are two ways to create an object on the LeapYear server. The first is to
invoke the Client.create()
method.
>>> with Client(url=SERVER_URL) as client:
... client.create(User('new_user', 'password'))
The second is to call the create()
method on the object.
>>> with Client(url=SERVER_URL) as client:
... User('new_user', 'password').create()
Create a Database¶
>>> with Client(url=SERVER_URL) as client:
... db = Database('db_name')
... client.create(db)
... print(db.tables)
{}
Create a Table¶
>>> TABLE_CREDENTIALS = 'hdfs://path/to/data.parq'
>>> columns = [
...     ColumnDefinition('x', type="REAL", bounds=(-3.0, 3.0)),
...     ColumnDefinition('y', type="BOOL"),
... ]
>>> with Client(url=SERVER_URL) as client:
... tbl = Table(
... 'tbl_name',
... columns=columns,
... credentials=TABLE_CREDENTIALS,
... database=db,
... )
... tbl.create()
The Client class¶
class leapyear.client.Client(url='http://localhost:4401', username=None, password=None, *, authenticate=<function ly_login>, default_analysis_caching=True, default_allow_max_budget_allocation=True, public_key_auth=False)¶ A class that wraps a connection to a LeapYear system.
__init__(url='http://localhost:4401', username=None, password=None, *, authenticate=<function ly_login>, default_analysis_caching=True, default_allow_max_budget_allocation=True, public_key_auth=False)¶ Initialize a Client object.
- Parameters
url (str) – The URL of LeapYear Core. Automatically set to the value of the LY_API_URL environment variable, or http://localhost:4401 if not specified.
username (Optional[str]) – Optional username to pass to the authenticate parameter.
password (Optional[str]) – Optional password to pass to the authenticate parameter.
authenticate (AuthenticateFn) – A callback that should log in the user and return a LeapYear token. By default, logs in with the username and password of a LeapYear user, if username and password are provided. Otherwise, uses the token in the LY_JWT environment variable.
default_analysis_caching (bool) – Whether to cache analysis results by default.
default_allow_max_budget_allocation (bool) – Whether to allow running analyses that will automatically consume near the maximum privacy exposure per computation. If disabled, computations on small data sets that would use the maximum privacy exposure are blocked by default. This behavior can be overridden at the computation level.
public_key_auth (bool) – Use public key auth to authenticate requests to the LeapYear API. Requires setup via the API setup instructions.
- Raises
AuthenticationError – when login fails.
InvalidURL – when the url parameter is formatted incorrectly.
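For example, analysis caching can be turned off for an entire session at construction time. A usage sketch based on the parameters above (URL and credentials are illustrative):
>>> client = Client(
...     url='https://ly-server:4401',
...     username='analyst1',
...     password='secret!1',
...     default_analysis_caching=False,
... )
>>> client.connected
True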
unpersist_all_relations()¶ Clear all cached datasets.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
property connected¶ Check if the client can successfully connect and authenticate to the LeapYear system.
- Return type
bool
property status¶ Get the status of the Client's connection to the LeapYear system.
- Return type
ClientStatus
create(obj, ignore_if_exists=False, drop_if_exists=False)¶ Create an object on the LeapYear server.
property privacy_profiles¶ Get all privacy profiles available on the server.
property current_user¶ Get the currently logged in user.
- Return type
User
Module leapyear.admin¶
Administrative objects for LeapYear.
Database class¶
class leapyear.admin.Database(name, *, description=None, privacy_profile=None, privacy_limit_window_days=None, db_id=None)¶ Database object.
classmethod all()¶ Get all databases.
- Returns
Iterator over all of the databases available on the server.
- Return type
Iterator[Database]
property views¶ Get the views of the database.
This property can only be seen by admins.
property privacy_params¶ Get the privacy parameters of the database.
- Return type
PrivacyProfileParams
set_privacy_profile(privacy_profile)¶ Set the database's privacy profile asynchronously.
- Parameters
privacy_profile (PrivacyProfile) – The new privacy profile.
Example
>>> db = c.databases["db1"]
>>> pp = c.privacy_profiles["Custom profile 1"]
>>> db.set_privacy_profile(pp)
get_privacy_limit()¶ Get the database's privacy limit.
- Return type
PrivacyLimit
set_privacy_limit_window(privacy_limit_window_days)¶ Set the database's privacy limit window (provided in days).
load()¶ Load the database.
create(*, ignore_if_exists=False)¶ Create the database.
- Return type
Database
drop(*, ignore_missing=False)¶ Drop the database.
get_access(subject=None)¶ Get the access level of the given subject.
set_access(subject, access)¶ Grant the given access level to a subject.
Table class¶
class leapyear.admin.Table(name, *, database, columns=None, credentials=None, description=None, public=None, table_id=None, watch_folder=None, **kwargs)¶ Table object.
__init__(name, *, database, columns=None, credentials=None, description=None, public=None, table_id=None, watch_folder=None, **kwargs)¶ Initialize a Table object.
- Parameters
name – The table name.
columns – The columns to create the Table with. If no columns provided, the schema will be auto detected from the data.
credentials – The credentials to the first data slice to be added to the Table.
description – The table’s description.
database – The database this table belongs to.
public – Whether this table should be a public table.
watch_folder – When True, the ‘credentials’ parameter should point to a directory of parquet files that will be watched for automatic data slice uploads. Only applicable for table creation.
property status_with_error¶ Get the status of the table and potentially the error information.
- Return type
TableStatus
property description¶ Get the table's description.
get_privacy_limit()¶ Show the privacy limit associated with the table.
Returns a value of None when the table is public.
set_privacy_limit(privacy_limit)¶ Set the privacy limit associated with the table.
Throws an error when the table is public.
property privacy_spent¶ Show the privacy spent for the current user on this table as a percentage.
Returns the privacy spent (𝜀) associated with all the information disclosed so far by the LeapYear platform to the current user working with this table. The value is represented as a percentage of the privacy limit (0, 10, 20, … 100) set by the administrator. The value can exceed 100% if the admin forcibly lowers the privacy limit below the current user's privacy spent. No queries can be run on a table where the privacy spent is at or above 100%.
If the table is public, returns None instead.
- Returns
Privacy exposure, expressed as a percentage of the limit.
Examples
Review the current level of privacy spent.
>>> from leapyear.admin import Database, Table
>>> db = Database('db')
>>> t = Table('table', database=db)
>>> print(t.privacy_spent)
50
get_user_privacy_spent(user)¶ Show the privacy spent for a user on this table.
Returns the privacy spent (𝜀), as a float, associated with all the information disclosed so far by the LeapYear platform to a user working with this table, and the privacy limit as an (𝜀, 𝛿) pair in a PrivacyLimit object.
Returns None instead if the table is public.
This method is only available to authorized administrators, or to a user attempting to retrieve their own privacy spent.
set_user_privacy_limit(user, privacy_limit)¶ Allow the administrator to set the privacy limit for a user on this table.
Sets the privacy limit, as an (𝜀, 𝛿) pair in a PrivacyLimit object, for the user on this table that is considered acceptable by the administrator. If this method is not called, the user uses the privacy limit from the table.
If this is called with a public table, nothing happens.
This method is only available to authorized users with system admin privileges.
load()¶ Load the table.
drop(*, ignore_missing=False)¶ Drop the table.
set_all_columns_access(subject, access)¶ Set the given access for all columns in the table.
If the table is public, the only legal access levels are full access and no access. Setting any other value will result in an error.
add_data_slice(*args, **kwargs)¶ Add a data slice like add_data_slice_async, except runs synchronously.
add_data_slice_async(file_credentials, *, update_column_bounds=False)¶ Add a file to the list of data slices of the table.
create(*, ignore_if_exists=False)¶ Create the object synchronously.
Functionally equivalent to .create_async().wait(max_timeout_sec=None).
- Return type
AsyncCreateable
ColumnDefinition class¶
class leapyear.admin.ColumnDefinition(name, *, type, bounds=None, nullable=False, description=None, infer_bounds=False)¶ The definition of a column for creating a Table with an explicit schema.
Example usage:
>>> table = Table(
...     columns=[ColumnDefinition("col1", type="INT", bounds=(0, 10))],
...     ...
... )
>>> table.create()
Changing values in a ColumnDefinition has no effect after a table is created. See the TableColumn documentation for functions to update column attributes after creating a table.
type¶
bounds¶
__new__(**kwargs)¶ Create and return a new object. See help(type) for accurate signature.
TableColumn class¶
class leapyear.admin.TableColumn(*, database, table, id, name, type, bounds, nullable, description)¶ A column in a table.
property type¶ Get the type of the column.
- Return type
ColumnType
property bounds¶ Get the bounds of the column.
update(**kwargs)¶ Update the Column's type, bounds, or nullable.
All of the parameters are optional. If anything is not provided, it's left unchanged.
- Parameters
type (Union[ColumnType, str]) –
bounds (ColumnBounds) –
nullable (bool) –
infer_bounds (bool) –
get_access(subject=None)¶ Get the access level of the given subject.
- Parameters
subject – A User or Group object. If none is provided, use the currently logged in user.
set_access(subject, access)¶ Grant the given access level to a subject.
If this is a column of a public table, only Full Access and No Access are legal values. Setting any other value will result in an error.
class leapyear.admin.ColumnType(value)¶ A column type.
BOOL = 'BOOL'¶ A BOOL column has no bounds.
INT = 'INT'¶ An INT column whose bounds should be an (int, int) pair.
REAL = 'REAL'¶ A REAL column whose bounds should be a (float, float) pair.
FACTOR = 'FACTOR'¶ A FACTOR column whose bounds should be a list of strings.
TEXT = 'TEXT'¶ A TEXT column has no bounds.
DATE = 'DATE'¶ A DATE column whose bounds should be a (date, date) pair, containing dates of the form 1970-01-31.
DATETIME = 'DATETIME'¶ A DATETIME column whose bounds should be a (datetime, datetime) pair, containing datetimes of the form 1970-01-31T00:00:00.
ID = 'ID'¶ An ID column has no bounds.
leapyear.admin.ColumnBounds¶ A type alias representing the union of all possible column bounds described in ColumnType.
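Putting the bounds conventions together, here is a sketch of explicit column definitions covering several types (names and bounds are illustrative; the DATE bounds are written as strings of the form shown above):
>>> from leapyear.admin import ColumnDefinition
>>> columns = [
...     ColumnDefinition('active', type='BOOL'),
...     ColumnDefinition('age', type='INT', bounds=(0, 120)),
...     ColumnDefinition('balance', type='REAL', bounds=(-1e6, 1e6)),
...     ColumnDefinition('state', type='FACTOR', bounds=['CA', 'NY', 'TX']),
...     ColumnDefinition('opened', type='DATE', bounds=('1970-01-01', '2020-12-31')),
... ]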
View class¶
class leapyear.admin.View(name, *, database, dataset, num_partitions=1, partitioning_columns=[], sort_within_partitions_by_columns=[], nominal_partitioning_columns=[], description=None, **kwargs)¶ View object.
A view is a dataset that can be persisted on disk (materialized) across restarts of the LeapYear application. Analysts referencing a materialized view will be using the dataset that is on disk, instead of re-calculating any transformations defined on the dataset.
A guide on how to use views can be found here.
Analysts should load views either from Database.views or using the database.view notation; for example:
>>> db = client.databases['db1']
>>> view1 = db.views['view1']
>>> ds1 = DataSet.from_view(view1)
>>> ds2 = DataSet.from_view('db1.view1')
- Parameters
name (str) – The view's name. Views must have unique names, including de-materialized views. View names cannot include any of these characters: ,;{}()=" , or newlines (\n), or tabs (\t).
database (Union[str, Database]) – The database that the view belongs to. This should be the database that the tables referenced in the DataSet belong to.
dataset (DataSet) – The DataSet that will be stored as a view.
num_partitions – The number of partitions that the view will be split into. This will only be used if partitioning_columns is also set.
partitioning_columns – The columns by which to bucket (cluster) the view into partitions. This must be used with num_partitions. The view will have num_partitions partitions, and records with the same values for the partitioning_columns will be in the same partition.
sort_within_partitions_by_columns – The columns used to sort rows within each partition.
nominal_partitioning_columns – The columns by which to partition the view. This should be used by itself, without any other partition parameters.
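A creation sketch, assuming an admin connection and an existing DataSet ds whose tables live in database 'db1' (names and partitioning choices are illustrative):
>>> from leapyear.admin import View
>>> view = View(
...     'monthly_rollup',
...     database='db1',
...     dataset=ds,
...     num_partitions=8,
...     partitioning_columns=['region'],
... )
>>> view.create()  # or view.create_async() to avoid blocking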
dematerialize()¶ Dematerialize the view.
This is the preferred method to free disk space used by a view.
load()¶ Load the view.
create_async()¶ Create the view asynchronously.
- Return type
AsyncCreateJob
drop(*, ignore_missing=False)¶ Drop (and unregister) the view.
Admins should NOT drop a view unless they wish to also discard the entries in the analysis cache associated with that view. Instead, admins should use the dematerialize method.
create(*, ignore_if_exists=False)¶ Create the object synchronously.
Functionally equivalent to .create_async().wait(max_timeout_sec=None).
- Return type
AsyncCreateable
User class¶
class leapyear.admin.User(username, password=None, *, is_root=None, enabled=None, user_id=None, subj_id=None)¶ User object.
property groups¶ Get the groups of a user.
- Returns
All groups of the user on the LeapYear server.
- Return type
List[Group]
load()¶ Load the information for the user.
- Return type
User
create(*, ignore_if_exists=False)¶ Create the user.
- Return type
User
update(*, password=None, enabled=None)¶ Update the user.
- Return type
User
Privacy Profile class¶
class leapyear.admin.PrivacyProfile(name, *, params=None, hidden=None, verified=None, description=None, profile_id=None)¶ PrivacyProfile object.
classmethod get_latest_verified()¶ Get the latest verified privacy profile.
- Return type
PrivacyProfile
property hidden¶ Get whether the profile is hidden in the Data Manager.
- Return type
bool
property params¶ Get the parameters of the privacy profile.
- Return type
PrivacyProfileParams
load()¶ Load the privacy profile.
create(*, ignore_if_exists=False)¶ Create the privacy profile.
- Return type
PrivacyProfile
update(params=None, hidden=None)¶ Update the privacy profile's params.
- Parameters
params – The parameters to be updated.
hidden – Whether or not the privacy profile should be hidden in Data Manager.
Permission objects¶
class leapyear.admin.DatabaseAccessType(value)¶ AccessType for Databases.
NO_ACCESS_TO_DB = 'NO_ACCESS_TO_DB'¶ Prevents a user from accessing the database.
SHOW_DATABASE = 'SHOW_DATABASE'¶ Allows a user to see this database and the tables it contains, including their public metadata.
ADMINISTER_DATABASE = 'ADMINISTER_DATABASE'¶ Allows a user to administer this database, e.g. add data sources and grant user access.
class leapyear.admin.ColumnAccessType(value)¶ AccessType for Columns.
NO_ACCESS = 'NO_ACCESS'¶ Prevents a user from accessing the column.
COMPUTE = 'COMPUTE'¶ Allows a user to run randomized computations.
FULL_ACCESS = 'FULL_ACCESS'¶ Allows a user to run randomized computations and to view and retrieve raw data.
COMPARE = 'COMPARE'¶
Module leapyear.admin.grants¶
Convenience functions for retrieving grants on resources.
Access Summaries¶
class leapyear.admin.grants.DatabaseAccess(subject: Union[leapyear.admin.user.User, leapyear.admin.group.Group], database: leapyear.admin.database.Database, access: leapyear.admin.database.DatabaseAccessType)¶ Access between a subject and a database.
subject: Union[leapyear.admin.user.User, leapyear.admin.group.Group]¶ Subject
database: leapyear.admin.database.Database¶ Database
access: leapyear.admin.database.DatabaseAccessType¶ DatabaseAccessType
class leapyear.admin.grants.TableAccess(subject: Union[leapyear.admin.user.User, leapyear.admin.group.Group], table: leapyear.admin.table.Table, columns: Mapping[str, leapyear.admin.table.ColumnAccessType])¶ Access between a subject and a table.
subject: Union[leapyear.admin.user.User, leapyear.admin.group.Group]¶ Subject
table: leapyear.admin.table.Table¶ Table
Functions¶
leapyear.admin.grants.all_access_on_database(db)¶ Fetch the database access for all users or groups.
Examples
>>> db = client.databases['db']
>>> for subject, access_db, access in all_access_on_database(db):
...     assert isinstance(subject, User) or isinstance(subject, Group)
...     assert access_db == db
...     assert isinstance(access, DatabaseAccessType)
- Parameters
db (Database) –
- Return type
List[DatabaseAccess]
leapyear.admin.grants.all_access_on_table(table)¶ Fetch the column access for all columns in the given table for all users or groups.
Examples
>>> table = client.databases['db'].tables['table']
>>> for subject, access_table, columns in all_access_on_table(table):
...     assert isinstance(subject, User) or isinstance(subject, Group)
...     assert access_table == table
...     for column_name, column_access in columns.items():
...         assert isinstance(column_name, str)
...         assert isinstance(column_access, ColumnAccessType)
- Parameters
table (Table) –
- Return type
List[TableAccess]
leapyear.admin.grants.all_database_accesses_for_subject(subject)¶ Fetch the access of all databases for the given subject.
Examples
>>> user = client.users['user']
>>> for subject, db, access in all_database_accesses_for_subject(user):
...     assert subject == user
...     assert isinstance(db, Database)
...     assert isinstance(access, DatabaseAccessType)
- Parameters
subject (Union[User, Group]) –
- Return type
List[DatabaseAccess]
Module leapyear.jobs¶
Classes for managing asynchronous jobs in LeapYear Core.
Inspecting Job status¶
class leapyear.jobs.AsyncJob(async_job_id, *, on_success=None)¶ A helper for polling the result of an asynchronous job.
Represents an accessor for an asynchronous job on the LeapYear server. It contains a unique job_id for a specific job and methods to check the results of or cancel the job at a later time.
class leapyear.jobs.AsyncJobStatus(status: AsyncJobState, result: Optional[ResultOrError], start_time: datetime, end_time: Optional[datetime])¶ Result of checking the status of an asynchronous job.
status¶ The current status of the job.
- Return type
AsyncJobState
result¶ None if the job is still running, otherwise either the JobResult or an APIError error.
- Return type
Optional[ResultOrError]
start_time¶ The time the job was started.
- Return type
datetime
end_time¶ The time the job finished, or None if the job is still running.
- Return type
Optional[datetime]
Module leapyear.dataset¶
DataSet and Attribute.
DataSets combine a data source (Table) with data transformations, on which subsequent computations can be performed. A data set has a schema which describes the types of attributes (columns) in the data source. Attributes can be manipulated as if they were built-in python types (int, float, bool, …). DataSet also provides the following lazy methods for transforming data:
project(): select one or more attributes (columns) from the data source.
where(): select rows that satisfy a filter condition.
union(): combine two data sets with matching schemas.
split(): create subsets whose size is a fraction of the total data size.
splits(): yield all fractional partitions of the data set.
stratified(): create subsets of the data with fixed attribute prevalence.
group_by(): create aggregated views of the data set.
join(): combine rows of two datasets where certain data elements match.
transform(): apply a linear transformation to the specified data elements.
Further details for each transformation can be found in its respective documentation below. Transformations are lazy in the sense that they are not evaluated until a computation is executed; however, each transformation requires the DataSet schema to be re-evaluated, which relies on a live connection to the LeapYear server.
Computations and machine learning analytics that process the data set are found in leapyear.analytics.
The examples below rely on a connection to the LeapYear server, which can be established as follows:
>>> from leapyear import Client
>>> client = Client(url='http://ly-server:4401', username='admin', password='password')
Examples
Load a dataset and examine its schema:
>>> pds = DataSet.from_table('db.table')
>>> pds.schema
OrderedDict([
('attr1', Type(tag=BOOL, contents=())),
('attr2', Type(tag=INT, contents=(0, 20))),
('attr3', Type(tag=REAL, contents=(-1, 1))),
])
>>> pds.attributes
['attr1', 'attr2', 'attr3']
Create a new attribute and find its type:
>>> pds['attr2_gt_10'] = pds['attr2'] > 10
>>> pds.attributes
['attr1', 'attr2', 'attr3', 'attr2_gt_10']
>>> pds.schema['attr2_gt_10']
Type(tag=BOOL, contents=())
Select instances using a predicate:
>>> pds_positive_attr3 = pds.where(pds['attr3'] > 0)
>>> pds_positive_attr3.schema['attr3']
Type(tag=REAL, contents=(0, 1))
Calculate the mean of a single attribute:
>>> import leapyear.analytics as analytics
>>> mean_analysis = analytics.mean(pds_positive_attr3['attr3'])
>>> mean_analysis
computation: MEAN(attr3 > 0)
attributes:
attr3: db.table.attr3 (0 <= x <= 1)
>>> mean_analysis.run() # run the computation on the LeapYear server.
0.02
DataSet class¶
class leapyear.dataset.DataSet(relation)¶ DataSet object.
property relation¶ Get the DataSet relation.
- Return type
Relation
property schema¶ Get the DataSet schema, including data types and allowed values.
- Return type
Schema
classmethod from_view(view)¶ Create a dataset from a view.
classmethod from_table(table, *, slices=None, all_slices=False)¶ Create a dataset from a table.
- Parameters
table – The table or the name of the table on the LeapYear server.
slices – A list of ranges of table slices to use.
all_slices – Set to True to use all slices in the Table at the time this is run. Note that if a new slice has been added, this will create a DataSet with the new slice, causing analyses to miss the analysis cache when rerunning.
- Returns
The dataset with a LeapYear table as its source.
get_attribute(key)¶ Select one attribute from the data set.
drop_attributes(keys)¶ Drop attributes.
drop_attribute(key)¶ Drop an attribute.
with_attributes(name_attrs)¶ Return a new DataSet with additional attributes.
with_attribute(name, attr)¶ Return a new DataSet with an additional attribute.
with_attributes_renamed(rename)¶ Rename attributes using a mapping.
- Parameters
rename (Mapping[str, str]) – Dictionary mapping old names to new names.
- Returns
New DataSet with renamed attributes.
Examples
1. Create a new dataset ds2 with the columns 'zip_code' and 'name' renamed to 'ZipCode' and 'Name' respectively. Similarly, rename attributes in another dataset ds3. Finally, these datasets can be merged using union:
>>> ds2 = ds1.with_attributes_renamed({'zip_code': 'ZipCode', 'name': 'Name'})
>>> ds4 = ds3.with_attributes_renamed({'zipcode': 'ZipCode', 'name_str': 'Name'})
>>> ds4 = ds4.union(ds2)
with_attribute_renamed(old_name, new_name)¶ Rename an attribute.
map_attributes(name_lambdas)¶ Map attributes and create a new DataSet.
map_attribute(name, func)¶ Map an attribute and create a new DataSet.
project(new_keys)¶ Select multiple attributes from the data set.
Projection (π): Creates a new DataSet with only selected attributes from this DataSet.
select(*attrs)¶ Create a new DataSet by selecting column names or Attributes.
select_as(mapping)¶ Create a new DataSet by mapping from column names or Attributes to new names.
where(clause)¶ Select/filter rows using a filter expression.
Selection (σ): LeapYear's where clause creates a new DataSet based on the filter condition. Its schema may include a smaller domain of possible values.
- Parameters
clause (Attribute) – A filter Attribute: a single attribute of type BOOL.
- Returns
A filtered DataSet.
union(other, distinct=False)¶ Union or concatenate datasets.
Union (∪): Concatenates data sets with matching schemas.
- Parameters
other (DataSet) – The dataset to union with. Note: the schema of other must match that of self, including the order of the attributes. The order of attributes can be aligned like so:
>>> ds2 = ds2[list(ds1.schema.keys())]
>>> ds_union = ds1.union(ds2)
distinct (bool) – Remove duplicate rows.
- Returns
A combined dataset.
join(other, on, right_on=None, join_type='inner', unique_left_keys=False, unique_right_keys=False, left_suffix='', right_suffix='', **kwargs)¶ Combine two DataSets by joining on a key, as in a SQL JOIN statement.
If right_k (or left_k) is specified, the right (or left) data set will include no more than k rows for each matching row in the left (or right) data set.
- Parameters
other (DataSet) – The other DataSet to join with.
on (Union[str, List[str]]) – The key(s) to join on. If using suffixes, the suffix must NOT be appended by the caller.
right_on (Union[str, List[str], None]) – The key(s) on the right table, if different than those in on. None if both tables have the same key names. If using suffixes, the suffix must NOT be appended by the caller.
join_type (str) – LeapYear supports the joins listed on the left below; any join_type value from the right will use that particular join. If no value is specified, an inner join will be used. A left_outer_public join can be run only on a right public table.
Inner – "inner" (default)
Outer – "outer", "full", "full_outer", or "fullouter"
Left – "left", "left_outer", or "leftouter"
Right – "right", "right_outer", or "rightouter"
Left Antijoin – "left_anti" or "leftanti"
Left Semijoin – "left_semi" or "leftsemi"
Left Outer Public – "left_outer_public"
unique_left_keys (bool) – If the left DataSet is known to have unique keys, setting this to True will run an optimized join algorithm. Warning: setting this to True if the keys are not unique will cause data loss!
unique_right_keys (bool) – If the right DataSet is known to have unique keys, setting this to True will run an optimized join algorithm. Warning: setting this to True if the keys are not unique will cause some data rows to not show up in the output DataSet!
left_suffix (str) – Optional suffix to append to the column names of the left DataSet.
right_suffix (str) – Optional suffix to append to the column names of the right DataSet.
cache (kwargs) – Optional StorageLevel for persisting intermediate datasets for performance enhancement. The meaning of these values is documented in the Spark RDD programming guide.
- Returns
The joined DataSet.
Examples
1. Joining two datasets on a common key:
>>> ds1 = example_ds1.project(['key', 'col1'])
>>> ds2 = example_ds2.project(['key', 'col2'])
>>> ds = ds1.join(ds2, 'key')
>>> list(ds.schema.keys())
['key', 'col1', 'col2']
2. Joining two datasets on a single key but with a different name on the right:
>>> ds1 = example_ds1.project(['key1', 'col1'])
>>> ds2 = example_ds2.project(['key2', 'col2'])
>>> ds = ds1.join(ds2, 'key1', right_on='key2')
>>> list(ds.schema.keys())
['key1', 'col1', 'key2', 'col2']
3. Joining when a column is duplicated is an error:
>>> ds1 = example_ds1.project(['key', 'col1', 'col3'])
>>> ds2 = example_ds2.project(['key', 'col2', 'col3'])
>>> ds = ds1.join(ds2, 'key')
APIError: Invalid schema: repeated aliases: col3
4. Joining when the key is missing from one relation fails, because in this case a right key has to be specified:
>>> ds1 = example_ds1.project(['key1', 'col1'])
>>> ds2 = example_ds2.project(['key2', 'col2'])
>>> ds = ds1.join(ds2, 'key1')
APIError: Error parsing scope, missing variable declaration for `key1`
5. Joining with multiple keys:
>>> ds1 = example_ds1.project(['key1_1', 'key2_1', 'col1'])
>>> ds2 = example_ds2.project(['key1_2', 'key2_2', 'col2'])
>>> ds = ds1.join(ds2, ['key1_1', 'key2_1'], right_on=['key1_2', 'key2_2'])
>>> list(ds.schema.keys())
['key1_1', 'key2_1', 'col1', 'key1_2', 'key2_2', 'col2']
6. Joining with a different number of keys results in an error:
>>> ds1 = example_ds1.project(['key1_1', 'key2_1', 'col1'])
>>> ds2 = example_ds2.project(['key1_2', 'key2_2', 'col2'])
>>> ds = ds1.join(ds2, ['key1_1', 'key2_1'], right_on='key1_2')
APIError: Invalid schema: join key length mismatch ...
7. Joining with specifying that the keys are unique:
>>> ds1 = example_ds1.project(['key1', 'col1'])
>>> ds2 = example_ds2.project(['key1', 'col2'])
>>> ds = ds1.join(ds2, 'key1', unique_left_keys=True, unique_right_keys=True)
>>> list(ds.schema.keys())
['key1', 'col1', 'col2']
8. Different join types:
>>> ds1 = example_ds1.project(['key1', 'col1'])
>>> ds2 = example_ds2.project(['key1', 'col2'])
>>> ds = ds1.join(ds2, 'key1', join_type='outer')
>>> list(ds.schema.keys())
['key1', 'col1', 'col2']
9. Left semi join:
>>> ds1 = example_ds1.project(['key1', 'col1'])
>>> ds2 = example_ds2.project(['key1', 'col2'])
>>> ds = ds1.join(ds2, 'key1', join_type='left_semi')
>>> list(ds.schema.keys())
['key1', 'col1']
10. Joining when the nullability of keys is different:
>>> ds1 = example_ds1.select([col('key1').decode({0: 0}).alias('nkey1'), 'col1'])
>>> ds2 = example_ds2.project(['key2', 'col2'])
>>> ds = ds1.join(ds2, 'nkey1', right_on='key2')
>>> list(ds.schema.keys())
['nkey1', 'col1', 'key2', 'col2']
11. Joining when keys have different but coercible types:
>>> realKey1 = col('intKey1').as_real().alias('realKey1')
>>> ds1 = example_ds1.select([realKey1, 'col1'])
>>> ds2 = example_ds2.project(['intKey2', 'col2'])
>>> ds = ds1.join(ds2, 'realKey1', right_on='intKey2')
>>> list(ds.schema.keys())
['intKey1', 'col1', 'intKey2', 'col2']
12. Joining when keys are factors (upcasted to a common type):
>>> keyFactor1 = col('key1').decode({k: 'A' for k in range(10)}).as_factor().alias('keyFactor1')
>>> keyFactor2 = col('key2').decode({k: 'B' for k in range(10)}).as_factor().alias('keyFactor2')
>>> ds1 = example_ds1.select([keyFactor1, 'col1'])
>>> ds2 = example_ds2.select([keyFactor2, 'col2'])
>>> ds = ds1.join(ds2, 'keyFactor1', right_on='keyFactor2')
>>> list(ds.schema.keys())
['key1', 'col1', 'key2', 'col2']
13. Joining when the keys have mismatched types is an error (e.g. factor and bool):
>>> keyFactor1 = col('key1').as_factor().alias('keyFactor1')
>>> ds1 = example_ds1.project(['key1', 'col1'])
>>> ds2 = example_ds2.project(['key2', 'col2'])
>>> ds = ds1.join(ds2, 'keyFactor1', right_on='key2')
APIError: Invalid schema: join column type mismatch ...
14. Joining with suffixes to disambiguate column names:
>>> ds1 = example_ds1.project(['key', 'col1'])
>>> ds2 = example_ds2.project(['key', 'col2'])
>>> ds = ds1.join(ds2, 'key', left_suffix='_l', right_suffix='_r')
>>> list(ds.schema.keys())
['key_l', 'col1_l', 'key_r', 'col2_r']
15. Joining with only left suffixes to disambiguate column names:
>>> ds1 = example_ds1.project(['key', 'col1'])
>>> ds2 = example_ds2.project(['key', 'col2'])
>>> ds = ds1.join(ds2, 'key', left_suffix='_l')
>>> list(ds.schema.keys())
['key_l', 'col1_l', 'key', 'col2']
16. Joining with only right suffixes to disambiguate column names:
>>> ds1 = example_ds1.project(['key', 'col1'])
>>> ds2 = example_ds2.project(['key', 'col2'])
>>> ds = ds1.join(ds2, 'key', right_suffix='_r')
>>> list(ds.schema.keys())
['key', 'col1', 'key_r', 'col2_r']
17. Joining with a specific cache level for intermediate caches:
>>> ds = ds1.join(ds2, on="key", cache=StorageLevel.DISK_ONLY, n_partitions=1)
- Unsupported Backends
Conditional support for the following LeapYear compute backend(s): snowflake — All functionality available except for the ‘cache’ keyword arg.
prepare_join(join_on, k, n_partitions)¶ Prepare DataSet for join.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
DataSet
unpersist_join_cache(other, on, right_on=None, join_type='inner', unique_left_keys=False, unique_right_keys=False, left_suffix='', right_suffix='', **kwargs)¶ Unpersist intermediate caches used by .join().
NOTE: call this with the same arguments as the original join() call; the parameters are interpreted exactly as documented for join() above.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
DataSet
group_by(*grouping)¶ Aggregate data by one or more categorical columns.
- Parameters
grouping (Union[str, Attribute]) – The attribute or attributes to group by, separated by commas.
- Returns
A new GroupedData object with groupings as specified. It can be used with the agg function to create a DataSet with derived aggregate attributes; see the examples below.
Examples
1. Group by multiple columns ('col1' and 'col2') in Dataset ds and aggregate 'col3' and 'col4':
>>> ds_group = ds.group_by('col1', 'col2').agg((['col3'], 'count'), (['col4'], 'count'))
2. Group by a single column 'col1' in a Dataset and compute aggregates of 'col3' and 'col4':
>>> ds_group = ds.group_by('col1').agg((['col3'], 'max'), (['col4'], 'mean'))
split(index, proportions, complement=False)¶ Split and select one partition of the dataset.
Selection (σ): Splits are specified by proportions.
- Parameters
index (int) – The split number.
proportions (List[int]) – The proportions to split the dataset by, represented as a list of integers. For example, [1, 1, 2] will split into 3 datasets with 1/4, 1/4, and 1/2 of the data in each, respectively.
complement (bool) – If True, returns a DataSet which is the complement of the split (i.e. all rows not in the split). Default: False.
- Returns
The ith partition of the dataset.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
splits(proportions)¶ Split the dataset and return an iterator over the resulting partitions.
Selection (σ): Splits are specified by proportions.
- Parameters
proportions (List[int]) – The proportions to split the dataset by, represented as a list of integers. For example, [1, 1, 2] will split into 3 datasets with 1/4, 1/4, and 1/2 of the data in each, respectively.
- Returns
Iterator over the DataSet objects representing partitions.
- Return type
Iterator[DataSet]
Examples
Create an 80/20 split of a Dataset ds1:
>>> traintest, holdout = ds1.splits((8, 2))
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
stratified_split(index, proportions, stratified, complement=False)¶ Split by stratifying a categorical attribute.
Selection (σ): Each split will contain approximately the same proportion of values from each category. As an example, for Boolean stratification, each split will contain the same proportion of True/False values.
- Parameters
index (int) – The split number.
proportions (List[int]) – The proportions to split the dataset by, represented as a list of integers. For example, [1, 1, 2] will split into 3 datasets with 1/4, 1/4, and 1/2 of the data in each, respectively.
stratified (str) – The column to stratify against. Must be Boolean or Factor. Must not be nullable.
complement (bool) – If True, returns a DataSet which is the complement of the split (i.e. all rows not in the split). Default: False.
- Returns
The ith partition of the dataset.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
stratified_splits(proportions, stratified)¶ Split by stratifying a categorical attribute.
Selection (σ): For Boolean stratification, each split will maintain the same proportion of True/False values.
- Parameters
proportions (List[int]) – The proportions to split the dataset by, represented as a list of integers. For example, [1, 1, 2] will split into 3 datasets with 1/4, 1/4, and 1/2 of the data in each, respectively.
stratified (str) – The column to stratify against. Must be Boolean or Factor. Must not be nullable.
- Returns
Iterator over the DataSet objects representing partitions.
- Return type
Iterable[DataSet]
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
kfold(n_folds=3)¶ Split the dataset into train/test pairs using a k-fold strategy.
Each fold can then be used as a validation set once, while the k-1 remaining folds form the training set. If the dataset has size N, each (train, test) pair will be sized N*(k-1)/k and N/k, respectively.
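A sketch of k-fold evaluation using this method, training one model per fold and averaging a validation metric (the model and metric choices are illustrative):
>>> import leapyear.analytics as analytics
>>> aucs = []
>>> for ds_train, ds_test in ds.kfold(n_folds=3):
...     model = analytics.generalized_logreg(['x1'], 'y', ds_train, affine=True, l1reg=0, l2reg=0.01).run()
...     aucs.append(analytics.roc(model, ['x1'], 'y', ds_test, thresholds=32).run().auc_roc)
>>> sum(aucs) / len(aucs)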
stratified_kfold(stratified, n_folds=3)¶ Split the dataset into train/test pairs using k-fold stratified splits.
Each fold can then be used as a validation set once, while the k-1 remaining folds form the training set. If the dataset has size N, each (train, test) pair will be sized N*(k-1)/k and N/k, respectively.
- Returns
The iterator over the k pairs of (train, test) partitions.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
rows_async(*, limit=None)¶ Retrieve rows.
If the user has permission to do so, the function returns a generator of OrderedDict objects representing the attribute names and values from each row. The generator requires a connection to the server over its entire lifetime.
- Returns
The iterator over the rows of the input dataset, each row being represented as an OrderedDict object mapping attribute names to their values in this row.
- Return type
Iterator[Mapping[str, Value]]
rows(limit=None, max_timeout_sec=None)¶ Retrieve rows.
Same as rows_async(), except waits on the job.
rows_pandas(limit=None)¶ Retrieve rows as a pandas DataFrame.
head_pandas(n=10)¶ Retrieve rows, if the user has permission to do so; see rows_pandas().
- Parameters
n (int) – Maximum number of rows to output.
- Returns
The rows of the input dataset.
- Return type
DataFrame
example_rows()¶ Retrieve 10 rows of example data from the DataSet.
The returned data is based only on the public metadata. The generated data does not depend on the true data at all and should not be used for data inference.
The function requires an active connection to the LeapYear server.
Does not support TEXT attributes – consider dropping them using drop_attributes() before running example_rows().
- Returns
The iterator over the rows of the generated dataset; each row is represented as an OrderedDict object mapping attribute names to their generated values.
- Return type
Iterator[Mapping[str, Any]]
Example
>>> pds = DataSet.from_table('db.table')
>>> pds.example_rows()
To turn the data into a pandas DataFrame, use:
>>> import pandas
>>> df = pandas.DataFrame.from_dict(pds.example_rows())
Alternatively, use example_rows_pandas().
example_rows_pandas()¶ Retrieve 10 rows of example data from the DataSet. See example_rows().
- Returns
The example rows.
- Return type
DataFrame
transform(attrs, transformation, name)¶ Apply a linear transformation to a list of attributes based on a matrix.
- Parameters
attrs (List[Attribute]) – Expressions to use as input to the matrix multiplication, in order.
transformation (List[List[float]]) – Matrix to use to define the linear transformation via matrix multiplication – e.g. the output of leapyear.analytics.pca().
name (str) – Common prefix for the name of the attributes to be created.
- Returns
DataSet with transformed attributes appended.
sample(fraction, *, with_replacement=False, seed=None)¶ Return a sampled subset of the rows.
- Returns
DataSet containing a sampled subset of the rows.
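For example, a reproducible 10% subsample, or a bootstrap-style resample (a sketch):
>>> ds_small = ds.sample(0.1, seed=42)
>>> ds_boot = ds.sample(1.0, with_replacement=True, seed=7)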
distinct()¶ Return a DataSet that contains only the unique rows from this Dataset.
- Return type
DataSet
drop_duplicates(subset=None)¶ Return a DataSet with duplicates in the provided columns dropped.
If all columns are named or subset is None, this is equivalent to distinct(). Errors when subset is an empty list.
except_(ds)¶ Return a DataSet of rows in this DataSet but not in another DataSet.
- Parameters
ds (DataSet) – DataSet to compare with.
- Returns
DataSet containing rows in this DataSet but not in another DataSet.
difference(ds)¶ Return a DataSet of rows in this DataSet but not in another DataSet.
- Parameters
ds (DataSet) – DataSet to compare with.
- Returns
DataSet containing rows in this DataSet but not in another DataSet.
-
symmetric_difference
(ds)¶ Return the symmetric difference of the two DataSets.
- Parameters
ds (
DataSet
) – DataSet to compare with.- Returns
DataSet containing rows that are not in the intersection of the two DataSets.
- Return type
-
intersect
(ds)¶ Return intersection of the two DataSets.
- Parameters
ds (
DataSet
) – DataSet to compare with.- Returns
DataSet containing rows that are in the intersection of the two DataSets.
- Return type
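To illustrate the set operations together, a sketch assuming two DataSets ds_a and ds_b with the same schema:
>>> common = ds_a.intersect(ds_b)             # rows present in both
>>> only_a = ds_a.except_(ds_b)               # rows only in ds_a
>>> either = ds_a.symmetric_difference(ds_b)  # rows in exactly one of the two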
-
order_by
(*attrs)¶ Order DataSet by the given expressions.
-
limit
(n)¶ Limit the number of rows.
-
cache
(storageLevel=StorageLevel.MEMORY_AND_DISK)¶ Cache a DataSet on the server side.
- Parameters
storageLevel (
StorageLevel
) –StorageLevel for persisting datasets. The meaning of these values is documented in the Spark RDD programming guide.
- Returns
DataSet that instructs the system to lazily cache the underlying data.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
unpersist
()¶ Immediately begin unpersisting the DataSet on the server side.
- Returns
- Return type
None
Example
Build a cache and unpersist it
>>> la.count_rows(ds.cache()).run()
>>> la.mean(ds["foo"]).run()  # this will hit the cache
>>> ds.unpersist()
>>> la.sum(ds["foo"]).run()  # this will no longer hit the cache
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
repartition
(numPartitions, *attrs)¶ Repartition a Dataset by hashing the columns.
- Returns
DataSet partitioned according to the specified parameters.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
sortWithinPartitions
(*attrs)¶ Sort Dataset rows by the provided columns within partitions, not globally.
- Returns
DataSet sorted within partitions.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
replace
(col, replacement)¶ Replace values matching keys in replacement map.
- Parameters
- Returns
New Dataset where specified values in a given column are replaced according to the replacement map.
- Return type
-
fill
(col, value)¶ Fill missing values in a given column.
- Parameters
- Returns
A new DataSet where
NULL
values in a given column are replaced with the specified expression.- Return type
-
drop
(*cols, how='any')¶ Drop rows where specified columns contain
NULL
values.- Parameters
- Returns
A new DataSet with rows removed where the specified columns contain
NULL
values (any or all, depending on the value of the how parameter).- Return type
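A minimal sketch of the two NULL-handling strategies, assuming a DataSet ds with nullable columns ‘col1’ and ‘col2’:
>>> ds_filled = ds.fill('col1', 0)                   # replace NULLs in 'col1' with 0
>>> ds_dropped = ds.drop('col1', 'col2', how='any')  # drop rows where either column is NULL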
-
join_pandas
(from_col, dataframe, key_col, value_col)¶ Join a column of pandas data with the data set.
See
join_data
for extended details.- Parameters
from_col (
Union
[Attribute
,str
]) – An expression or attribute name in the data set to join the data to. The type of this column should match the keys in the mapping.dataframe (
DataFrame
) – The pandas DataFrame that contains values to map to in this data set. The DataFrame cannot be empty and has a limit of 100,000 rows.key_col (
str
) – The name of the pandas column to obtain the keys to matchfrom_col
. If duplicated keys occur, only one key from the data set is used in the join.value_col (
str
) – The name of the pandas column to obtain the values of the mapping.
- Returns
The original data set with a new column containing analyst-supplied values.
- Return type
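A sketch mirroring the join_data example below, but with the mapping held in a pandas DataFrame (the column names are hypothetical):
>>> import pandas as pd
>>> mapping_df = pd.DataFrame({'Sex': ['male', 'female'],
...                            'first_letter': ['m', 'f']})
>>> new_ds = ds.join_pandas('Sex', mapping_df,
...                         key_col='Sex', value_col='first_letter')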
-
join_data
(from_col, new_col, mapping)¶ Join key-value data to the data set.
Analyst supplied data can be added to an existing sensitive data set without a loss to privacy. This is a replacement for the
decode
expression when there are many keys (>100).Values matching the keys of the dictionary
mapping
are replaced by the associated values. Values that do not match any of the keys are replaced byNULL
. Keys and values may be python literals supported by the LeapYear client, or other Attributes.Note
If the combination of keys assures there should be no
NULL
values, the client will not automatically convert the result of join_data to a non-nullable type. The user must use coalesce to removeNULL
from the domain of possible values.- Parameters
from_col (
Union
[Attribute
,str
]) – An expression or attribute name in the data set to join the data to. The type of this column should match the keys in the mapping.new_col (
str
) – The name of the new attribute that is added to the returned data set with the same type as the mapping values.mapping (
Mapping
[Union
[float
,bool
,int
,str
,datetime
,date
],Union
[float
,bool
,int
,str
,datetime
,date
]]) – Python dictionary of key-value pairs to add to the data set. The keys should be unique. The mapping cannot be empty and has a limit of 100,000 keys.
- Returns
The original data set with a new column containing analyst-supplied values.
- Return type
Example
Suppose that we have a table with a Sex column containing the values male and female:

Sex     Age
male    22
female  38
female  26
female  35
male    35
We’d like to encode the abbreviations (the first letter) as a new column, using a mapping of the following form:
Sex     first_letter
male    m
female  f
To do so, we call this function, assuming the LeapYear DataSet is called ds and we wish to call the new column sex_first_letter:

new_ds = ds.join_data("Sex", "sex_first_letter", {"female": "f", "male": "m"})
This will produce a DataSet which looks like:
Sex     Age  sex_first_letter
male    22   m
female  38   f
female  26   f
female  35   f
male    35   m
-
predict
(model, attrs=None, name='predict')¶ Return a DataSet with a prediction column for the model.
- Parameters
model (
Union
[ClusterModel
,GLM
,RandomForestClassifier
,RandomForestRegressor
,GradientBoostedTreeClassifier
]) – The model to predict outcomes with.attrs (
Optional
[List
[Union
[Attribute
,str
]]]) – The attributes to use in the transformation, orNone
if all the attributes should be used.name (
str
) – Name or common prefix of the attribute(s) to be created.
- Returns
The original dataset with the prediction column(s) appended.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
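For example, a sketch that trains a linear model (see generalized_linreg() in Module leapyear.analytics) and appends its predictions, assuming ds has columns ‘x0’, ‘x1’ and ‘y’:
>>> from leapyear.analytics import generalized_linreg
>>> model = generalized_linreg(['x0', 'x1'], 'y', ds, affine=True).run()
>>> ds_pred = ds.predict(model, attrs=['x0', 'x1'], name='y_hat')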
-
predict_proba
(model, attrs=None, name='proba')¶ Return a DataSet with prediction probability columns for the model.
- Parameters
model (
Union
[GLM
,RandomForestClassifier
,GradientBoostedTreeClassifier
]) – The model to predict outcomes with.attrs (
Optional
[List
[Union
[Attribute
,str
]]]) – The attributes to use in the transformation, orNone
if all the attributes should be used.name (
str
) – Name or common prefix of the attribute(s) to be created.
- Returns
The original dataset with the prediction probability column(s) appended.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
Attribute class¶
-
class
leapyear.dataset.
Attribute
(expr, relation, *, ordering=OrderSpec(osDirection=SortDirection.Asc, osNullOrdering=None), name=None)¶ An attribute of a Dataset.
This exists so that transformations can be performed on single attributes of the dataset. For example, the attribute height might be measured in meters when centimeters would be more appropriate, so an intermediate Attribute can be extracted from the DataSet and manipulated.
All attribute manipulations are lazy; they are not evaluated until they are needed to perform an analysis on the LeapYear server.
-
property
type
¶ Get the data type of this attribute.
- Return type
AttributeType
-
property
expression
¶ Get the Attribute’s expression.
- Return type
Expression
-
property
ordering
¶ Get the ordering of this attribute.
- Return type
OrderSpec
-
alias
(name)¶ Associate an alias with an attribute.
- Parameters
name (
str
) – The name to associate with an attribute.- Return type
Attribute
-
is_not
()¶ Return the boolean inverse of the attribute.
- Return type
Attribute
-
sign
()¶ Return the sign of each element (1, 0 or -1).
- Return type
Attribute
-
floor
()¶ Apply the floor transformation to the attribute.
Each attribute value is transformed to the largest integer less than or equal to the attribute value.
- Return type
Attribute
-
ceil
()¶ Apply the ceiling transformation to the attribute.
Each attribute value is transformed to the smallest integer greater than or equal to the attribute value.
- Return type
Attribute
-
exp
()¶ Apply exponent transformation to an attribute.
- Return type
Attribute
-
expm1
()¶ Apply exponent transformation to an attribute and subtract one.
- Return type
Attribute
-
sqrt
()¶ Apply square root transformation to an attribute.
- Return type
Attribute
-
log
()¶ Apply natural logarithm transformation to an attribute.
- Return type
Attribute
-
log1p
()¶ Apply the natural logarithm transformation to one plus the attribute, i.e. log(1 + x).
- Return type
Attribute
-
sigmoid
()¶ Apply sigmoid transformation to an attribute.
- Return type
Attribute
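These transformations compose lazily; a sketch deriving transformed attributes, assuming ds1 has a positive REAL column ‘col1’:
>>> ds2 = ds1.with_attributes({'col1_log': ds1['col1'].log(),
...                            'col1_floor': ds1['col1'].floor()})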
-
replace
(mapping)¶ Replace specified values with the new values or attribute expressions.
Values matching the keys of the mapping are replaced by the associated values. Values that do not match any of the keys are kept.
Keys and values may be python literals supported by the LeapYear client, or other Attributes.
- Parameters
mapping (
Mapping
[Union
[float
,bool
,int
,str
,datetime
,date
],Union
[Attribute
,Expression
,float
,bool
,int
,str
,datetime
,date
]]) – A mapping from values of this attribute’s type (T) to another set of values and a different type (U). U may be a python literal type (int, bool, datetime, …) or another Attribute.- Returns
The converted attribute.
- Return type
-
property
microsecond
¶ Return the microseconds part of a temporal type.
- Return type
Attribute
-
property
second
¶ Return the seconds part of a temporal type.
- Return type
Attribute
-
property
minute
¶ Return the minutes part of a temporal type.
- Return type
Attribute
-
property
hour
¶ Return the hours part of a temporal type.
- Return type
Attribute
-
property
day
¶ Return the days part of a temporal type.
- Return type
Attribute
-
property
month
¶ Return the months part of a temporal type.
- Return type
Attribute
-
property
year
¶ Return the years part of a temporal type.
- Return type
Attribute
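A sketch extracting date parts from a DATE or DATETIME attribute (the DataSet ds1 and column ‘dt_col’ are assumed):
>>> ds2 = ds1.with_attributes({'dt_year': ds1['dt_col'].year,
...                            'dt_month': ds1['dt_col'].month,
...                            'dt_day': ds1['dt_col'].day})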
-
greatest
(attrs)¶ Take the elementwise maximum.
Return
NULL
if and only if all attribute values areNULL
.- Return type
Attribute
-
least
(attrs)¶ Take the elementwise minimum.
Return
NULL
if and only if all entries areNULL
.- Return type
Attribute
-
coalesce
(fallthrough, attrs=None)¶ Convert all
NULL
values of an attribute to a value.This function will extend the type of the attribute if necessary and drop the nullable tag from the attribute data type.
- Parameters
- Returns
The non-nullable attribute with extended type range (if necessary).
- Return type
Examples
1. Use coalesce to replace missing values with ‘1’ and create a new attribute ‘col1_trn’:
>>> ds2 = ds1.with_attributes({'col1_trn':ds1['col1'].coalesce('1')})
-
isnull
()¶ Boolean check whether an attribute value is
NULL
.Return a boolean attribute which resolves to
True
whenever the underlying attribute value isNULL
.- Return type
Attribute
-
decode
(mapping)¶ Map specified values to the new values or attribute expressions.
Values matching the keys of the mapping are replaced by the associated values.
Values that do not match any of the keys are replaced by
NULL
.Keys and values may be python literals supported by the LeapYear client, or other Attributes.
If the combination of keys assure there should be no
NULL
values, the client will not automatically convert the result of decode to a non-nullable type. The user must use coalesce to removeNULL
from the domain of possible values.- Parameters
mapping (
Mapping
[Any
,Any
]) – A mapping from values of this attribute’s type (T) to another set of values and a different type (U). U may be a python literal type (int, bool, datetime, …) or another Attribute.- Returns
The converted attribute.
- Return type
Examples
1. Create a new column ‘col1_trn’ that is based on values in ‘col1’. If the value matches ‘pattern’ we assign the same string otherwise we assign ‘Other’ using the decode function:
>>> import leapyear.functions as f
>>> some_func = lambda x: (x != 'pattern').decode({True: 'Other', False: 'pattern'})
>>> ds2 = ds1.with_attributes({'col1_trn': some_func(f.col('col1'))})
2. Create a new column ‘col1_trn’ that is based on values in ‘col1’. If the value == 0 then ‘col1_trn’ takes the value 0. Otherwise, it takes the value of ‘col2’:
>>> import leapyear.functions as f
>>> some_func = lambda x: (x == 0).decode({True: 0, False: f.col('col2')})
>>> ds2 = ds1.with_attributes({'col1_trn': some_func(f.col('col1'))})
3. Create a new column ‘col1_trn’ that is based on values in ‘col1’ and ‘col2’. Based on an expression involving ‘col1’ and ‘col2’, ‘col1_trn’ takes the value of ‘col1’ or ‘col2’:
>>> import leapyear.functions as f
>>> def some_func(x, y):
...     return ((x == 0) & (y != 0)).decode({True: f.col('col1'), False: f.col('col2')})
>>> ds2 = ds1.with_attributes({'col1_trn': some_func(f.col('col1'), f.col('col2'))})
-
is_in
(options)¶ Boolean check whether an attribute value is in a list.
Return a boolean attribute which resolves to
True
whenever the underlying attribute matches one of the list entries.- Return type
Attribute
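For instance, a sketch filtering to an allowed list of values (the names are hypothetical):
>>> ds2 = ds1.where(ds1['col1'].is_in(['a', 'b', 'c']))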
-
text_to_bool
()¶ Convert the attribute to a boolean.
Strings that match (case insensitively) “1”, “y”, “yes”, “t”, or “true” are converted to
True
;Strings that match (case insensitively) “0”, “n”, “no”, “f”, or “false” are converted to
False
. Other strings are treated asNULL
.- Return type
Attribute
-
text_to_real
(lb, ub)¶ Convert a text attribute to a real-valued attribute.
- Parameters
- Return type
Attribute
-
text_to_factor
(distinct_vals_list)¶ Convert a text attribute to a factor attribute.
-
text_to_int
(lb, ub)¶ Convert a text attribute to an integer-valued attribute.
NOTE: integers close to
MAX_INT64
orMIN_INT64
are not represented precisely.
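A sketch of the text conversions, assuming ds1 holds TEXT columns for a flag, an amount and a code (names and bounds are hypothetical):
>>> ds2 = ds1.with_attributes({
...     'flag': ds1['flag_txt'].text_to_bool(),
...     'amount': ds1['amount_txt'].text_to_real(0.0, 1e6),
...     'code': ds1['code_txt'].text_to_factor(['A', 'B', 'C']),
... })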
-
as_real
()¶ Convert the attribute to a real.
- Return type
Attribute
-
as_factor
()¶ Represent this attribute as a factor attribute.
Converts the attribute of type INT or BOOL to FACTOR.
Note: For more complex conversions, consider
decode()
.Examples
Convert an attribute ‘col1’ in ds1 to factor:
>>> ds2 = ds1.with_attributes({'col1_fac':ds1['col1'].as_factor()})
- Return type
Attribute
-
asc
()¶ Sort ascending.
Causes the attribute to be sorted in ascending order when the sorting order is applied to the dataset.
Examples
Order a dataset by multiple cols and drop duplicates:
>>> ds2 = ds1.order_by(ds1['col1'].asc(), ds1['col2'].asc(), ds1['col3'].asc())
>>> ds2 = ds2.drop_duplicates(['col2'])
- Return type
Attribute
-
asc_nulls_first
()¶ Sort ascending, missing values first.
Causes the attribute to be sorted in ascending order when the sorting order is applied to the dataset, with
NULL
values first.- Return type
Attribute
-
asc_nulls_last
()¶ Sort ascending, missing values last.
Causes the attribute to be sorted in ascending order when the sorting order is applied to the dataset, with
NULL
values last.- Return type
Attribute
-
desc
()¶ Sort descending.
Causes the attribute to be sorted in descending order when the sorting order is applied to the dataset.
- Return type
Attribute
-
desc_nulls_first
()¶ Sort descending, missing values first.
Causes the attribute to be sorted in descending order when the sorting order is applied to the dataset, with
NULL
values first.- Return type
Attribute
-
desc_nulls_last
()¶ Sort descending, missing values last.
Causes the attribute to be sorted in descending order when the sorting order is applied to the dataset, with
NULL
values last.- Return type
Attribute
-
class
leapyear.dataset.
AttributeType
(*args, **kwargs)¶ The type of an Attribute.
An AttributeType is a read-only object returned by the server, and should never be constructed directly.
Aliases¶
-
leapyear.dataset.attribute.
AttributeLike
(*args, **kwargs)¶
Attribute-like objects are those that can be readily converted to an attribute.
These include existing attributes, dates, strings, integers, floats and expressions based on these objects.
Grouping and Windowing classes¶
-
class
leapyear.dataset.
GroupedData
(grouping, rel)¶ The result of a
DataSet.group_by()
.Run
agg()
to get a DataSet.-
agg
(*attr_aggs, max_count=None, **kwargs)¶ Specify the aggregations to perform on the GroupedData.
- Parameters
attr_aggs (
Tuple
[List
[Union
[Attribute
,str
]],Union
[str
,Aggregation
]]) –A list of pairs. The second element of the pair is the aggregation to perform; the first is a list of columns on which to perform it. This is a list and not just a single attribute to support nullary and binary aggregations.
Available aggregations:
min
max
and
or
count_distinct
approx_count_distinct
count
mean
stddev
stddev_samp
stddev_pop
variance
var_samp
var_pop
skewness
kurtosis
sum
sum_distinct
covar_samp
covar_pop
corr
max_count (Optional[int]) – If the count aggregate is requested, this parameter will be used as an upper bound in the schema of the derived aggregate count attribute(s) of the grouped
DataSet
. If not supplied, the upper bound would be inferred from the data before returning the resultingDataSet
object.Note
When this parameter is set too high, differentially private computations on the derived aggregate attribute will include a higher randomization effect. When it is set too low, all counts higher than
max_count
will be replaced bymax_count
.kwargs (
Any
) – All keyword arguments are passed to the functionrun
when the parametermax_count
has to be inferred from the data. This includesmax_timeout_sec
, which defaults to 300 seconds.
- Returns
A
DataSet
object, containing aggregation results.- Return type
Example
To compute correlation of height and weight as well as the count, run:
>>> gd.agg((['height','weight'], 'corr'), ([], 'count'))
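A fuller sketch, assuming the DataSet ds has a grouping column ‘country’ and a numeric column ‘height’; max_count caps the upper bound of the derived count attribute:
>>> gd = ds.group_by('country')
>>> agg_ds = gd.agg((['height'], 'mean'), ([], 'count'), max_count=100000)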
-
-
class
leapyear.dataset.
Window
¶ Utility functions for defining WindowSpec.
Examples
>>> # ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
>>> window = Window.order_by("date").rows_between(
...     Window.unbounded_preceding, Window.current_row,
... )
>>> # PARTITION BY country ORDER BY date ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
>>> window = Window.order_by("date").partition_by("country").rows_between(-3, 3)
-
classmethod
partition_by
(*cols)¶ Create a WindowSpec with the partitioning defined.
-
classmethod
order_by
(*cols)¶ Create a WindowSpec with the ordering defined.
-
classmethod
rows_between
(start, end)¶ Create a WindowSpec with the frame boundaries defined.
Create a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive). Both start and end are relative positions from the current row. For example, “0” means “current row”, while “-1” means the row before the current row, and “5” means the fifth row after the current row. We recommend using Window.unbounded_preceding, Window.unbounded_following, and Window.current_row to specify these special boundary values, rather than using integral values directly.
Note: windows of 1000 rows or more are not currently supported for expressions other than lead and lag.
- Parameters
start (
int
) – boundary start, inclusive. The frame is unbounded if this isWindow.unbounded_preceding
, or any value less than or equal to -9223372036854775808.end (
int
) – boundary end, inclusive. The frame is unbounded if this isWindow.unbounded_following
, or any value greater than or equal to 9223372036854775807.
- Return type
WindowSpec
Module leapyear.functions¶
Functions for Attributes.
Datetime functions¶
Time functions for Attributes.
-
leapyear.functions.time.
add_months
(start_date, num_months)¶ Return the date that is num_months after start_date.
NOTE: This function can take a datetime as input, but produces a date output regardless.
Examples
>>> str(dt)
'2004-02-29 23:59:59'
>>> add_months(dt, 1)
'2004-03-29'
>>> add_months(dt, 13)
'2005-03-01'
- Return type
Attribute
-
leapyear.functions.time.
date_add
(start, days)¶ Return the date that is days days after start.
NOTE: This function can take a datetime as input, but produces a date output regardless.
Examples
>>> str(dt)
'2004-02-28 23:59:59'
>>> date_add(dt, 1)
'2004-02-29'
>>> date_add(dt, 2)
'2004-03-01'
- Return type
Attribute
-
leapyear.functions.time.
date_sub
(start, days)¶ Return the date that is days days before start.
NOTE: This function can take a datetime as input, but produces a date output regardless.
Examples
>>> str(dt)
'2004-03-01 23:59:59'
>>> date_sub(dt, 1)
'2004-02-29'
>>> date_sub(dt, 2)
'2004-02-28'
- Return type
Attribute
-
leapyear.functions.time.
datediff
(end, start)¶ Return the number of days from start to end.
Examples
1. Create date diff column ‘date_sub’ that is a difference between datetime column ‘col1’ and datetime column ‘col2’ in ds1:
>>> import leapyear.functions as f
>>> ds2 = ds1.with_attributes({'date_sub': f.time.datediff(f.col('col1'), f.col('col2'))})
- Return type
Attribute
-
leapyear.functions.time.
dayofmonth
(e)¶ Extract the day of the month as an integer from a given date/datetime attribute.
- Return type
Attribute
-
leapyear.functions.time.
dayofyear
(e)¶ Extract the day of the year as an integer from a given date/datetime attribute.
- Return type
Attribute
-
leapyear.functions.time.
hour
(e)¶ Extract the hours as an integer from a given date/datetime attribute.
- Return type
Attribute
-
leapyear.functions.time.
last_day
(e)¶ Return the last day of the month which the given date belongs to.
NOTE: This function can take a datetime as input, but produces a date output regardless.
Examples
>>> str(dt)
'2004-02-01 23:59:59'
>>> last_day(dt)
'2004-02-29'
- Return type
Attribute
-
leapyear.functions.time.
minute
(e)¶ Extract the minutes as an integer from a given date/datetime attribute.
- Return type
Attribute
-
leapyear.functions.time.
month
(e)¶ Extract the month as an integer from a given date/datetime attribute.
- Return type
Attribute
-
leapyear.functions.time.
months_between
(date1, date2)¶ Return the integer number of months between dates date1 and date2.
This is based on both the day and the month, not the number of days between the dates. Note in the last line of the examples that there is 1 month between
dateC
anddateB
even though there are only 29 days between them. The result is negative ifdate1
is >=1 month beforedate2
. The number of months is always rounded down to the nearest integer.Examples
>>> str(dateA)
'2004-01-01'
>>> str(dateB)
'2004-02-01'
>>> str(dateC)
'2004-03-02'
>>> months_between(dateA, dateB)
-1
>>> months_between(dateB, dateA)
1
>>> months_between(dateC, dateB)
1
- Return type
Attribute
-
leapyear.functions.time.
next_day
(date, day_of_week)¶ Return the next date that falls on the specified day of the week.
NOTE: This function can take a datetime as input, but produces a date output regardless.
Examples
>>> str(dt)
'2004-02-23 23:59:59'
>>> next_day(dt, "Sunday")
'2004-02-29'
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Attribute
-
leapyear.functions.time.
quarter
(e)¶ Extract the quarter as an integer from a given date/datetime attribute.
- Return type
Attribute
-
leapyear.functions.time.
second
(e)¶ Extract the seconds as an integer from a given date/datetime attribute.
- Return type
Attribute
-
leapyear.functions.time.
to_date
(e)¶ Convert the column into Date type.
Examples
Create truncated column ‘date_trunc’ from a datetime column ‘col1’ in ds1:
>>> import leapyear.functions as f
>>> ds2 = ds1.with_attributes({'date_trunc': f.time.to_date(f.col('col1'))})
- Return type
Attribute
-
leapyear.functions.time.
to_datetime
(ts)¶ Convert the column into Datetime type.
- Return type
Attribute
-
leapyear.functions.time.
trunc
(date, fmt)¶ Return date truncated to the unit specified by the format.
fmt can be one of: “year”, “month”, “day”
NOTE: This function can take a datetime as input, but produces a date output regardless.
Examples
>>> str(dt)
'2004-02-29 23:59:59'
>>> trunc(dt, "month")
'2004-02-01'
- Return type
Attribute
-
leapyear.functions.time.
weekofyear
(e)¶ Extract the week number as an integer from a given date/datetime attribute.
Week numbers range from 1 to (up to) 53, and new weeks start on Monday. Dates at the beginning / end of a year may be considered to be part of a week from the previous / next year, per the ISO 8601 week numbering convention.
- Return type
Attribute
-
leapyear.functions.time.
year
(e)¶ Extract the year as an integer from a given date/datetime attribute.
- Return type
Attribute
-
leapyear.functions.time.
yearofweek
(e)¶ Extract the year based on the ISO week numbering, where a week associated with the previous year may spill over into the next calendar year.
Week numbering follows the ISO 8601 convention.
- Return type
Attribute
-
leapyear.functions.time.
year_with_week
(e)¶ Extract the year and the ISO week from a Date column and return a Factor which combines the two.
For example, given a row containing ‘dt(2021,1,1)’ this function returns ‘2020-53’.
Week numbering follows the ISO 8601 convention.
- Return type
Attribute
-
leapyear.functions.time.
dayofweek
(e)¶ Extract the day of the week number as an integer from a given date/datetime attribute, where Monday = 1, …, Sunday = 7.
- Return type
Attribute
-
leapyear.functions.time.
parse_clamped_time
(e, bounds, *, fmt=None)¶ Parse a Text column and return a DateTime column using the given format clamped to bounds.
Format specification and default format depends on the back end:
- Spark
default format:
yyyy-MM-dd HH:mm:ss
(Java reference)- Snowflake
default format:
yyyy-MM-DD HH:MI:SS
(Snowflake reference)
Examples
>>> from datetime import datetime as dt
>>> parse_clamped_time(f.col("date"),
...                    bounds=(dt(2000, 1, 1), dt(2020, 12, 31)),
...                    fmt="yyyyMMdd HHmmss")
- Return type
Attribute
Math functions¶
Math functions for Attributes.
-
leapyear.functions.math.
acos
(e)¶ Compute the cosine inverse of the given value.
The returned angle is in the range 0 through \(\pi\).
- Return type
Attribute
-
leapyear.functions.math.
asin
(e)¶ Compute the sine inverse of the given value.
The returned angle is in the range \(-\pi/2\) through \(\pi/2\).
- Return type
Attribute
-
leapyear.functions.math.
atan
(e)¶ Compute the tangent inverse of the given value.
- Return type
Attribute
-
leapyear.functions.math.
cbrt
(e)¶ Compute the cube-root of the given value.
- Return type
Attribute
-
leapyear.functions.math.
ceil
(e)¶ Compute the ceiling of the given value.
- Return type
Attribute
-
leapyear.functions.math.
cos
(e)¶ Compute the cosine of the given value.
- Return type
Attribute
-
leapyear.functions.math.
cosh
(e)¶ Compute the hyperbolic cosine of the given value.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Attribute
-
leapyear.functions.math.
degrees
(e)¶ Convert an angle measured in radians to an equivalent angle measured in degrees.
- Return type
Attribute
-
leapyear.functions.math.
exp
(e)¶ Compute the exponential of the given value.
- Return type
Attribute
-
leapyear.functions.math.
expm1
(e)¶ Compute the exponential of the given value minus one.
- Return type
Attribute
-
leapyear.functions.math.
floor
(e)¶ Compute the floor of the given value.
- Return type
Attribute
-
leapyear.functions.math.
hypot
(x, y)¶ Compute sqrt(x^2 + y^2) without intermediate overflow or underflow.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Attribute
-
leapyear.functions.math.
log
(e)¶ Compute the natural logarithm of the given value.
- Return type
Attribute
-
leapyear.functions.math.
log10
(e)¶ Compute the logarithm of the given value in base 10.
- Return type
Attribute
-
leapyear.functions.math.
log1p
(e)¶ Compute the natural logarithm of the given value plus one.
- Return type
Attribute
-
leapyear.functions.math.
log2
(e)¶ Compute the logarithm of the given column in base 2.
- Return type
Attribute
-
leapyear.functions.math.
sigmoid
(e)¶ Compute the sigmoid of the given value.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Attribute
-
leapyear.functions.math.
pow
(base, exp)¶ Return the value of the first argument raised to the power of the second argument.
- Return type
Attribute
-
leapyear.functions.math.
radians
(e)¶ Convert an angle measured in degrees to an equivalent angle measured in radians.
- Return type
Attribute
-
leapyear.functions.math.
round
(e, scale=0)¶ Round the value of e to scale decimal places.
Examples
1. Create a new attribute ‘col1_trn’ which is derived by rounding ‘col1’ to 2 digits:
>>> import leapyear.functions as f
>>> ds2 = ds1.with_attributes({'col1_trn': f.math.round(f.col('col1'), 2)})
- Return type
Attribute
-
leapyear.functions.math.
signum
(e)¶ Compute the signum of the given value.
- Return type
Attribute
-
leapyear.functions.math.
sin
(e)¶ Compute the sine of the given value.
- Return type
Attribute
-
leapyear.functions.math.
sinh
(e)¶ Compute the hyperbolic sine of the given value.
- Return type
Attribute
-
leapyear.functions.math.
sqrt
(e)¶ Compute the square root of the specified float value.
- Return type
Attribute
-
leapyear.functions.math.
tan
(e)¶ Compute the tangent of the given value.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Attribute
-
leapyear.functions.math.
tanh
(e)¶ Compute the hyperbolic tangent of the given value.
- Return type
Attribute
-
leapyear.functions.math.
erf
(e)¶ Compute the error function of the given value.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Attribute
-
leapyear.functions.math.
erfc
(e)¶ Compute the complementary error function of the given value.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Attribute
-
leapyear.functions.math.
inverf
(e)¶ Compute the inverse error function of the given value.
- Return type
Attribute
-
leapyear.functions.math.
inverfc
(e)¶ Compute the inverse complementary error function of the given value.
- Return type
Attribute
Non-aggregate functions¶
Non-aggregate functions for Attributes.
-
leapyear.functions.non_aggregate.
all
(*exprs)¶ Return True if all columns are True.
- Return type
Attribute
-
leapyear.functions.non_aggregate.
attr_in
(input_attr_name, in_list)¶ Expression for filtering rows based on attribute values that are IN a list of values.
- Parameters
input_attr_name (String) – Name of the attribute to filter on
in_list (Iterable) – List of values for filtering
- Returns
Boolean Attribute
- Return type
Examples
Filtering based on an attribute taking a value from a list of values:
>>> import leapyear.functions as f
>>> ds2 = ds.where(f.attr_in('col1', in_list=[val1, val2, val3]))
-
leapyear.functions.non_aggregate.
attr_not_in
(input_attr_name, not_in_list)¶ Expression for filtering rows based on attribute values that are NOT IN a list of values.
- Parameters
- Returns
Boolean Attribute
- Return type
Examples
Filtering based on an attribute NOT taking a value from a list of values:
>>> import leapyear.functions as f
>>> ds2 = ds.where(f.attr_not_in('col1', not_in_list=[val1, val2, val3]))
-
leapyear.functions.non_aggregate.
any
(*exprs)¶ Return True if any column is True.
- Return type
Attribute
-
leapyear.functions.non_aggregate.
abs
(e)¶ Compute the absolute value.
- Return type
Attribute
-
leapyear.functions.non_aggregate.
col
(col_name)¶ Return an Attribute based on the given name.
- Return type
Attribute
-
leapyear.functions.non_aggregate.
column
(col_name)¶ Return an Attribute based on the given name.
- Return type
Attribute
-
leapyear.functions.non_aggregate.
greatest
(*exprs)¶ Return the greatest value of the list of column names, skipping null values.
- Return type
Attribute
-
leapyear.functions.non_aggregate.
isnull
(e)¶ Return true iff the column is null.
- Return type
Attribute
-
leapyear.functions.non_aggregate.
least
(*exprs)¶ Return the least value of the list of column names, skipping null values.
- Return type
Attribute
-
leapyear.functions.non_aggregate.
lit
(literal)¶ Create an Attribute of literal value.
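For example, a sketch tagging every row with a constant value (the column name is hypothetical):
>>> import leapyear.functions as f
>>> ds2 = ds1.with_attributes({'const_one': f.lit(1)})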
-
leapyear.functions.non_aggregate.
negate
(e)¶ Unary minus.
- Return type
Attribute
-
leapyear.functions.non_aggregate.
not_
(e)¶ Inversion of boolean expression.
- Return type
Attribute
-
leapyear.functions.non_aggregate.
when
(condition, value)¶ Evaluate a list of conditions and return one of multiple possible result expressions.
If otherwise is not defined at the end, null is returned for unmatched conditions.
Example
Encoding gender string column into an integer.
>>> df.select(when(col("gender") == "male", 0)
...           .when(col("gender") == "female", 1)
...           .otherwise(2))
- Return type
Attribute
-
leapyear.functions.non_aggregate.
to_text
(e)¶ Convert an attribute to a string representation.
- Return type
Attribute
String functions¶
String functions for Attributes.
-
leapyear.functions.string.
ascii
(e)¶ Convert the first character of the string to its ASCII value.
- Return type
Attribute
-
leapyear.functions.string.
concat
(*exprs, sep=None)¶ Concatenation of strings with optional separator.
Concatenating factors {‘a’, ‘b’} and {‘c’, ‘d’} gives {‘ac’, ‘ad’, ‘bc’, ‘bd’}. Including the separator ‘ ’ gives {‘a c’, ‘a d’, ‘b c’, ‘b d’}.
If any of the expressions are type TEXT, then the result will be TEXT.
Examples
1. Create a new column ‘col1_trn’ which is the concatenation of ‘col1’ and ‘col2’ with separator ‘-’. It can be used to construct a date column from day and month columns:
>>> import leapyear.functions as f
>>> ds2 = ds1.with_attributes({'col1_trn': f.concat(ds1['col1'], ds1['col2'], sep='-')})
- Return type
Attribute
-
leapyear.functions.string.
instr
(attr, substr)¶ Give the position of the first occurrence of substr in the Attribute, or 0 if it is not found.
- Return type
Attribute
-
leapyear.functions.string.
length
(attr)¶ Return the length of the string.
- Return type
Attribute
-
leapyear.functions.string.
levenshtein
(attr1, attr2)¶ Compute the Levenshtein distance between strings.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Attribute
-
leapyear.functions.string.
locate
(attr, substr, pos=None)¶ Give the position of substr in the Attribute, optionally after pos.
Position is returned as a positive integer starting at 1 if the substring is found, and 0 if the substring is not found.
-
leapyear.functions.string.
lpad
(attr, len_, pad)¶ Pad the string on the left with pad to make the total length len_.
When pad is an empty string, the string is returned without modification, or truncated to len_.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Attribute
-
leapyear.functions.string.
ltrim
(attr)¶ Remove whitespace characters on the beginning of the string.
- Return type
Attribute
-
leapyear.functions.string.
reverse
(attr)¶ Reverse the string.
- Return type
Attribute
-
leapyear.functions.string.
repeat
(attr, n)¶ Repeat the string n times.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Attribute
-
leapyear.functions.string.
rpad
(attr, len_, pad)¶ Pad the string on the right with pad to make the total length len_.
When pad is an empty string, the string is returned without modification, or truncated to len_.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Attribute
-
leapyear.functions.string.
rtrim
(attr)¶ Remove whitespace characters at the end of the string.
- Return type
Attribute
-
leapyear.functions.string.
soundex
(attr)¶ Return the soundex code for the specified expression.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Attribute
-
leapyear.functions.string.
substring
(attr, start, len_)¶ Return the substring of length len_ starting at start.
- Return type
Attribute
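A sketch extracting the first three characters of a text column, assuming 1-based positions as with locate() (the names are hypothetical):
>>> import leapyear.functions as f
>>> ds2 = ds1.with_attributes({'prefix': f.string.substring(f.col('col1'), 1, 3)})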
-
leapyear.functions.string.
substring_index
(attr, delim, len_)¶ Return the substring from the string before len_ occurrences of the delimiter delim.
If len_ is positive, everything to the left of the final delimiter (counting from the left) is returned. If len_ is negative, everything to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Attribute
-
leapyear.functions.string.
translate
(attr, match, replace)¶ Translate any character in the string that appears in match to the corresponding character in replace.
The characters in replace correspond positionally to the characters in match; the translation happens whenever a character in the string matches a character in match.
- Return type
Attribute
-
leapyear.functions.string.
trim
(attr)¶ Remove the whitespace from the beginning and end of the string.
- Return type
Attribute
-
leapyear.functions.string.
lex_lt
(attr1, attr2)¶ Lexicographical less-than operation.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Attribute
-
leapyear.functions.string.
lex_lte
(attr1, attr2)¶ Lexicographical less-than-or-equal operation.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Attribute
-
leapyear.functions.string.
lex_gt
(attr1, attr2)¶ Lexicographical greater-than operation.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Attribute
-
leapyear.functions.string.
lex_gte
(attr1, attr2)¶ Lexicographical greater-than-or-equal operation.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Attribute
-
leapyear.functions.string.
remove_accents
(attr)¶ Remove all accents from the string.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Attribute
-
leapyear.functions.string.
lower
(attr)¶ Convert strings to lowercase.
- Return type
Attribute
-
leapyear.functions.string.
upper
(attr)¶ Convert strings to uppercase.
- Return type
Attribute
-
leapyear.functions.string.
regex_extract
(attr, pattern, group_idx)¶ Use a regular expression pattern to extract a part of the string.
This function always results in a nullable Text type.
The regex syntax depends on the backend:
- Spark
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
- Snowflake
https://docs.snowflake.com/en/sql-reference/functions-regexp.html#general-usage-notes
- Return type
Attribute
-
leapyear.functions.string.
regex_replace
(attr, pattern, replace)¶ Use a regular expression pattern to replace a part of the string.
This function always results in a nullable Text type.
The regex syntax depends on the backend:
- Spark
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
- Snowflake
https://docs.snowflake.com/en/sql-reference/functions-regexp.html#general-usage-notes
- Return type
Attribute
Windowing functions¶
Window functions for Attributes.
-
leapyear.functions.window.
lead
(e, i)¶ Compute the lead value.
- Return type
WindowAttribute
-
leapyear.functions.window.
lag
(e, i)¶ Compute the lag value.
- Return type
WindowAttribute
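A sketch computing the previous row’s value within a per-key, date-ordered window (column names are hypothetical):
>>> import leapyear.functions as f
>>> from leapyear.dataset import Window
>>> ws = Window.partition_by(f.col('col1')).order_by(f.col('date_col').asc())
>>> ds2 = ds1.with_attribute('prev_col2', f.window.lag(f.col('col2'), 1).over(ws))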
-
leapyear.functions.window.
first
(e)¶ Compute the first non-null value.
- Return type
WindowAttribute
-
leapyear.functions.window.
last
(e)¶ Compute the last non-null value.
- Return type
WindowAttribute
-
leapyear.functions.window.
count
(e=None)¶ Compute the number of non-null entries.
- Return type
WindowAttribute
-
leapyear.functions.window.
approx_count_distinct
(attrs)¶ Compute the number of distinct entries (using an approximate streaming algorithm).
- Return type
WindowAttribute
-
leapyear.functions.window.
mean
(e)¶ Compute the mean value.
Examples
1. Create a window specification that partitions based on ‘col1’, orders by ‘date_col’, and selects the past 14 rows up to and including the current row. Then we can compute the mean of ‘col2’ over this window:
>>> import leapyear.functions as f
>>> from leapyear.dataset import Window
>>> ws1 = (Window.partition_by(f.col('col1'))
...        .order_by(f.col('date_col').asc())
...        .rows_between(start=-14, end=0))
>>> ds2 = (ds1.project(['date_col', 'col1', 'col2'])
...        .with_attribute('mean_col2', f.window.mean(f.col('col2')).over(ws1)))
- Return type
WindowAttribute
-
leapyear.functions.window.
avg
(e)¶ Compute the mean value.
Examples
1. Create a window specification that partitions based on ‘col1’, orders by ‘date_col’, and selects the past 14 rows up to and including the current row. Then we can compute the mean of ‘col2’ over this window:
>>> import leapyear.functions as f
>>> from leapyear.dataset import Window
>>> ws1 = (Window.partition_by(f.col('col1'))
...        .order_by(f.col('date_col').asc())
...        .rows_between(start=-14, end=0))
>>> ds2 = (ds1.project(['date_col', 'col1', 'col2'])
...        .with_attribute('mean_col2', f.window.mean(f.col('col2')).over(ws1)))
- Return type
WindowAttribute
-
leapyear.functions.window.
sum
(e)¶ Compute the sum.
- Return type
WindowAttribute
-
leapyear.functions.window.
or_
(e)¶ Compute the or of boolean values.
- Return type
WindowAttribute
-
leapyear.functions.window.
and_
(e)¶ Compute the and of boolean values.
- Return type
WindowAttribute
-
leapyear.functions.window.
min
(e)¶ Compute the min value.
- Return type
WindowAttribute
-
leapyear.functions.window.
max
(e)¶ Compute the max value.
- Return type
WindowAttribute
-
leapyear.functions.window.
stddev
(e)¶ Compute the sample standard deviation.
- Return type
WindowAttribute
-
leapyear.functions.window.
stddev_samp
(e)¶ Compute the sample standard deviation.
- Return type
WindowAttribute
-
leapyear.functions.window.
stddev_pop
(e)¶ Compute the population standard deviation.
- Return type
WindowAttribute
-
leapyear.functions.window.
variance
(e)¶ Compute the sample variance.
- Return type
WindowAttribute
-
leapyear.functions.window.
variance_samp
(e)¶ Compute the sample variance.
- Return type
WindowAttribute
-
leapyear.functions.window.
variance_pop
(e)¶ Compute the population variance.
- Return type
WindowAttribute
-
leapyear.functions.window.
skewness
(e)¶ Compute the skewness.
- Return type
WindowAttribute
-
leapyear.functions.window.
kurtosis
(e)¶ Compute the kurtosis.
- Return type
WindowAttribute
-
leapyear.functions.window.
covar_samp
(e1, e2)¶ Compute the sample covariance.
- Return type
WindowAttribute
-
leapyear.functions.window.
covar_pop
(e1, e2)¶ Compute the population covariance.
- Return type
WindowAttribute
-
leapyear.functions.window.
corr
(e1, e2)¶ Compute the correlation.
- Return type
WindowAttribute
Module leapyear.feature¶
Feature engineering classes.
OneHotEncoder class¶
-
class
leapyear.feature.
OneHotEncoder
(input_cols, max_size=32, drop_originals=True, drop_last=True)¶ One-hot encode attributes.
FACTOR and INT columns that can have fewer than max_size values are converted to BOOL columns indicating the presence of the value. Will only work with non-nullable columns. Nullable columns can be converted to non-nullable with
leapyear.dataset.Attribute.coalesce()
.The last category is not included by default, where the categories are sorted lexicographically based on their characters’ ASCII values.
Examples
Using OneHotEncoder on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:
>>> ohe = OneHotEncoder(['col1', 'col2'], drop_originals=True, max_size=64)
>>> ds2 = ohe.transform(ds1)
- Parameters
input_cols (Sequence[str]) – the names of the input columns.
max_size (int) – maximum number of values to one hot encode per column. (default: 32)
drop_originals (bool) – drop columns not derived from the input columns. (default: True)
drop_last (bool) – drop the last column containing redundant information. (default: True)
BoundsScaler class¶
-
class
leapyear.feature.
BoundsScaler
(input_cols, lower=0.0, upper=1.0)¶ Scale the attributes by the bounds of the type.
BOOL, INT and REAL columns are scaled so all values fall between lower and upper (inclusive). In contrast to MinMaxScaler and StandardScaler, there is no privacy leakage using this class.
Examples
Using BoundsScaler on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:
>>> bs = BoundsScaler(['col1', 'col2'])
>>> ds2 = bs.fit_transform(ds1)
BoundsAbsScaler class¶
-
class
leapyear.feature.
BoundsAbsScaler
(input_cols)¶ Scale the attributes by the max absolute value of the type bounds.
INT and REAL columns are scaled so all values fall between -1 and 1, with no shifting of the data.
Examples
Using BoundsAbsScaler on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:
>>> bs = BoundsAbsScaler(['col1', 'col2'])
>>> ds2 = bs.fit_transform(ds1)
- Parameters
input_cols (Sequence[str]) – the names of the input columns.
MinMaxScaler class¶
-
class
leapyear.feature.
MinMaxScaler
(input_cols, min_=0.0, max_=1.0)¶ Scale the attributes by the min and max of the attribute.
BOOL, INT and REAL columns are scaled so all values fall between min_ and max_ (inclusive).
Examples
Using MinMaxScaler on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:
>>> ms = MinMaxScaler(['col1', 'col2'], min_=0.0, max_=1.0)
>>> ds2 = ms.fit_transform(ds1)
MaxAbsScaler class¶
-
class
leapyear.feature.
MaxAbsScaler
(input_cols)¶ Scale the attributes by the max absolute value of the min and the max.
INT and REAL columns are scaled by the maximum absolute value of the attribute’s min and max, so that values fall between -1 and 1, with no shifting of the data.
Examples
Using MaxAbsScaler on two columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:
>>> mas = MaxAbsScaler(['col1', 'col2'])
>>> ds2 = mas.fit_transform(ds1)
- Parameters
input_cols (Sequence[str]) – the names of the input columns.
StandardScaler class¶
-
class
leapyear.feature.
StandardScaler
(input_cols, with_mean=True, with_stdev=True)¶ Scale the attributes to be centered at zero with unit variance.
INT and REAL columns are centered by removing the mean and scaled to unit variance.
Examples
Using StandardScaler on columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:
>>> ss = StandardScaler(['col1', 'col2'], with_mean=True, with_stdev=False)
>>> ds2 = ss.fit_transform(ds1)
ScaleTransformModel class¶
-
class
leapyear.feature.
ScaleTransformModel
(attr_lower_upper, lower, upper, scale_bool)¶ Scale attributes to new values.
Shift and scale each attribute so that the two values associated with each attribute in attr_lower_upper are mapped to the two new values given by lower and upper.
Normalizer class¶
-
class
leapyear.feature.
Normalizer
(input_cols, p, suffix=<factory>)¶ Compute the p-norm of attributes and normalize by norm value.
Will only work on INT and REAL columns.
Examples
Using Normalizer on columns ‘col1’ and ‘col2’ in Dataset ‘ds1’:
>>> norm_trn = Normalizer(['col1', 'col2'], p=2)
>>> ds2 = norm_trn.fit_transform(ds1)
- Parameters
-
fit_transform
(dataset, **kwargs)¶ Normalize a set of attributes.
Computes the p-norm.
- Return type
DataSet
Winsorizer class¶
-
class
leapyear.feature.
Winsorizer
(input_col, lo_val, hi_val, suffix=<factory>)¶ Bound the non-null values of the attribute to be within lo_val and hi_val.
Will only work on non-nullable INT and REAL columns. Nullable columns can be converted to non-nullable with
leapyear.dataset.Attribute.coalesce()
.When transformed, returns the DataSet with an additional attribute that is winsorised between the given low and high value for the specified attribute.
Examples
Using Winsorizer on column ‘col1’ in Dataset ‘ds1’:
>>> wins = Winsorizer('col1', lo_val=0, hi_val=1)
>>> ds2 = wins.fit_transform(ds1)
- Parameters
input_col (str) – The name of the input columns.
lo_val (float) – After transformation the attribute will be greater than or equal to lo_val
hi_val (float) – After transformation the attribute will be less than or equal to hi_val
suffix (str) – Suffix appended to the name of the transformed attribute. Defaults to ‘_WIN’ for Snowflake and ‘_win’ for Spark
-
fit_transform
(dataset, **kwargs)¶ Winsorize the specified attribute.
Called after providing lo and hi vals.
- Return type
DataSet
Bucketizer class¶
-
class
leapyear.feature.
Bucketizer
(input_col, split_vals)¶ Quantize the attribute according to thresholds specified in split_vals.
attr < split_vals[0]                           -> bin 0
attr >= split_vals[0] and attr < split_vals[1] -> bin 1
...
attr >= split_vals[-1]                         -> bin len(split_vals)
Works with INT and REAL columns.
Examples
Using Bucketizer on column ‘col1’ in Dataset ‘ds1’:
>>> split_vals = [0, 0.25, 0.75]
>>> buck = Bucketizer('col1', split_vals=split_vals)
>>> ds2 = buck.fit_transform(ds1)
- Parameters
-
fit_transform
(dataset, **kwargs)¶ For testing use only.
- Return type
DataSet
Module leapyear.analytics¶
Statistics and machine learning algorithms.
LeapYear analyses are functions that are executed by the server to compute
statistics or to perform machine learning tasks on DataSets
. These
functions return an Analysis
type, which is executed on the server by calling
the run()
method.
For simple statistics, such as count()
or mean()
, the values can be extracted
using the following pattern:
>>> from leapyear import Client, DataSet
>>> from leapyear.analytics import count_rows, mean
>>> client = Client(url='http://ly-server:4401', username='admin', password='password')
>>> dataset = DataSet.from_table('db.table')
>>> dataset_rows_analysis = count_rows(dataset)
>>> n_rows = dataset_rows_analysis.run()
>>> print(n_rows)
10473
>>> dataset_mean_x_analysis = mean('x0', dataset)
>>> mean_x = dataset_mean_x_analysis.run()
>>> print(mean_x)
5.234212346345
The computation of all univariate statistics follows the pattern for mean()
. For
more complicated machine learning tasks, multiple columns must be specified, depending on the
task.
Unsupervised learning tasks (like clustering) will generally require the specification
of which features in the DataSet
to use. Supervised learning tasks (like regression)
will additionally require the specification of a target variable.
For example, we can train a linear regression model as follows:
>>> from leapyear.analytics import generalized_linreg
>>> regression = generalized_linreg(['x0', 'x1'], 'y', dataset, affine=True, l2reg=1.0)
>>> model = regression.run()
Helper routines are available for performing cross-validation
(see cross_val_score_linreg()
). Note that, unlike other analyses,
they are immediately executed (without calling run()
):
>>> from leapyear.analytics import cross_val_score_linreg
>>> cross_val_score = cross_val_score_linreg(
...     ['x0', 'x1'], 'y', dataset, cv=3,
...     affine=True, l1reg=0.1, l2reg=1.0, scorer='mse'
... )
Data Analysis¶
-
leapyear.analytics.
count
(attr, dataset=None, drop_nulls=False, target_relative_error=None, max_budget=None)¶ Analysis: Count the elements of an attribute.
This analysis can be executed using the
run
method to compute the approximate count of elements, includingNULL
values.The user can request additional information about the computation with
run(rich_result=True)
. In this case, an object ofRandomizationInterval
will be generated, likely including the precise value of the computation on the data sample.- Parameters
attr (
Union
[Attribute
,str
]) – The attribute to compute the count of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.dataset (
Optional
[DataSet
]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.drop_nulls (
bool
) – Whether to ignoreNULL
values. Default:False
.target_relative_error (
Optional
[float
]) – A float value between 0 and 1 indicating the level of relative error that should be targeted for this computation. If specified, the system will attempt to ensure that the absolute value of the relative error between the randomized result and the true count is roughly target_relative_error. If this is not possible due to budget constraints set by the admin, the system will return a randomized result with the smallest randomization effect allowed. If not specified, a default value specified by the admin is used. This can only be used if the admin has turned on the adaptive count feature.max_budget (
Optional
[float
]) – The maximum amount of budget that the system should spend while trying to achieve target_relative_error. If specified, the absolute amount of budget spent will not exceed max_budget. If the user specifies a value greater than the maximum budget for the computation set by the admin, system will use the admin-set maximum. If the system can achieve target_relative_error while spending less than max_budget, it will do so.
- Returns
Analysis object that can be executed using the
run
method.- Return type
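For instance, a sketch counting the non-NULL values of an attribute and requesting the randomization details (the DataSet dataset is assumed):
>>> from leapyear.analytics import count
>>> analysis = count('x0', dataset, drop_nulls=True)
>>> n = analysis.run()                     # approximate count
>>> rich = analysis.run(rich_result=True)  # RandomizationInterval with more detail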
-
leapyear.analytics.
count_rows
(dataset, target_relative_error=None, max_budget=None)¶ Analysis: Count the number of rows in a dataset.
This analysis can be executed using the
run
method to compute the approximate number of rows in the dataset.The user can request additional information about the computation with
run(rich_result=True)
. In this case, an object ofRandomizationInterval
will be generated, likely including the precise value of the computation on the data sample.- Parameters
dataset (
DataSet
) – The input dataset.target_relative_error (
Optional
[float
]) – A float value between 0 and 1 indicating the level of relative error that should be targeted for this computation. If specified, the system will attempt to ensure that the absolute value of the relative error between the randomized result and the true count is roughly target_relative_error. If this is not possible due to budget constraints set by the admin, the system will return a randomized result with the smallest randomization effect allowed. If not specified, a default value specified by the admin is used. This can only be used if the admin has turned on the adaptive count feature.max_budget (
Optional
[float
]) – The maximum amount of budget that the system should spend while trying to achieve target_relative_error. If specified, the absolute amount of budget spent will not exceed max_budget. If the user specifies a value greater than the maximum budget for the computation set by the admin, system will use the admin-set maximum. If the system can achieve target_relative_error while spending less than max_budget, it will do so.
- Returns
Analysis object that can be executed using the
run
method.- Return type
-
leapyear.analytics.
count_distinct
(attr, dataset=None, drop_nulls=False, target_relative_error=None, max_budget=None)¶ Analysis: Count the unique elements of an attribute.
- Parameters
attr (
Union
[Attribute
,str
,Sequence
[Union
[Attribute
,str
]]]) – The attribute or attributes to compute the distinct count of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.dataset (
Optional
[DataSet
]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.drop_nulls (
bool
) – Remove any records with null. Unique values associated with records containing nulls are not included in the count.target_relative_error (
Optional
[float
]) – A float value between 0 and 1 indicating the level of relative error that should be targeted for this computation. If specified, the system will attempt to ensure that the absolute value of the relative error between the randomized result and the true count is roughly target_relative_error. If this is not possible due to budget constraints set by the admin, the system will return a randomized result with the smallest randomization effect allowed. If not specified, a default value specified by the admin is used. This can only be used if the admin has turned on the adaptive count feature.max_budget (
Optional
[float
]) – The maximum amount of budget that the system should spend while trying to achieve target_relative_error. If specified, the absolute amount of budget spent will not exceed max_budget. If the user specifies a value greater than the maximum budget for the computation set by the admin, system will use the admin-set maximum. If the system can achieve target_relative_error while spending less than max_budget, it will do so.
- Returns
Prepared analysis of the count.
- Return type
-
leapyear.analytics.
count_distinct_rows
(dataset, target_relative_error=None, max_budget=None)¶ Analysis: Count the number of distinct rows in a dataset.
- Parameters
dataset (
DataSet
) – The input dataset.target_relative_error (
Optional
[float
]) – A float value between 0 and 1 indicating the level of relative error that should be targeted for this computation. If specified, the system will attempt to ensure that the absolute value of the relative error between the randomized result and the true count is roughly target_relative_error. If this is not possible due to budget constraints set by the admin, the system will return a randomized result with the smallest randomization effect allowed. If not specified, a default value specified by the admin is used. This can only be used if the admin has turned on the adaptive count feature.max_budget (
Optional
[float
]) – The maximum amount of budget that the system should spend while trying to achieve target_relative_error. If specified, the absolute amount of budget spent will not exceed max_budget. If the user specifies a value greater than the maximum budget for the computation set by the admin, system will use the admin-set maximum. If the system can achieve target_relative_error while spending less than max_budget, it will do so.
- Returns
Analysis for counting the number of distinct rows.
- Return type
-
leapyear.analytics.mean(attr, dataset=None, drop_nulls=False)¶
Analysis: Compute the mean of an attribute.
This analysis can be executed using the run method to compute the approximate mean of the attribute.
The user can request additional information about the computation with run(rich_result=True). In this case, a RandomizationInterval object will be generated, likely including the precise value of the computation on the data sample.
Note: If the attribute is nullable, setting drop_nulls=True is necessary for the computation to go through.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the mean of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute, this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring NULL values.
- Returns
Analysis object that can be executed using the run method.
- Return type
ScalarAnalysisWithCI
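Example (a sketch of requesting a rich result; ds and the column name x0 are assumed, and the randomization interval is assumed to be exposed via the rich result's metadata field, as in the groupby_agg_view examples below):
>>> import leapyear.analytics as la
>>> rr = la.mean('x0', dataset=ds).run(rich_result=True)
>>> rr.metadata
RandomizationInterval(...)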
-
leapyear.analytics.sum(attr, dataset=None, drop_nulls=False)¶
Analysis: Compute the sum of a numeric attribute.
This analysis can be executed using the run method to compute the approximate sum of the attribute.
The user can request additional information about the computation with run(rich_result=True). In this case, a RandomizationInterval object will be generated, likely including the precise value of the computation on the data sample.
Note: If the attribute is nullable, setting drop_nulls=True is necessary for the computation to go through.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the sum of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute, this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring NULL values.
- Returns
Analysis object that can be executed using the run method.
- Return type
-
leapyear.analytics.variance(attr, dataset=None, drop_nulls=False)¶
Analysis: Compute the variance of an attribute.
This analysis can be executed using the run method to compute the approximate variance of the attribute.
The user can request additional information about the computation with run(rich_result=True). In this case, a RandomizationInterval object will be generated, likely including the precise value of the computation on the data sample.
Note: If the attribute is nullable, setting drop_nulls=True is necessary for the computation to go through.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the variance of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute, this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring NULL values.
- Returns
Analysis object that can be executed using the run method.
- Return type
ScalarAnalysisWithCI
-
leapyear.analytics.min(attr, dataset=None, drop_nulls=False)¶
Analysis: Compute the minimum value of an attribute.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the min of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute, this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.
- Returns
Prepared analysis of the min.
- Return type
Note
The minimum reported is the 1/1000 quantile of the attribute.
When the attribute being analyzed has a very narrow range of possible values, the minimum returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than 0.01, the minimum returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by 10/width, compute the minimum, and rescale the returned value by width/10.
The result of this analysis may be very different from the true minimum of the data sample in the following two scenarios:
1. When the underlying attribute distribution has significant outliers (e.g. a very long tail) - this is because the minimum computed is the 1/1000 quantile of the attribute, and
2. When the public lower bound is very different from the true minimum of the data sample - this is because differential privacy aims to minimize the effect of individual records on the output.
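A sketch of the rescaling workaround described above (the width value is hypothetical, and attribute arithmetic of the form ds['x0'] * k is assumed to be supported):
>>> import leapyear.analytics as la
>>> width = 0.005                              # hypothetical width of the attribute's interval
>>> scaled_min = la.min(ds['x0'] * (10 / width), dataset=ds).run()
>>> approx_min = scaled_min * (width / 10)     # rescale back to the original units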
-
leapyear.analytics.max(attr, dataset=None, drop_nulls=False)¶
Analysis: Compute the maximum value of an attribute.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the max of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute, this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.
- Returns
Prepared analysis of the max.
- Return type
Note
The maximum reported is the 999/1000 quantile of the attribute.
When the attribute being analyzed has a very narrow range of possible values, the maximum returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than 0.01, the maximum returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by 10/width, compute the maximum, and rescale the returned value by width/10.
The result of this analysis may be very different from the true maximum of the data sample in the following two scenarios:
1. When the underlying attribute distribution has significant outliers (e.g. a very long tail) - this is because the maximum computed is the 999/1000 quantile of the attribute, and
2. When the public upper bound is very different from the true maximum of the data sample - this is because differential privacy aims to minimize the effect of individual records on the output.
-
leapyear.analytics.median(attr, dataset=None, drop_nulls=False)¶
Analysis: Compute the median value of an attribute.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the median of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute, this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.
- Returns
Prepared analysis of the median.
- Return type
Note
When the attribute being analyzed has a very narrow range of possible values, the median returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than
0.01
, the median returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by10/width
, compute the median, and rescale the returned value bywidth/10
.
-
leapyear.analytics.quantile(q, attr, dataset=None, drop_nulls=False)¶
Analysis: Compute a given quantile q of an attribute.
- Parameters
q (float) – Quantile to compute, which must be between 0 and 1 inclusive.
attr (Union[Attribute, str]) – The attribute to compute the quantile of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute, this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.
- Returns
Prepared analysis of the quantile.
- Return type
Note
When the attribute being analyzed has a very narrow range of possible values, the quantile returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than 0.01, the quantile returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by 10/width, compute the quantile, and rescale the returned value by width/10.
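Example (a minimal sketch; ds and the column name x0 are assumed):
>>> import leapyear.analytics as la
>>> la.quantile(0.25, 'x0', dataset=ds).run()  # first quartile
>>> la.quantile(0.75, 'x0', dataset=ds).run()  # third quartile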
-
leapyear.analytics.skewness(attr, dataset=None, drop_nulls=False)¶
Analysis: Compute the skewness of an attribute.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the skewness of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute, this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.
- Returns
Prepared analysis of the skewness.
- Return type
-
leapyear.analytics.kurtosis(attr, dataset=None, drop_nulls=False)¶
Analysis: Compute the excess kurtosis of an attribute.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the kurtosis of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute, this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.
- Returns
Prepared analysis of the kurtosis.
- Return type
-
leapyear.analytics.iqr(attr, dataset=None, drop_nulls=False)¶
Analysis: Compute the interquartile range of an attribute.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the interquartile range of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute, this field is ignored.
drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.
- Returns
Prepared analysis of the iqr.
- Return type
-
leapyear.analytics.histogram(attr, dataset=None, bins=10, interval=None)¶
Analysis: Compute the histogram of the attribute in the dataset.
- Parameters
attr (Union[Attribute, str]) – The attribute to compute the histogram of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when attr is a string.
bins (int) – Number of bins between the bounds. (default=10)
interval (Optional[Tuple[float, float]]) – The lower and upper bound of the histogram. Defaults to the attribute bounds if None.
- Returns
Prepared analysis of the histogram.
- Return type
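Example (a minimal sketch; ds and a REAL column x0 with bounds (-4.0, 4.0) are assumed):
>>> import leapyear.analytics as la
>>> hist = la.histogram('x0', dataset=ds, bins=20, interval=(-4.0, 4.0)).run()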
-
leapyear.analytics.histogram2d(x_attr, y_attr, dataset=None, x_bins=10, y_bins=10, x_range=None, y_range=None)¶
Analysis: Compute the 2D histogram of two attributes in the dataset.
- Parameters
x_attr (Union[Attribute, str]) – The attribute to use to compute the first dimension of the histogram. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
y_attr (Union[Attribute, str]) – The attribute to use to compute the second dimension of the histogram. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.
dataset (Optional[DataSet]) – The dataset to use when x_attr or y_attr are strings.
x_bins (int) – Number of bins between the bounds of the first attribute.
y_bins (int) – Number of bins between the bounds of the second attribute.
x_range (Optional[Tuple[float, float]]) – The lower and upper bound of the first attribute for the histogram.
y_range (Optional[Tuple[float, float]]) – The lower and upper bound of the second attribute for the histogram.
- Returns
Prepared analysis of the histogram.
- Return type
-
leapyear.analytics.correlation_matrix(xs, dataset, *, center=True, scale=True, **kwargs)¶
Analysis: Compute the correlation matrix of the set of attributes.
NOTE: This analysis does not require run().
- Parameters
xs (Sequence[str]) – A list of attribute names to compute the correlation matrix for.
dataset (DataSet) – The DataSet containing these attributes.
center (bool) – Whether to center the columns before computing the correlation matrix. If False, proceed assuming the columns are already centered.
scale (bool) – Whether to divide the covariance matrix by the number of rows. If False, do not divide.
max_timeout_sec – Specifies the maximum amount of time (in seconds) the user is willing to wait for a response. If set to None, this function will poll the server indefinitely. If it is run with scale or center set to True, the timeout will be multiplied. Defaults to waiting forever.
- Returns
The correlation matrix.
- Return type
np.ndarray
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
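Example (a minimal sketch; because this analysis does not require run(), the call returns the matrix directly; ds and the column names are assumed):
>>> import leapyear.analytics as la
>>> corr = la.correlation_matrix(['x0', 'x1', 'x2'], ds)
>>> corr.shape
(3, 3)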
-
leapyear.analytics.covariance_matrix(xs, dataset, *, center=True, scale=True, **kwargs)¶
Analysis: Compute the covariance matrix of the set of attributes.
NOTE: This analysis does not require run().
- Parameters
xs (Sequence[str]) – A list of attribute names that are the features.
dataset (DataSet) – The DataSet of the attributes.
center (bool) – Whether to center the columns before computing the covariance matrix. If False, assume the columns are already centered.
scale (bool) – Whether to divide the matrix by the number of rows. If False, do not divide.
max_timeout_sec – Specifies the maximum amount of time (in seconds) the user is willing to wait for a response. If set to None, this function will poll the server indefinitely. If it is run with scale or center set to True, the timeout will be multiplied. Defaults to waiting forever.
- Returns
The covariance matrix.
- Return type
np.ndarray
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
leapyear.analytics.describe(dataset, attributes=None)¶
Describe the columns of the dataset for use in data exploration.
The describe function provides a way for an analyst to perform initial rough data exploration on a dataset. To get more accurate statistics, the individual functions mean(), count(), et cetera, are recommended. This function does not use the analysis cache of the other statistics functions.
Numeric columns are described by their count, mean, standard deviation, minimum, maximum and the quartiles. Categorical columns (factors and booleans) are described by their count, distinct count and frequency of the most frequent element.
- Parameters
- Returns
Prepared analysis for describing the dataset. Execute the analysis using the run() method.
- Return type
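Example (a minimal sketch; ds is an assumed DataSet):
>>> import leapyear.analytics as la
>>> summary = la.describe(ds).run()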
-
leapyear.analytics.groupby_agg_view(dataset, attrs, agg_attr=None, agg_type=<GroupByAggType.COUNT: 1>, *, max_groupby_agg_keys=100000000, size_threshold=None, agg_attr_and_type=None)¶
Compute an aggregate statistic within each group and output the aggregate results.
Only groups with estimated size larger than minimum_dataset_size will be returned. This parameter can be set in run.
- Parameters
dataset (DataSet) – The DataSet to perform the groupby and aggregation on.
attrs (Sequence[Union[Attribute, str]]) – List of attributes to group by.
agg_attr (Optional[str]) – Compute aggregate statistics on this column within each group.
agg_type (Union[GroupByAggType, str]) – Aggregate type: 'count', 'mean' or 'sum'.
max_groupby_agg_keys (int) – This value prevents submitting computations that have a very large number of groupby keys. By default, it raises GroupbyAggTooManyKeysError if the number of groups exceeds 100000000.
size_threshold (Optional[int]) – Deprecated: see minimum_dataset_size in the run method.
agg_attr_and_type (Union[Tuple[Union[GroupByAggType, str], Optional[str]], Sequence[Tuple[Union[GroupByAggType, str], Optional[str]]], None]) – List of tuples (agg_type, agg_attr). Compute the aggregate statistics defined by agg_type on the column within each group. agg_type can be 'count', 'mean' or 'sum'.
- Returns
Analysis object that can be executed using the run method to return aggregation results. The results can be accessed as a pandas dataframe using .to_dataframe().
- Return type
Note: privacy exposure estimate for this analysis is not supported.
Example
For each age group and gender, compute the mean income.
>>> groupby_agg_view(ds, ["AGE", "GENDER"], "INCOME", "mean").run(minimum_dataset_size=1000)
For each week, compute the mean and total transaction amount.
>>> groupby_agg_view(ds, ["WEEK"], agg_attr_and_type=[("mean", "AMOUNT"), ("sum", "AMOUNT")]).run()
Look at Randomization Intervals for each group (only for ‘count’ and ‘sum’).
>>> rr = groupby_agg_view(ds, ["WEEK"], "AMOUNT", "mean").run(rich_result=True)
>>> ri_dict = rr.metadata
>>> ri_dict
{(1,): RandomizationInterval(...),
 (2,): RandomizationInterval(...),
 ...}
>>> ri_dict[(1,)]
RandomizationInterval(...)
Look at Randomization Interval for multiple aggregate results.
>>> rr = groupby_agg_view(ds, ["YEAR", "WEEK"], agg_attr_and_type=[("mean", "AMOUNT"), ("sum", "AMOUNT")]).run(rich_result=True)
>>> ri_dict = rr.metadata
>>> ri_dict[(2020, 1)][0]
RandomizationInterval(...)
Machine Learning¶
Unsupervised learning¶
-
leapyear.analytics.kmeans(xs, dataset, n_iters=10, n_clusters=3)¶
Analysis: K-means clustering.
Identifies centers of clusters for a set of data points by:
Randomly initializing a chosen number of cluster centers (centroids) in the feature space
Associating each data point with the nearest centroid
Iteratively adjusting centroids to locations based on differentially private computation of the mean for each feature
- Parameters
- Returns
Analysis object that can be executed using the run() method. Once executed, it outputs clustering analysis results, such as centroids.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
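Example (a minimal sketch of training a clustering model and evaluating it with eval_kmeans, documented below; ds and the feature names are assumed):
>>> import leapyear.analytics as la
>>> model = la.kmeans(['x0', 'x1'], ds, n_iters=10, n_clusters=3).run()
>>> nicv = la.eval_kmeans(model, ['x0', 'x1'], ds).run()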
-
leapyear.analytics.eval_kmeans(centroids, xs, dataset)¶
Analysis: Evaluate the K-means model.
Evaluate the clustering model by computing the Normalized Intra Cluster Variance (NICV).
- Parameters
centroids (ClusterModel) – The model (generated using kmeans) to evaluate.
xs (List[str]) – A list of attribute names that are the features.
dataset (DataSet) – The DataSet of the attributes.
- Returns
Analysis representing the evaluation of a clustering model. It can be executed using the run() method to output the evaluation metric value.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
leapyear.analytics.pca(xs, dataset, **kwargs)¶
Principal Component Analysis.
Compute the Principal Component Analysis (PCA) of the set of attributes using a differentially private algorithm.
NOTE: This analysis does not require run().
- Parameters
xs (List[str]) – A list of attribute names representing features to be considered for this analysis.
dataset (DataSet) – DataSet that includes these attributes.
max_timeout_sec – Specifies the maximum amount of time (in seconds) the user is willing to wait for a response. If set to None, this function will poll the server indefinitely. If it is run with scale or center set to True, the timeout will be multiplied. Defaults to waiting forever.
- Return type
Tuple[ndarray, ndarray]
- Returns
explained_variances – Variance explained by each of the principal components - in other words, the variance of each principal component coordinate when considered as a feature on the input dataset.
pca_matrix – Transformation matrix that can be used to translate original features to principal component coordinates. If all principal components are included, this becomes a square matrix corresponding to an orthogonal transformation (e.g. reflection).
This matrix can be used to generate principal component features using the leapyear.dataset.DataSet.transform() operation, as in: tfds = ds.transform(x_vars, pca_matrix, 'pca')
NOTE: Signs may not match PCA transformation matrix computed by scikit-learn.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
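Example (a minimal sketch combining pca with the transform operation shown above; ds and the feature names are assumed):
>>> import leapyear.analytics as la
>>> x_vars = ['x0', 'x1', 'x2']
>>> explained_variances, pca_matrix = la.pca(x_vars, ds)
>>> tfds = ds.transform(x_vars, pca_matrix, 'pca')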
Supervised learning¶
-
leapyear.analytics.basic_linreg(xs, y, dataset, *, affine=True, l1reg=0.0, l2reg=1.0, parameter_bounds=None)¶
Analysis: Linear regression.
Implements a differentially private algorithm to represent the outcome (target) variable as a linear combination of selected features.
Note
To help ensure that the differentially private training process can effectively optimize regression coefficients, it is important to re-scale features (both dependent/target and independent/explanatory) to a similar domain (e.g. [0,1]). This can be done using leapyear.feature.BoundsScaler and will help ensure that the domain being searched for the coefficients will include the optimal model. See the LeapYear guides for this and other recommendations on training accurate regressions.
Note
Differentially private regressions aim to optimize models for predictive tasks, while protecting sensitive information from being learned from the trained model. They are not guaranteed to result in coefficients close to those that a non-differentially private algorithm would learn. In other words, while regressions trained with differential privacy may excel at predictive tasks, keep in mind that these were not designed for inference.
Please see the guides for more details on using regressions in LeapYear.
- Parameters
xs (
List
[Union
[Attribute
,str
]]) – A list of attribute names that are the features.y (
Union
[Attribute
,str
]) – The attribute name that is the outcome.dataset (
DataSet
) – The DataSet of the attributes.affine (
bool
) – IfTrue
, fit an intercept term.l1reg (
float
) – The L1 regularization. Default value:0.0
.l2reg (
float
) – The L2 regularization. Default value:1.0
. Must be at least0.0001
to limit the randomization effect for models optimized via objective perturbation.parameter_bounds (
Optional
[List
[Tuple
[float
,float
]]]) – Restriction on the model parameters, including the intercept. Required for differential privacy. Default value:-10.0 to 10.0 for each parameter
- Returns
Analysis representing the regression problem. It can be executed using the
run()
method to output calibrated model.- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
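Example (a minimal sketch; ds, the feature names and the target name are assumed, with features pre-scaled as recommended in the note above):
>>> import leapyear.analytics as la
>>> analysis = la.basic_linreg(['x0', 'x1', 'x2'], 'y', ds, l2reg=1.0)
>>> model = analysis.run()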
-
leapyear.analytics.generalized_linreg(xs, y, dataset, *, affine=True, l2reg=1.0, weight=None, offset=None, max_iters=25, family='gaussian', link='identity', link_power=0, variance_power=1)¶
Analysis: Generalized linear regression.
Implements a differentially private algorithm to represent outcome (target) variable as a linear combination of selected features. This computation is based on the iterative weighted least squares algorithm. Trains using the “glm” algorithm.
Available generalizations include:
offset of outputs based on pre-existing model - this enables modeling of residual
use of alternative link functions applied to the linear combination of features
application of regularization and weights during model optimization.
Note
Differentially private regressions aim to optimize models for predictive tasks, while protecting sensitive information from being learned from the trained model. They are not guaranteed to result in coefficients close to those that a non-differentially private algorithm would learn. In other words, while regressions trained with differential privacy may excel at predictive tasks, keep in mind that these were not designed for inference.
Please see the guides for more details on using regressions in LeapYear.
- Parameters
xs (List[Union[Attribute, str]]) – A list of Attributes or attribute names to be used as features.
y (Union[Attribute, str]) – The Attribute or attribute name of the target.
dataset (DataSet) – The DataSet of the attributes.
affine (bool) – True if the algorithm should fit an intercept term.
l2reg (float) – The L2 regularization parameter. Must be non-negative.
weight (Union[Attribute, str, None]) – Optional column used to weight each sample. Implies generalized regression.
offset (Union[Attribute, str, None]) – Optional column for the offset in offset regression. Implies generalized regression.
max_iters (Optional[int]) – Optional maximum number of iterations for fitting the regression. Note that regardless of this setting, the system will often stop before reaching max_iters - e.g. after a single iteration. In such cases, a higher value for max_iters may lead to less privacy allocated to each iteration and, ultimately, a higher randomization effect.
family (Optional[str]) – Optional distribution of the label. Implies generalized regression. Possible values are 'gaussian' (the default), 'poisson', 'gamma' and 'tweedie'.
link (Optional[str]) – Optional link function between the mean of the label distribution and the prediction. Implies generalized regression. Possible values depend on family: 'gaussian' supports only 'identity' (default), 'log' and 'inverse'; 'poisson' supports only 'log' (default), 'identity' and 'sqrt'; 'gamma' supports only 'inverse' (default), 'identity' and 'log'. There is no link function for the 'tweedie' family; use the variance_power and link_power parameters instead.
link_power (Optional[int]) – For the 'tweedie' distribution only, the exponent of the link function. The default value is 0, which is equivalent to the 'identity' link.
variance_power (Optional[int]) – For the 'tweedie' distribution only, the exponent of the variance. The default value is 1, which is equivalent to the 'gaussian' family.
- Returns
Analysis of the regression problem, which can be executed using the run() function to output the calibrated model.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
leapyear.analytics.logreg(xs, y, dataset, affine=True, l1reg=0.0, l2reg=1.0)¶
Analysis: Logistic regression.
Implements a differentially private algorithm to represent the outcome (target) variable as a logit-transformation of a linear combination of selected features. Trains using the "basic" algorithm.
Available generalizations include:
regularization applied during model optimization
Note
To help ensure that the differentially private training process can effectively optimize regression coefficients, it is important to re-scale features (both dependent/target and independent/explanatory) to a similar domain (e.g. [0,1]). This can be done using leapyear.feature.BoundsScaler and will help ensure that the domain being searched for the coefficients will include the optimal model. See the LeapYear guides for this and other recommendations on training accurate regressions.
Note
Differentially private regressions aim to optimize models for predictive tasks, while protecting sensitive information from being learned from the trained model. They are not guaranteed to result in coefficients close to those that a non-differentially private algorithm would learn. In other words, while regressions trained with differential privacy may excel at predictive tasks, keep in mind that these were not designed for inference.
Please see the guides for more details on using regressions in LeapYear.
- Parameters
xs (
List
[Union
[Attribute
,str
]]) – A list of attribute names that are the features.y (
Union
[Attribute
,str
]) – The attribute name that is the outcome.dataset (
DataSet
) – The DataSet of the attributes.affine (
bool
) – IfTrue
, fit an intercept term.l1reg (
float
) – The L1 regularization. Default value:0.0
.l2reg (
float
) – The L2 regularization. Default value:1.0
.
- Returns
Analysis training the logistic regression model. It can be executed using the
run()
method to output the calibrated model.- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
leapyear.analytics.generalized_logreg(xs, y, dataset, *, affine=True, l1reg=1.0, l2reg=1.0, weight=None, offset=None, max_iters=25, link='logit')¶
Analysis: Generalized logistic regression.
Implements a differentially private algorithm to represent the outcome (target) variable as a logit-transformation of a linear combination of selected features. This computation is based on the iterative weighted least squares algorithm. Trains using the “glm” algorithm.
Available generalizations include:
offset of outputs based on pre-existing model - this enables modeling of residual
use of alternative link functions applied to the linear combination of features
application of regularization and weights during model optimization.
Note
Differentially private regressions aim to optimize models for predictive tasks, while protecting sensitive information from being learned from the trained model. They are not guaranteed to result in coefficients close to those that a non-differentially private algorithm would learn. In other words, while regressions trained with differential privacy may excel at predictive tasks, keep in mind that these were not designed for inference.
Please see the guides for more details on using regressions in LeapYear.
- Parameters
xs (List[Union[Attribute, str]]) – A list of Attributes or attribute names to be used as features.
y (Union[Attribute, str]) – The Attribute or attribute name of the target.
dataset (DataSet) – The DataSet of the attributes.
affine (bool) – True if the algorithm should fit an intercept term.
l2reg (float) – The L2 regularization parameter. Must be non-negative.
weight (Union[Attribute, str, None]) – Optional column used to weight each sample. Implies generalized regression.
offset (Union[Attribute, str, None]) – Optional column for the offset in offset regression. Implies generalized regression.
max_iters (Optional[int]) – Optional maximum number of iterations for fitting the regression. Note that regardless of this setting, the system will often stop before reaching max_iters - e.g. after a single iteration. In such cases, a higher value for max_iters may lead to less privacy allocated to each iteration and, ultimately, a higher randomization effect.
link (Optional[str]) – Optional link function between the mean of the label distribution and the prediction. Implies generalized regression. Possible values are 'logit' (default), 'probit' and 'cloglog'.
- Returns
Analysis of the regression problem, which can be executed using the run() function to output the calibrated model.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
leapyear.analytics.gradient_boosted_tree_classifier(xs, y, dataset, max_depth=3, max_iters=5, max_bins=32)¶
Analysis: Gradient boosted tree classifier.
This analysis trains a randomized variant of gradient boosted tree classifier to predict a BOOLEAN outcome (target).
The algorithm works by iteratively training individual decision trees to predict a “residual” of the model built so far, and then integrating each newly built decision tree into the ensemble model to better predict the probability of the positive label.
Weights are used at different stages:
during training of individual decision trees, to focus attention on the areas where the model consistently underperforms, and
when combining individual decision trees to predict probability of the positive label.
Calibrated level of randomization is applied to individual leaves of the decision trees to help protect privacy of the individual records used for model training.
- Parameters
xs (
List
[Union
[Attribute
,str
]]) – A list of attributes or attribute names that are used as explanatory features for the analysis. Each attribute must be eitherBOOL
,INT
,REAL
orFACTOR
. Nullable types are not supported and must be converted to non-nullable - e.g. viacoalesce
.y (
Union
[Attribute
,str
]) – The attribute or attribute name that is used as an outcome (target) of the classification model. Must be BOOLEAN type, as only binary classification models are supported. Nullable types are not supported and must be converted to non-nullable - e.g. viacoalesce
.dataset (
DataSet
) – The DataSet containing both explanatory features and outcome attributes.max_depth (
int
) – The maximum depth (or height) of any tree in the ensemble produced by the algorithm. Default: 3max_iters (
int
) – The maximum number of iterations of the algorithm. This corresponds to the maximum number of individual decision trees in the ensemble. Default: 5max_bins (
int
) –The maximum number of bins for features used in constructing trees. Default: 32
Note
Maximum number of bins should be set to no less than the number of distinct possible values of the FACTOR attributes used as explanatory features.
- Returns
Analysis that will train the gradient boosted tree classifier. It can be executed using the run() method.
- Return type
See also
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
leapyear.analytics.random_forest(xs, y, dataset, n_trees=100, height=3)¶
Analysis: Random Forest Classifier.
Generate a random forest model to predict probability associated with each target class.
Random forests combine many decision trees in order to reduce the risk of overfitting.
Each decision tree is developed on a random subset of observations - and is limited to prescribed height.
Individual node split decisions are made to maximize split value (or gain) - with a variation that a differentially private algorithm is used to count the number of observations belonging to each target class on both sides of the split.
Specifically, split value (or gain) is defined as reduction in combined Gini impurity measure, associated with introducing the split for a given parent node. Here
Gini impurity for any given node (parent or child) is calculated based on distribution of observations within the node across different outcome (target) classes
To compute combined impurity of the pair of nodes, individual node impurities for the two children nodes are averaged proportionately to their share of observations
Categorical features are typically handled by evaluating various splits corresponding to random subsets of the available categories.
- Parameters
xs (
List
[Union
[Attribute
,str
]]) – A list of attribute names that are used as features for explanatory analysis.y (
Union
[Attribute
,str
]) – The attribute name that is the outcome (target).dataset (
DataSet
) – The DataSet containing both explanatory features and outcome attributes.n_trees (
int
) – The number of trees to use in the random forest. Default: 100height (
int
) – The maximum height of the trees. Default: 3
- Returns
Analysis training the random forest model. It can be executed using the run() method to output the analysis results, which include the calibrated random forest model, feature importance statistics, etc.
- Return type
See also
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
leapyear.analytics.regression_trees(xs, y, dataset, n_trees=100, height=3)¶
Analysis: Random Forest Regressor (regression trees).
Generate a regression trees model to predict value of target variable.
Regression trees are built similarly to random forests, but instead of predicting the probability that the target variable takes a certain categorical value (i.e., classification), they predict a real value of the target variable (i.e., regression).
The impurity metric in this case is the variance of the target variable for the datapoints that fall into the current node’s partition.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attribute names that are used as features for explanatory analysis.
y (Union[Attribute, str]) – The attribute name that is the outcome (target).
dataset (DataSet) – The DataSet containing both explanatory features and the target attribute.
n_trees (int) – The number of trees to use in the random forest. Default: 100.
height (int) – The maximum height of the trees. Default: 3.
- Returns
Analysis training the regression trees model. It can be executed using the run() method to output the analysis results, which include the calibrated random forest model, feature importance statistics, etc.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
leapyear.analytics.eval_linreg(glm, xs, y, dataset, metric='mse')¶
Analysis: Evaluate a linear regression model.
- Parameters
glm (GLM) – The model (generated using linreg) to evaluate.
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
metric – Linear regression evaluation metric: 'mse'/'mean_squared_error' or 'mae'/'mean_absolute_error'.
Note
During the calculation of
mse
andmae
metrics, the individual values of absolute error are restricted to be no greater than the length of the interval of possible values of the target attribute, as seen in the dataset schema. For example, if the target attribute contains values in the interval [-50, 50], then the absolute error of any individual prediction used in computing the mean will be no greater than 100.
- Returns
Analysis representing the evaluation of a regression model. It can be executed using the run() method to output the evaluation metric value.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
leapyear.analytics.eval_logreg(glm, xs, y, dataset, metric='accuracy')¶
Analysis: Evaluate a logistic regression model.
- Parameters
glm (GLM) – The model (generated using logreg) to evaluate.
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
metric (Union[str, Metric]) – Logistic regression evaluation metric. Examples: 'accuracy', 'logloss', 'auroc', 'aupr'.
- Returns
Analysis representing the evaluation of a logistic regression model. It can be executed using the run() method to output the evaluation metric value.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
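Example (a minimal sketch of training a model with logreg and evaluating it; ds and the hypothetical boolean target column label are assumed):
>>> import leapyear.analytics as la
>>> model = la.logreg(['x0', 'x1'], 'label', ds).run()
>>> accuracy = la.eval_logreg(model, ['x0', 'x1'], 'label', ds, metric='accuracy').run()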
-
leapyear.analytics.eval_gbt_classifier(gbt, xs, y, dataset, metric='accuracy')¶
Analysis: Evaluate a gradient boosted tree (GBT) classifier model.
- Parameters
gbt (GradientBoostedTreeClassifier) – The model to evaluate.
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
metric (Union[str, Metric]) – GBT evaluation metric. Currently only supports 'accuracy'.
- Returns
Analysis representing the evaluation of a GBT classifier model. It can be executed using the run() method to output the value of the evaluation metric.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
leapyear.analytics.eval_random_forest(rf, xs, y, dataset, metric='accuracy')¶
Analysis: Evaluate a random forest model.
- Parameters
rf (RandomForestClassifier) – The model (generated using random_forest) to evaluate.
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
metric (Union[str, Metric]) – Forest evaluation metric. Examples: 'logloss', 'accuracy', 'auroc', 'aupr'.
- Returns
Analysis representing the evaluation of a random forest model. It can be executed using the run() method to output the evaluation metric value.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
leapyear.analytics.eval_regression_trees(rf, xs, y, dataset, metric='mse')¶
Analysis: Evaluate a regression trees model.
- Parameters
rf (RandomForestClassifier) – The model (generated using regression_trees) to evaluate.
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
metric – Model evaluation metric. Examples: 'mse'/'mean_squared_error' or 'mae'/'mean_absolute_error'.
Note
During the calculation of
mse
andmae
metrics, the individual values of absolute error are restricted to be no greater than the length of the interval of possible values of the target attribute, as seen in the dataset schema. For example, if the target attribute contains values in the interval [-50, 50], then the absolute error of any individual prediction used in computing the mean will be no greater than 100.
- Returns
Analysis representing the evaluation of a regression trees model. It can be executed using the run() method to output the evaluation metric value.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
leapyear.analytics.roc(model, xs, y, dataset, thresholds=5)¶
Compute the ConfusionCurves.
For each threshold value, compute the normalized confusion matrix using the model. The confusion matrix contains the true positive rate, the true negative rate, the false positive rate and the false negative rate.
- Parameters
model (Union[GLM, RandomForestClassifier]) – The model to evaluate the confusion curves on.
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
thresholds (Union[int, Sequence[float]]) – If an int, generate approximately that many thresholds (rounded to the closest power of 2) using recursive medians. If a sequence of floats, use the list as the thresholds.
- Returns
Analysis of the confusion curve, which can be executed using the run() method to output various evaluation metrics.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
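Example (a minimal sketch; model is assumed to be a previously trained classifier, and ds and the hypothetical target column label are assumed):
>>> import leapyear.analytics as la
>>> curves = la.roc(model, ['x0', 'x1'], 'label', ds, thresholds=8).run()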
-
leapyear.analytics.cross_val_score_linreg(xs, y, dataset, *, affine=True, l1reg=1.0, l2reg=1.0, cv=3, metric='mean_squared_error', parameter_bounds=None)¶
Analysis: Compute the linear regression cross-validation score of the set of attributes.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
affine (bool) – If True, fit an intercept term.
l1reg (float) – The L1 regularization. Default value: 1.0.
l2reg (float) – The L2 regularization. Default value: 1.0. Must be at least 0.0001 to limit the randomization effect.
cv (int) – Number of folds in k-fold cross validation.
metric (Union[str, Metric]) – The metric for evaluating the regression. Examples: 'mae', 'mse', 'r2'.
parameter_bounds (Optional[List[Tuple[float, float]]]) – Restriction on the model parameters, including the intercept. Required for differential privacy. Default value: -10.0 to 10.0 for each parameter.
- Returns
Analysis of the cross-validation scores for the regression model. It can be executed using the run() method to generate cross-validation results.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
leapyear.analytics.cross_val_score_logreg(xs, y, dataset, cv=3, affine=True, l1reg=0.1, l2reg=0.1, metric='accuracy')¶
Analysis: Compute the logistic regression cross-validation score of the set of attributes.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
affine (bool) – If True, fit an intercept term.
l1reg (float) – The L1 regularization. Default value: 0.1.
l2reg (float) – The L2 regularization. Default value: 0.1. Must be at least 0.0001 to limit the randomization effect.
cv (int) – Number of folds in k-fold cross validation.
metric (Union[str, Metric]) – The metric for evaluating the logistic regression. Examples: 'accuracy', 'logloss', 'auroc', 'aupr'.
- Returns
Analysis of the cross-validation scores for the regression model. It can be executed using the run() method to generate cross-validation results.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
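Example (a minimal sketch; ds and the hypothetical boolean target column label are assumed):
>>> import leapyear.analytics as la
>>> scores = la.cross_val_score_logreg(['x0', 'x1'], 'label', ds, cv=3, metric='auroc').run()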
-
leapyear.analytics.cross_val_score_random_forest(xs, y, dataset, n_trees=100, height=3, cv=3, metric='mean_squared_error')¶
Analysis: Compute the random forest cross-validation score of the set of attributes.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
n_trees (int) – Number of trees.
height (int) – Maximum height of trees.
cv (int) – Number of folds in k-fold cross validation.
metric (Union[str, Metric]) – The metric for evaluating the model.
- Returns
Analysis of the cross-validation scores for the random forest model. It can be executed using the run() method to generate cross-validation results.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
leapyear.analytics.cross_val_score_regression_trees(xs, y, dataset, n_trees=100, height=3, cv=3, metric='mean_squared_error')¶
Analysis: Compute the regression trees cross-validation score of the set of attributes.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The DataSet of the attributes.
n_trees (int) – Number of trees.
height (int) – Maximum height of trees.
cv (int) – Number of folds in k-fold cross validation.
metric (Union[str, Metric]) – The metric for evaluating the regression.
- Returns
Analysis of the cross-validation scores for the regression trees model. It can be executed using the run() method to generate cross-validation results.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
leapyear.analytics.hyperopt_linreg(xs, y, dataset, *, cv, train_fraction, metric, n_iter=100, l1_bounds=(1e-10, 10000000000.0), l2_bounds=(1e-10, 10000000000.0), fit_intercept=None, parameter_bounds=None)¶
Analysis: Hyperparameter optimization for linear regression.
Calibrate a linear regression model by optimizing its cross-validation score with respect to model hyperparameters - the L1 and L2 regularization parameters and the presence of an intercept.
See below for pseudo-code of the algorithm:
Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = []
For 1..n_iter:
    pick a set of hyperparameters (hp) to test based on cv_history.
    use hp to calibrate a model on each cross-validation set
    evaluate it on the corresponding sample set aside for cross-validation
    compute an average cv score and append it to cv_history.
Pick the hyperparameters with the best cv score.
Train a model using the complete ds_train_val data set.
Evaluate the model on the holdout data set.
Return the resulting model and its performance on the holdout set.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attributes that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The dataset containing the attributes.
cv (int) – The number of cross-validation steps to perform for each candidate set of hyperparameters.
train_fraction (float) – The fraction of the dataset to set aside for model training and cross-validation – to be split further according to the k-fold cross-validation strategy.
metric (Union[str, Metric]) – Model performance metric to optimize. Examples: 'mean_squared_error', 'mean_absolute_error', 'r2'.
n_iter (int) – The number of optimization steps. Default: 100
l1_bounds (Tuple[float, float]) – Lower and upper bounds for the L1 regularization. Default: (1E-10, 1E10)
l2_bounds (Tuple[float, float]) – Lower and upper bounds for the L2 regularization. Default: (1E-10, 1E10)
fit_intercept (Optional[bool]) – If None, the search will consider both options.
parameter_bounds (Optional[List[Tuple[float, float]]]) – Restriction on the model parameters, including the intercept. Required for differential privacy. Default value: -10.0 to 10.0 for each parameter.
- Returns
Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including the model calibrated with the recommended hyperparameters and its performance on the holdout dataset.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
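Example (a minimal sketch; ds, the feature names and the target name are assumed):
>>> import leapyear.analytics as la
>>> analysis = la.hyperopt_linreg(
...     ['x0', 'x1', 'x2'], 'y', ds,
...     cv=3, train_fraction=0.8, metric='mean_squared_error', n_iter=20)
>>> result = analysis.run()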
-
leapyear.analytics.hyperopt_logreg(xs, y, dataset, cv, train_fraction, metric, n_iter=100, l1_bounds=(1e-10, 10000000000.0), l2_bounds=(1e-10, 10000000000.0), fit_intercept=None)¶
Analysis: Hyperparameter optimization for logistic regression.
Calibrate a logistic regression model by optimizing its cross-validation score with respect to model hyperparameters - the L1 and L2 regularization parameters and the presence of an intercept.
See below for pseudo-code of the algorithm:
Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = []
For 1..n_iter:
    pick a set of hyperparameters (hp) to test based on cv_history.
    use hp to calibrate a model on each cross-validation set
    evaluate it on the corresponding sample set aside for cross-validation
    compute an average cv score and append it to cv_history.
Pick the hyperparameters with the best cv score.
Train a model using the complete ds_train_val data set.
Evaluate the model on the holdout data set.
Return the resulting model and its performance on the holdout set.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attributes that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The dataset containing the attributes.
cv (int) – The number of cross-validation steps to perform for each candidate set of hyperparameters.
train_fraction (float) – The fraction of the dataset to set aside for model training and cross-validation – to be split further according to the k-fold cross-validation strategy.
metric (Union[str, Metric]) – Model performance metric to optimize. Examples: 'accuracy', 'logloss', 'auroc', 'aupr'.
n_iter (int) – The number of optimization steps. Default: 100
l1_bounds (Tuple[float, float]) – Lower and upper bounds for the L1 regularization. Default: (1E-10, 1E10)
l2_bounds (Tuple[float, float]) – Lower and upper bounds for the L2 regularization. Default: (1E-10, 1E10)
fit_intercept (Optional[bool]) – If None, the search will consider both options.
- Returns
Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including the model calibrated with the recommended hyperparameters and its performance on the holdout dataset.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
leapyear.analytics.hyperopt_rf(xs, y, dataset, cv, train_fraction, metric, n_iter=100, max_trees=1000, max_depth=20)¶
Analysis: Hyperparameter optimization for a random forest model.
Calibrate a random forest model by optimizing its cross-validation score with respect to model hyperparameters - the number of trees and the individual tree depth (or height) limit.
See below for pseudo-code of the algorithm:
Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = []
For 1..n_iter:
    pick a set of hyperparameters (hp) to test based on cv_history.
    use hp to calibrate a model on each cross-validation set
    evaluate it on the corresponding sample set aside for cross-validation
    compute an average cv score and append it to cv_history.
Pick the hyperparameters with the best cv score.
Train a model using the complete ds_train_val data set.
Evaluate the model on the holdout data set.
Return the resulting model and its performance on the holdout set.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attributes that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The dataset containing the attributes.
cv (int) – The number of cross-validation steps to perform for each candidate set of hyperparameters.
train_fraction (float) – The fraction of the dataset to set aside for model training and cross-validation – to be split further according to the k-fold cross-validation strategy.
metric (Union[str, Metric]) – The metric to optimize. Examples: 'accuracy', 'logloss', 'auroc', 'aupr'.
n_iter (int) – The number of optimization steps. Default: 100
max_trees (int) – Maximum number of trees. Default: 1000
max_depth (int) – Maximum tree depth. Default: 20
- Returns
Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including the model calibrated with the recommended hyperparameters and its performance on the holdout dataset.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
leapyear.analytics.hyperopt_regression_trees(xs, y, dataset, cv, train_fraction, metric, n_iter=100, max_trees=1000, max_depth=20)¶
Analysis: Hyperparameter optimization for a regression trees model.
Calibrate a regression trees model by optimizing its cross-validation score with respect to model hyperparameters - the number of trees and the individual tree depth (or height) limit.
See below for pseudo-code of the algorithm:
Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = []
For 1..n_iter:
    pick a set of hyperparameters (hp) to test based on cv_history.
    use hp to calibrate a model on each cross-validation set
    evaluate it on the corresponding sample set aside for cross-validation
    compute an average cv score and append it to cv_history.
Pick the hyperparameters with the best cv score.
Train a model using the complete ds_train_val data set.
Evaluate the model on the holdout data set.
Return the resulting model and its performance on the holdout set.
- Parameters
xs (List[Union[Attribute, str]]) – A list of attributes that are the features.
y (Union[Attribute, str]) – The attribute name that is the outcome.
dataset (DataSet) – The dataset containing the attributes.
cv (int) – The number of cross-validation steps to perform for each candidate set of hyperparameters.
train_fraction (float) – The fraction of the dataset to set aside for model training and cross-validation – to be split further according to the k-fold cross-validation strategy.
metric (Union[str, Metric]) – The metric to optimize. Examples: 'mae', 'mse', 'r2'.
n_iter (int) – The number of optimization steps. Default: 100
max_trees (int) – Maximum number of trees. Default: 1000
max_depth (int) – Maximum tree depth. Default: 20
- Returns
Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including the model calibrated with the recommended hyperparameters and its performance on the holdout dataset.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
Context Managers¶
-
leapyear.analytics.ignore_computation_cache()¶
Temporary context where computations do not utilize the computation cache.
The computation cache is intended to prevent wasting privacy exposure on queries that were previously computed. Entering this context manager will disable the use of the cache and allow repeated computations to return different differentially private answers.
Example
An administrator wants to run a count multiple times to estimate the random distribution of responses around the precise value.
>>> with ignore_computation_cache():
...     results = [la.count_rows(table).run() for _ in range(10)]
See also
Note
Additional permissions may be required to disable the computation cache.
- Return type
-
leapyear.analytics.precise_computations(precise=True)¶
Temporary context specifying whether the computations are precise or not.
Computations requested within this context would be executed in precise mode, where differential privacy is not applied.
- Parameters
precise (bool) – True to enable precise computations within the context, False to disable them.
Example
An administrator wants to compare the responses of a number of computations with and without differential privacy applied. Precise mode may not be available for all computations.
>>> def my_computation():
...     symbols = ("APPL", "GOOG", "MSFT")
...     return [la.count_rows(table.where(col("SYM") == lit(val))).run() for val in symbols]
>>>
>>> res_dp = my_computation()
>>> with precise_computations():
...     res_no_dp = my_computation()
See also
Note
Additional permissions may be necessary to enable precise computations.
- Return type
Save/Load Models¶
LeapYear save and load machine learning models utilities.
-
leapyear.ml_import_export.
save
(model, path_or_fd)¶ Save machine learning models in JSON to either a file or a file-like object.
- Parameters
model (
Union
[ClusterModel
,GLM
,GradientBoostedTreeClassifier
,RandomForestClassifier
,RandomForestRegressor
,RichResult
[Union
[ClusterModel
,GLM
,GradientBoostedTreeClassifier
,RandomForestClassifier
,RandomForestRegressor
],Any
]]) – Any machine learning model executed using the run() method.
path_or_fd – The path where to save the file in the file system or a descriptor for an in-memory stream.
Example
>>> from leapyear.ml_import_export import save
>>> save(model, 'model.json')
- Return type
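Since path_or_fd also accepts a file-like object, the model can be serialized to an in-memory stream. A sketch, assuming model holds a trained model and that the JSON is written as text:
>>> import io
>>> from leapyear.ml_import_export import save
>>> buffer = io.StringIO()
>>> save(model, buffer)  # write the model JSON to the in-memory stream
>>> serialized = buffer.getvalue()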
-
leapyear.ml_import_export.
load
(path_or_fd, expected_model_type=None, **kwargs)¶ Load machine learning models from a file or a file-like object.
- Parameters
path_or_fd – The path in the file system or an in-memory stream from which to load the model.
expected_model_type – If None, the type of the loaded model is not checked. Otherwise, checks that the loaded model is of the expected type.
rf_type – When loading RandomForest models with serialization number 0, setting this to “classification” or “regression” will load the model as a RandomForestClassifier or RandomForestRegressor object, respectively. If not specified, loading a RandomForest model will raise an error. The value is ignored for all other model types.
Examples
Loading a previously saved model of unspecified type
>>> from leapyear.ml_import_export import load
>>> model = load('model.json')
Loading a previously saved RandomForestClassifier model
>>> from leapyear.ml_import_export import load
>>> model = load('random_forest_classifier.json', RandomForestClassifier)
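Loading an older RandomForest model saved with serialization number 0; a sketch with an illustrative file name, passing rf_type through **kwargs as described above
>>> from leapyear.ml_import_export import load
>>> model = load('old_random_forest.json', rf_type='classification')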
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
Union
[ClusterModel
,GLM
,GradientBoostedTreeClassifier
,RandomForestClassifier
,RandomForestRegressor
]
Module leapyear.analytics.classes¶
Classes related to LeapYear analyses.
Analysis Classes¶
This section documents classes that define and process analyses in the LeapYear system.
Main Classes¶
-
class
leapyear.analytics.classes.
Analysis
(analysis, relation)¶ Any analysis that can be performed on the LeapYear server.
It contains an analysis and a relation on which the analysis should be performed.
-
check
(*, cache=None, allow_max_budget_allocation=None, precise=None, **kwargs)¶ Check the analysis for errors.
If any errors are present, the function will raise a descriptive error. If no errors are found, then the function will return self.
- Return type
Analysis
[~_Result, ~_Model, ~_ModelMetadata]
-
run
(*, detach: Literal[True], cache: Optional[bool] = None, allow_max_budget_allocation: Optional[bool] = None, precise: Optional[bool] = None, rich_result: bool = False, max_timeout_sec: Optional[float] = None, minimum_dataset_size: Optional[int] = None, **kwargs: Any) → leapyear.analytics.classes.AsyncAnalysis[_Result, _Model, _ModelMetadata]¶ -
run
(*, rich_result: Literal[True], cache: Optional[bool] = None, allow_max_budget_allocation: Optional[bool] = None, precise: Optional[bool] = None, max_timeout_sec: Optional[float] = None, minimum_dataset_size: Optional[int] = None, **kwargs: Any) → leapyear.analytics.classes.RichResult[_Model, _ModelMetadata] -
run
(*, cache: Optional[bool] = None, allow_max_budget_allocation: Optional[bool] = None, precise: Optional[bool] = None, max_timeout_sec: Optional[float] = None, minimum_dataset_size: Optional[int] = None, **kwargs: Any) → _Model Run analysis.
- Parameters
detach – If True, when the analysis is sent to the server, it will return immediately with an AsyncAnalysis object. The analysis will be evaluated in the background on the server. The client can check the result of the analysis later using the AsyncAnalysis object.
cache – If True, then the first time this analysis is executed on the LeapYear server, the result will be cached. Subsequent calls to the identical analysis (with cache=True) will fetch the cached version and not contribute to the security cost. If None, the default caching behavior will be obtained from the connection.
allow_max_budget_allocation – Default is True. If False, raise a leapyear.exceptions.DataSetTooSmallException when the randomness calibration system would run an analysis with the maximum privacy exposure per computation. If None, the default value will be obtained from the connection.
precise – When True, request an answer with no noise added to the computation.
rich_result – When True, return a result with additional (potentially analysis-specific) metadata, including the privacy exposure expended in the process of performing the analysis. Defaults to False.
max_timeout_sec – When detach=False, specifies the maximum amount of time (in seconds) the user is willing to wait for a response. If set to None, the analysis will poll the server indefinitely. When computing on big data or long-running machine learning tasks, we recommend using the detach=True feature and the functions provided in AsyncAnalysis. Defaults to waiting forever.
minimum_dataset_size – When minimum_dataset_size is set, prevent computations on data sets that have fewer rows than the specified value. We recommend using this when an analysis could filter down to a small number of records, potentially consuming more privacy budget than is desired. Setting this will spend a small amount of privacy budget to estimate the number of rows involved in a computation. This value is superseded by an admin-defined minimum_dataset_size parameter, if the admin’s value is larger.
- Returns
The result of the analysis. Multiple return types are possible.
- Return type
Union[_Model, AsyncAnalysis, RichResult[_Model, _ModelMetadata]]
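Example
A sketch of the common run() modes, assuming a dataset ds, the la alias for leapyear.analytics, and the wait() method referenced under AsyncAnalysis.wait_to_cancel below:
>>> from leapyear import analytics as la
>>> job = la.count_rows(ds).run(detach=True)  # returns an AsyncAnalysis
>>> job.check_status()
>>> result = job.wait()  # block until the server finishes
>>> # Synchronous run with a timeout and additional metadata:
>>> rich = la.count_rows(ds).run(rich_result=True, max_timeout_sec=60.0)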
-
maximum_privacy_exposure
(minimum_dataset_size=None)¶ Maximum privacy exposure associated with running this analysis.
Estimate the maximum incremental privacy exposure that could result from running this computation for the current user. The result is represented as a percentage of privacy exposure limit for each data source.
Note: Estimating maximum privacy exposure may incur a small amount of privacy exposure.
- Parameters
minimum_dataset_size (Optional[int]) – When minimum_dataset_size is set, prevent computations on data sets that have fewer rows than the specified value.
- Returns
A dictionary that maps a TableIdentifier to the estimated maximum fractional privacy exposure that running the analysis would incur for the associated table.
- Return type
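Example
A sketch of inspecting the estimate before running an analysis; the dataset ds and the la alias are assumed as in earlier examples:
>>> exposure = la.count_rows(ds).maximum_privacy_exposure()
>>> for table_id, fraction in exposure.items():
...     print(table_id, fraction)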
-
-
class
leapyear.analytics.classes.
AsyncAnalysis
(async_job_id, *, analysis, rich_result)¶ Asynchronous job for running analysis queries.
-
check_status
()¶ Check the status of the given asynchronous job.
- Return type
-
cancel
()¶ Cancel the job.
-
wait_to_cancel
(**kwargs)¶ Wait for the given asynchronous job to finish.
Same as ‘wait’, but doesn’t error on cancellations. Takes the same arguments as ‘wait’.
- Return type
-
Analysis Subclasses¶
This section describes the subclasses of Analysis
, generally determined by what type of output they produce.
-
class
leapyear.analytics.classes.
BoundsAnalysis
(analysis, relation)¶ Analysis that results in a lower and upper bound.
-
class
leapyear.analytics.classes.
ClusteringAnalysis
(analysis, relation)¶ Analysis that results in a clustering model.
-
class
leapyear.analytics.classes.
ConfusionModelAnalysis
(analysis, relation)¶ Analysis that results in a ConfusionCurve object.
-
class
leapyear.analytics.classes.
CountAnalysis
(analysis, relation)¶ Analysis that results in a scalar count value.
-
class
leapyear.analytics.classes.
CountAnalysisWithRI
(analysis, relation)¶ Analysis that computes a scalar count value.
The user can request additional information about the computation with run(rich_result=True). A RandomizationInterval object will be generated.
-
class
leapyear.analytics.classes.
CrossValidationAnalysis
(analysis, relation)¶ Analysis that results in multiple results of the same type.
-
class
leapyear.analytics.classes.
DescribeAnalysis
(analysis, relation)¶ Analysis that produces a model describing a dataset.
-
class
leapyear.analytics.classes.
FailAnalysis
(analysis, relation)¶ An analysis that always fails.
-
class
leapyear.analytics.classes.
ForestModelClassifierAnalysis
(analysis, relation)¶ Analysis that results in a classification forest model.
-
class
leapyear.analytics.classes.
ForestModelRegressionAnalysis
(analysis, relation)¶ Analysis that results in a regression forest model.
-
class
leapyear.analytics.classes.
GradientBoostedTreeClassifierModelAnalysis
(analysis, relation)¶ Gradient boosted tree classifier analysis.
When executed, this analysis returns a model object of class
GradientBoostedTreeClassifier
.
-
class
leapyear.analytics.classes.
GroupbyAggAnalysis
(analysis, relation)¶ Analysis that results in a
GroupbyAgg
.
-
class
leapyear.analytics.classes.
GenLinAnalysis
(analysis, relation)¶ Analysis that results in a generalized linear model.
-
class
leapyear.analytics.classes.
Histogram2DAnalysis
(analysis, relation)¶ Analysis that results in a 2d histogram.
-
class
leapyear.analytics.classes.
HistogramAnalysis
(analysis, relation)¶ Analysis that results in a histogram.
-
class
leapyear.analytics.classes.
HyperOptAnalysis
(analysis, relation)¶ Analysis that returns the result of hyperparameter optimization.
-
class
leapyear.analytics.classes.
MatrixAnalysis
(analysis, relation)¶ Analysis that results in a matrix of float values.
-
class
leapyear.analytics.classes.
ScalarAnalysis
(analysis, relation)¶ Analysis that results in a scalar float value.
-
class
leapyear.analytics.classes.
ScalarAnalysisWithRI
(analysis, relation)¶ Analysis that computes a scalar value.
The user can request additional information about the computation with run(rich_result=True). In this case, a RandomizationInterval object will be generated. This interval is likely to include the exact value of the computation on the data sample.
-
class
leapyear.analytics.classes.
ScalarFromHistogramAnalysis
(f, *args)¶ Analysis that uses a histogram to compute a scalar value.
-
class
leapyear.analytics.classes.
SleepAnalysis
(analysis, relation)¶ An analysis that will sleep for a set number of microseconds.
-
class
leapyear.analytics.classes.
TypeAnalysis
(analysis, relation)¶ Analysis that results in a list of counts associated with types.
Rich Results¶
This section documents classes related to rich results and privacy exposure measurements.
-
class
leapyear.analytics.classes.
RandomizationInterval
(estimation_method: str, confidence_level: float, low: float, high: float)¶ An interval estimating the uncertainty in the exact answer, given randomized output.
A RandomizationInterval can be generated for a subset of the analyses offered by LeapYear by running an analysis with run(rich_result=True).
The interval between low and high is expected to include the exact value of the computation on the data sample with the stated confidence_level (e.g. 95%).
- Parameters
confidence_level (float) – The confidence that the exact answer lies within the interval.
low (float) – The lower bound of the interval.
high (float) – The upper bound of the interval.
estimation_method (str) –
The method used to compute the interval depends on the analysis type.
With the 'bayesian' method, the randomization interval is obtained using a posterior distribution analysis based on a non-informative prior, the knowledge of the randomized output, and the scale of the randomization effect applied.
With the 'approximate' method, the randomization interval is estimated using a simplified simulation process.
Note
The 'approximate' estimation method tends to produce biased intervals for small data samples.
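Example
To illustrate the fields, a RandomizationInterval can be constructed directly; the values are illustrative, and in practice the object is produced by running an analysis with run(rich_result=True). The attribute access below assumes the constructor parameters are exposed as attributes:
>>> ri = RandomizationInterval(
...     estimation_method='bayesian', confidence_level=0.95,
...     low=980.0, high=1020.0)
>>> (ri.low, ri.high)
(980.0, 1020.0)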
-
class
leapyear.analytics.classes.
FractionalPrivacyExposure
¶ A fractional measure of privacy exposure for a collection of tables.
This is represented by a dictionary, mapping a TableIdentifier to the fraction of total privacy exposure that has been expended for the table corresponding to the TableIdentifier.
Aggregate Results¶
This section documents classes related to aggregate results from group by operations.
-
class
leapyear.analytics.classes.
GroupbyAgg
(aggregate_type: Sequence[_ml.GroupbyAgg], key_columns: Sequence[Attribute], aggs: Mapping[Tuple[Any, …], float])¶ A GroupBy Aggregate.
-
aggregate_type
: Sequence[leapyear._tidl.protocol.query.select.machinelearning.GroupbyAgg]¶ Alias for field number 0
-
key_columns
: Sequence[leapyear.dataset.attribute.Attribute]¶ Alias for field number 1
-
to_dataframe
(groups_as_index=True)¶ Convert to a pandas DataFrame.
- Parameters
groups_as_index (bool, optional) – Whether the groupBy columns should be the MultiIndex of the resulting DataFrame. If False, then the groupBy columns are made into columns of the output. By default True.
- Returns
A pandas DataFrame, with a multi-index corresponding to the key columns of the groupBy operation if groups_as_index=True.
- Return type
pd.DataFrame
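Example
A sketch of converting an aggregate result, assuming agg is a GroupbyAgg returned by a group-by analysis:
>>> df = agg.to_dataframe()  # group keys as a MultiIndex
>>> flat = agg.to_dataframe(groups_as_index=False)  # group keys as columns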
-
Module leapyear.model¶
LeapYear models.
Data objects generated from training or evaluating models used in machine learning.
Regression-Based Models¶
-
class
leapyear.model.
GLM
(affinity: bool, l1reg: float, l2reg: float, model: GeneralizedLinearModel)¶ A representation of a trained Generalized Linear Model (GLM).
Differentially private versions of GLMs are calibrated using various methods, e.g. leapyear.analytics.linreg(), and the variants of these methods that optimize model hyperparameters.
Objects of this class store parameters and structure of a regression model and can be used to generate predictions for regression and classification problems.
-
model
: leapyear._tidl.protocol.algorithms.generalizedlinearmodel.GeneralizedLinearModel¶ Alias for field number 3
-
property
coefficients
¶ Model coefficients, excluding intercepts.
- Return type
ndarray
-
property
intercepts
¶ Model intercepts, if any.
- Return type
ndarray
-
property
model_type
¶ Model type (e.g. linear, logistic).
- Return type
GeneralizedLinearModelType
-
decision_function
(xs)¶ Decision function of the generalized linear model.
Computes the height of the regression function (x·beta) at the provided points. This is a purely linear transformation of the input features.
In the case of a logistic model, the model ultimately classifies observations based on the sign of this decision function.
- Parameters
xs (
ndarray
) – a set of datapoints for which to predict- Returns
The predicted decision function
- Return type
np.ndarray
-
predict
(xs)¶ Prediction function of the generalized linear model.
For linear problems, returns the height of the regression line (decision function) at the data points provided.
For classification problems, returns the boolean classification choice, which is based on the sign of this decision function.
- Parameters
xs (
ndarray
) – a set of datapoints for which to predict- Returns
the predictions for the points according to the model
- Return type
np.ndarray
-
predict_proba
(xs)¶ Probabilities given by generalized linear model.
For logistic classification problems, returns probability that the model assigns to a positive response (True outcome variable) for each of the data points provided.
- Parameters
xs (
ndarray
) – array with input data- Returns
array of probability scores assigned by the model
- Return type
np.ndarray
-
predict_log_proba
(xs)¶ Logarithm of probabilities given by generalized linear model.
For logistic classification problems, returns natural logarithm of probability that the model assigns to a True outcome for each of the data points provided.
- Parameters
xs (
ndarray
) – array with input data- Returns
array of log-probability scores assigned by the model
- Return type
np.ndarray
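Example
A sketch applying a trained GLM to new records, following the la.logreg() usage shown in the SHAP example below; the feature count and input values are illustrative:
>>> import numpy as np
>>> glm_model = la.logreg(xs, y, ds).run()
>>> X_new = np.array([[0.5, -1.0, 2.0]])  # one record, three features
>>> glm_model.decision_function(X_new)  # height of the regression function
>>> glm_model.predict(X_new)  # boolean class choice for logistic models
>>> glm_model.predict_proba(X_new)  # probability of a True outcome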
-
classmethod
from_dict
(cls, d)¶ Convert from a dictionary.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
-
to_shap
()¶ Convert the trained model to SHAP format.
The converted model can then be used to construct a LinearExplainer object able to generate Shapley explanations for new records to which the model is applied.
Note that:
model execution and generation of model score explanations are expected to be done in a production setting by an automated system with direct access to record-level information.
feature explanations for categorical features are currently not supported. Consider one-hot encoding features to get the benefits of explainable model scores.
Examples
>>> import shap
>>> from leapyear import analytics as la
>>> ...
>>> glm_model = la.logreg(xs, y, ds).run()
>>> glm_explainer = shap.LinearExplainer(glm_model.to_shap(), X_reference)
>>> glm_shap_values = glm_explainer.shap_values(X_to_predict)
>>> ...
In this example:
The input X_reference used to initialize the explainer object is a pandas.DataFrame containing explanatory variables in the same order as used to train models. It is used to infer what model scores and feature distribution should be considered “typical”.
The input X_to_predict is a pandas.DataFrame capturing the explanatory variables in the same order as used to train models.
See the SHAP Linear documentation for more information.
LeapYear has been tested with SHAP version 0.39.0. Older or newer versions are not guaranteed to work.
- Return type
_ShapGLM
Tree-Based Models¶
-
class
leapyear.model.
RandomForestClassifier
(ntrees: int, height: int, model: DecisionForest)¶ A representation of a trained Random Forest classification model.
Provides methods for making predictions and reporting on feature importance statistics.
-
model
: leapyear._tidl.protocol.algorithms.decisiontree.DecisionForest¶ Alias for field number 2
-
predict
(xs)¶ Prediction function of the random forest classification model.
For classification problems, returns the most likely class according to the model.
- Parameters
xs (
ndarray
) – array with input data- Returns
array of most likely outcome labels assigned by the model
- Return type
np.ndarray
-
predict_proba
(xs)¶ Prediction probability function of the random forest model.
For each of the data points provided, returns probability that the model assigns to any given outcome.
- Parameters
xs (
ndarray
) – array with input data- Returns
array of probability scores assigned by the model to input data points and possible outcomes
- Return type
np.ndarray
-
predict_log_proba
(xs)¶ Logarithm of probabilities given by random forest model.
For each of the data points provided, returns natural logarithm of probability that the model assigns to any given outcome.
- Parameters
xs (
ndarray
) – array with input data- Returns
array of log-probability scores assigned by the model to input data points and possible outcomes
- Return type
np.ndarray
-
property
feature_importance
¶ Relative feature importance.
Feature importances are derived based on the information collected during model training with differentially private computations, specifically:
For each tree and for each split of the tree, look up the value (gain) of introducing the split, as calculated on training data during model calibration, and attribute it to the splitting feature. See leapyear.analytics.random_forest() for the specific calculation of split gain based on a notion of Gini impurity.
To compute tree-specific feature importances, sum up split gains across all splits within each tree, weighted (multiplied) by parent node size, and re-scale these tree-specific feature importances to sum to 1 for each tree.
Average feature importances across all trees in the random forest ensemble to get final feature importance.
- References:
Hastie, Tibshirani, Friedman. “The Elements of Statistical Learning, 2nd Edition.” 2001.
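Example
A sketch pairing importances with feature names; it assumes the importances are ordered to match the training features xs:
>>> rfc_model = la.random_forest(xs, y, ds).run()
>>> importances = rfc_model.feature_importance
>>> sorted(zip(xs, importances), key=lambda pair: -pair[1])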
-
classmethod
from_dict
(cls, d)¶ Convert from a dictionary.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
-
to_shap
()¶ Convert the trained model to SHAP format.
The converted model can then be used to construct a TreeExplainer object able to generate Shapley explanations for new records to which the model is applied.
Note that:
model execution and generation of model score explanations are expected to be done in a production setting by an automated system with direct access to record-level information.
feature explanations for categorical features are currently not supported. Consider one-hot encoding features to get the benefits of explainable model scores.
Examples
>>> import shap
>>> from leapyear import analytics as la
>>> ...
>>> rfc_model = la.random_forest(xs, y, ds).run()
>>> rfc_explainer = shap.TreeExplainer(rfc_model.to_shap(), X_reference)
>>> rfc_shap_values = rfc_explainer.shap_values(X_to_predict)
>>> ...
In this example:
The input X_reference used to initialize the explainer object is a pandas.DataFrame containing explanatory variables in the same order as used to train models. It is used to infer what model scores and feature distribution should be considered “typical”.
The input X_to_predict is a pandas.DataFrame capturing the explanatory variables in the same order as used to train models.
See the SHAP Tree documentation for more information.
LeapYear has been tested with SHAP version 0.39.0. Older or newer versions are not guaranteed to work.
- Return type
-
-
class
leapyear.model.
RandomForestRegressor
(ntrees: int, height: int, model: DecisionForest)¶ A representation of a trained Random Forest regression model.
-
model
: leapyear._tidl.protocol.algorithms.decisiontree.DecisionForest¶ Alias for field number 2
-
predict
(xs)¶ Prediction function of the random forest regression model.
For each of the data points provided, returns the prediction that the model assigns.
- Parameters
xs (
ndarray
) – array with input data- Returns
array of predictions assigned by the model to input data points
- Return type
np.ndarray
-
classmethod
from_dict
(cls, d)¶ Convert from a dictionary.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
-
to_shap
()¶ Convert the trained model to SHAP format.
The converted model can then be used to construct a TreeExplainer object able to generate Shapley explanations for new records to which the model is applied.
Note that:
model execution and generation of model score explanations are expected to be done in a production setting by an automated system with direct access to record-level information.
feature explanations for categorical features are currently not supported. Consider one-hot encoding features to get the benefits of explainable model scores.
Examples
>>> import shap
>>> from leapyear import analytics as la
>>> ...
>>> rf_model = la.regression_trees(xs, y, ds).run()
>>> rf_explainer = shap.TreeExplainer(rf_model.to_shap(), X_reference)
>>> rf_shap_values = rf_explainer.shap_values(X_to_predict)
>>> ...
In this example:
The input X_reference used to initialize the explainer object is a pandas.DataFrame containing explanatory variables in the same order as used to train models. It is used to infer what model scores and feature distribution should be considered “typical”.
The input X_to_predict is a pandas.DataFrame capturing the explanatory variables in the same order as used to train models.
See the SHAP Tree documentation for more information.
LeapYear has been tested with SHAP version 0.39.0. Older or newer versions are not guaranteed to work.
- Return type
-
-
class
leapyear.model.
GradientBoostedTreeClassifier
(max_depth: int, model: WeightedDecisionForest)¶ A representation of a trained gradient boosted tree classifier model.
This includes two named fields:
max_depth
- the maximum depth of the individual decision trees.model
- a model object of classWeightedDecisionForest
, including information about individual decision trees and their weights.
-
model
: leapyear._tidl.protocol.algorithms.decisiontree.WeightedDecisionForest¶ Alias for field number 1
-
predict
(xs)¶ Prediction function of the gradient boosted tree classification model.
For classification problems, returns the most likely class according to the model.
- Parameters
xs (
ndarray
) – array with input data- Returns
array of most likely outcome labels assigned by the model
- Return type
np.ndarray
-
predict_proba
(xs)¶ Prediction probability function of the GBT model.
For each of the data points provided, returns probability that the model assigns to any given outcome.
- Parameters
xs (
ndarray
) – array with input data- Returns
array of probability scores assigned by the model to input data points and possible outcomes
- Return type
np.ndarray
-
predict_log_proba
(xs)¶ Logarithm of probabilities given by GBT model.
For each of the data points provided, returns natural logarithm of probability that the model assigns to any given outcome.
- Parameters
xs (
ndarray
) – array with input data- Returns
array of log-probability scores assigned by the model to input data points and possible outcomes
- Return type
np.ndarray
-
classmethod
from_dict
(cls, d)¶ Convert from a dictionary.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
-
to_shap
()¶ Convert the trained model to SHAP format.
The converted model can then be used to construct a TreeExplainer object able to generate Shapley explanations for new records to which the model is applied.
Note that:
model execution and generation of model score explanations are expected to be done in a production setting by an automated system with direct access to record-level information.
feature explanations for categorical features are currently not supported. Consider one-hot encoding features to get the benefits of explainable model scores.
Examples
>>> import shap
>>> from leapyear import analytics as la
>>> ...
>>> gbt_model = la.gradient_boosted_tree_classifier(xs, y, ds).run()
>>> gbt_explainer = shap.TreeExplainer(gbt_model.to_shap(), X_reference)
>>> gbt_shap_values = gbt_explainer.shap_values(X_to_predict)
>>> ...
In this example:
The input X_reference used to initialize the explainer object is a pandas.DataFrame containing explanatory variables in the same order as used to train models. It is used to infer what model scores and feature distribution should be considered “typical”.
The input X_to_predict is a pandas.DataFrame capturing the explanatory variables in the same order as used to train models.
See the SHAP Tree documentation for more information.
LeapYear has been tested with SHAP version 0.39.0. Older or newer versions are not guaranteed to work.
- Return type
Clustering Models¶
-
class
leapyear.model.
ClusterModel
(niters: int, nclusters: int, model: _CM)¶ A representation of the trained K-means clustering model.
This model is generated by running a K-means clustering algorithm
leapyear.analytics.kmeans()
and contains cluster centroids (centers).-
model
: ClusterModel¶ Alias for field number 2
-
predict
(xs)¶ Prediction function of the clustering model.
Returns the labels for each point in xs.
- Parameters
xs (
ndarray
) – A 2-dimensional array of data points.- Returns
The associated cluster labels predicted by the model.
- Return type
np.ndarray
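Example
A sketch of predicting cluster labels, assuming model is a ClusterModel returned by running leapyear.analytics.kmeans(); the points are illustrative two-dimensional records:
>>> import numpy as np
>>> points = np.array([[0.0, 1.0], [5.0, -2.0]])
>>> labels = model.predict(points)  # one cluster label per row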
-
classmethod
from_dict
(cls, d)¶ Convert from a dictionary.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
-
Model Evaluation Objects¶
-
class
leapyear.model.
ConfusionCurve
(model: _CC)¶ The Confusion curve object.
This model is generated from running
leapyear.analytics.roc()
and contains the metrics of true positive, false positive, true negative and false negative rates for a sequence of thresholds. Other common metrics are provided as properties of this model.-
model
: leapyear._tidl.protocol.algorithms.confusioncurve.ConfusionCurve¶ Alias for field number 0
-
property
df
¶ Return a dataframe containing most of the analytics.
- Return type
DataFrame
-
property
thresholds
¶ Thresholds.
Outputs the list of thresholds used for generating confusion curve.
- Return type
ndarray
-
property
tpr
¶ Compute true positive rates.
Outputs a list of true positive rate (sensitivity, recall) values, associated with chosen thresholds.
Aliases:
tpr
(true positive rate),sensitivity
,recall
See also
- Return type
ndarray
-
property
sensitivity
¶ Compute true positive rates.
Outputs a list of true positive rate (sensitivity, recall) values, associated with chosen thresholds.
Aliases:
tpr
(true positive rate),sensitivity
,recall
See also
- Return type
ndarray
-
property
recall
¶ Compute true positive rates.
Outputs a list of true positive rate (sensitivity, recall) values, associated with chosen thresholds.
Aliases:
tpr
(true positive rate),sensitivity
,recall
See also
- Return type
ndarray
-
property
fpr
¶ Compute false positive rates.
Outputs a list of false positive rate (fallout) values, associated with chosen thresholds.
Aliases:
fpr
(false positive rate),fallout
See also
- Return type
ndarray
-
property
fallout
¶ Compute false positive rates.
Outputs a list of false positive rate (fallout) values, associated with chosen thresholds.
Aliases:
fpr
(false positive rate),fallout
See also
- Return type
ndarray
-
property
tnr
¶ Compute true negative rates.
Outputs a list of true negative rate (specificity) values, associated with chosen thresholds.
Aliases:
tnr
(true negative rate),specificity
See also
- Return type
ndarray
-
property
specificity
¶ Compute true negative rates.
Outputs a list of true negative rate (specificity) values, associated with chosen thresholds.
Aliases:
tnr
(true negative rate),specificity
See also
- Return type
ndarray
-
property
fnr
¶ Compute false negative rates.
Outputs a list of false negative rate (miss rate) values, associated with chosen thresholds.
Aliases:
fnr
(false negative rate),missrate
See also
- Return type
ndarray
-
property
missrate
¶ Compute false negative rates.
Outputs a list of false negative rate (miss rate) values, associated with chosen thresholds.
Aliases:
fnr
(false negative rate),missrate
See also
- Return type
ndarray
-
property
precision
¶ Compute precision.
Aliases:
precision
,ppv
(positive predictive value)See also
- Return type
ndarray
-
property
ppv
¶ Compute precision.
Aliases:
precision
,ppv
(positive predictive value)See also
- Return type
ndarray
-
property
npv
¶ Negative predictive value.
- Return type
ndarray
-
property
accuracy
¶ Accuracy.
- Return type
ndarray
-
property
f1score
¶ F1-score.
- Return type
ndarray
-
property
mcc
¶ Matthews correlation coefficient.
- Return type
ndarray
-
property
auc_roc
¶ Area under the ROC curve.
Calculates the area under Receiver Operating Characteristic (ROC) curve.
- Return type
ndarray
-
property
auc_pr
¶ Area under the Precision-Recall curve.
- Return type
ndarray
-
property
gmeasure
¶ G-measure: the geometric mean of precision and recall.
- Return type
ndarray
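Example
A sketch of reading common metrics off a confusion curve, assuming curve is the result of running leapyear.analytics.roc():
>>> curve.df.head()  # threshold-by-threshold metrics as a DataFrame
>>> curve.auc_roc  # area under the ROC curve
>>> list(zip(curve.thresholds, curve.tpr, curve.fpr))[:3]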
-
Module leapyear.exceptions¶
All errors and exceptions.
-
exception
leapyear.exceptions.
ClientError
¶ General client errors.
-
static
from_response
(response, errors=None)¶ Create a ClientError from a response.
- Return type
-
class
leapyear.exceptions.
ErrorInfo
(message: str, details: Optional[str] = None)¶ Information about a particular error.
-
exception
leapyear.exceptions.
APIError
(message, *, response=None, errors=None)¶ Error from the API Server.
-
exception
leapyear.exceptions.
DataSetTooSmallException
(**kwargs)¶ Re-throw of DataSetTooSmall exception.
This exception is triggered and interrupts the computation when the LeapYear system is unable to guarantee a result that meets the target accuracy while staying within the maximum Privacy Exposure allowed for the computation.
The primary cause for this exception is that the data sample being analyzed is too small, and therefore providing an accurate result would reveal too much information about a single record.
-
exception
leapyear.exceptions.
DataSetSizeBlockedByPrivacyProfileException
(**kwargs)¶ Re-throw of DataSetSizeBlockedByProfile exception.
This exception is triggered and interrupts the computation when the LeapYear system detects that the data set size is smaller than the minimum size required by the privacy profile.
The primary cause for this exception is that the data sample being analyzed is too small, and therefore providing an accurate result would reveal too much information about a single record.
-
exception
leapyear.exceptions.
HighPrivacyExposureException
(**kwargs)¶ Re-throw of HighPrivacyExposure exception.
This exception occurs when it is estimated that the Privacy Exposure will reach or exceed the Privacy Exposure Limit for that particular computation. This could be due to either:
the data sample is too small and LeapYear estimates that a large privacy exposure will be incurred, or
the computation involves a relation that generates a large sensitivity multiplier.
The precise reason for the exception is not made explicit because of potential privacy concerns.
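Example
A sketch of guarding an analysis against small-sample failures; the dataset ds and the la alias for leapyear.analytics are assumed as in earlier examples:
>>> from leapyear.exceptions import (
...     DataSetTooSmallException, HighPrivacyExposureException)
>>> try:
...     result = la.count_rows(ds).run(allow_max_budget_allocation=False)
... except DataSetTooSmallException:
...     print('data sample too small for the target accuracy')
... except HighPrivacyExposureException:
...     print('estimated privacy exposure too high')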
-
exception
leapyear.exceptions.
SensitivityMultiplierTooLargeException
(logbase10_sensitivity_multiplier, **kwargs)¶ Re-throw of SensitivityMultiplierTooLarge exception.
This exception is triggered and interrupts a computation when the LeapYear system estimates that the computation:
will reach or exceed the maximum Privacy Exposure allowed for the computation; AND
is based on a derived dataset that generates an excessively large sensitivity multiplier.
-
exception
leapyear.exceptions.
InvalidURL
¶ URL cannot be parsed.
-
exception
leapyear.exceptions.
TLSError
¶ Errors from TLS.
-
exception
leapyear.exceptions.
InvalidJson
¶ JSON was not constructed correctly.
-
exception
leapyear.exceptions.
ConnectionError
¶ Error establishing connection.
-
exception
leapyear.exceptions.
ServerVersionMismatch
(server_version=None)¶ Server and client have different versions.
-
exception
leapyear.exceptions.
TokenExpiredError
¶ Authentication token has expired.
-
exception
leapyear.exceptions.
AsyncTimeoutError
¶ Async job timed out.
-
exception
leapyear.exceptions.
AsyncCancelledError
¶ Async job was cancelled.
-
exception
leapyear.exceptions.
GroupbyAggTooManyKeysError
¶ Too many keys for a GroupbyAgg operation.
-
exception
leapyear.exceptions.
LoadUnsupportedModelException
¶ This exception is raised when trying to load or save an unsupported type of model.
-
exception
leapyear.exceptions.
LoadModelMismatchException
(actual_type, expected_type)¶ This exception is raised when there is a mismatch between the loaded and expected model types.
-
exception
leapyear.exceptions.
LoadModelVersionException
(model, name_file)¶ This exception is raised when loading a model with an unsupported version.
-
exception
leapyear.exceptions.
SaveUnsupportedModelException
¶ This exception is raised when trying to save an unsupported model.
-
exception
leapyear.exceptions.
PublicKeyCredentialsError
(message)¶ Exceptions originating from problems with public key auth credentials.
Module leapyear.ext¶
Quick client¶
-
leapyear.ext.user.
client
(config_file=PosixPath('~/.leapyear_client.ini'), debug=False, **kwargs)¶ Use environment variables or a configuration file to quickly connect to LeapYear.
This function uses values found in environment variables, a config file, keyword arguments, and built-in defaults to try to establish a connection to a LeapYear server. In order of precedence, values are taken from the kwargs of this function, then environment variables, then the config file, and finally default values, if they exist.
By default, the config file is ~/.leapyear_client.ini. The config file requires a [leapyear.io] section where values can be found.
environment variable                     ini key/keyword argument              default value
--------------------------------------   -----------------------------------   ------------------------
LY_URL                                   url                                   'http://localhost:4401'
LY_USERNAME                              username                              None
LY_PASSWORD                              password                              None
LY_DEFAULT_ANALYSIS_CACHING              default_analysis_caching              True
LY_DEFAULT_ALLOW_MAX_BUDGET_ALLOCATION   default_allow_max_budget_allocation   True
LY_LOGGING_LEVEL                         logging_level                         'NOTSET'
At least username and password must be supplied to establish a connection to the LeapYear server. logging_level should be the name of a logging level in logging.
Example
Contents of ~/.leapyear_client.ini are:
[leapyear.io]
username = alice
password = lihjAgsd324$
url = http://api.leapyear.domain.com:4401
Next, we execute a basic test with debug=True to see the values that are passed to the Client constructor.
>>> from leapyear.ext.user import client
>>> import logging
>>> logging.basicConfig()
>>> c = client(debug=True)
DEBUG:leapyear.ext.user:Found config file with [leapyear.io] section.
DEBUG:leapyear.ext.user:Resolved the following values:
DEBUG:leapyear.ext.user:    url <- <str: 'http://api.leapyear.domain.com:4401'>
DEBUG:leapyear.ext.user:    username <- <str: 'alice'>
DEBUG:leapyear.ext.user:    password <- <str: 'lihjAgsd324$'>
DEBUG:leapyear.ext.user:    default_analysis_caching <- <bool: True>
DEBUG:leapyear.ext.user:    default_allow_max_budget_allocation <- <bool: True>
DEBUG:leapyear.ext.user:    logging_level <- <int: 0>
>>> print(c.connected)
True
>>> c.close()
- Parameters
config_file (pathlib.Path) – Specify an alternate config file.
debug (bool) – Set to True to enable extra debugging information. basicConfig() may be useful to run so that debugging information will print to stdout.
- Return type