Getting Started¶

Connecting to LeapYear and Exploring¶

The first step to using LeapYear’s data security platform for analysis is getting connected. To get started, we’ll import the Client object from the leapyear python library and connect to the LeapYear server using our user credentials.

Credentials used for this tutorial:

>>> url = 'http://localhost:{}'.format(os.environ.get('LY_PORT', 4401))
>>> username = 'tutorial_user'
>>> password = 'abcdefghiXYZ1!'

Import the Client object:

>>> from leapyear import Client

Create a connection:

>>> client = Client(url, username, password)
>>> client.connected
True
>>> client.close()
>>> client.connected
False

Alternatively, Client is also a context manager, so the connection is automatically closed at the end of a with block:

>>> with Client(url, username, password) as client:
...     # carry out computations with connection to LeapYear
...     client.connected
True
>>> client.connected
False

Databases, Tables and Columns¶

Once we’ve obtained a connection to LeapYear, we can look through the databases and tables that are available for data analysis:

>>> client = Client(url, username, password)

Examine databases available to the user:

>>> client.databases.keys()
dict_keys(['tutorial'])
>>> tutorial_db = client.databases['tutorial']
>>> tutorial_db
<Database tutorial>

Examine tables within the database tutorial:

>>> sorted(tutorial_db.tables.keys())  
['classification',
 'regression1',
 'regression2',
 'twoclass']
>>> example1 = tutorial_db.tables['regression1']
>>> example1
<Table tutorial.regression1>

Examine the columns on table tutorial_db.regression1:

>>> example1.columns  
{'x0': <TableColumn tutorial.regression1.x0: type='REAL' bounds=(-4.0, 4.0)     nullable=False>,
 'x1': <TableColumn tutorial.regression1.x1: type='REAL' bounds=(-4.0, 4.0)     nullable=False>,
 'x2': <TableColumn tutorial.regression1.x2: type='REAL' bounds=(-4.0, 4.0)     nullable=False>,
 'y':  <TableColumn tutorial.regression1.y:  type='REAL' bounds=(-400.0, 400.0) nullable=False>}

Column Types¶

TableColumn objects include their type, bounds, and nullability.

>>> col_x0 = example1.columns['x0']
>>> col_x0.type
<ColumnType.REAL: 'REAL'>
>>> col_x0.bounds
(-4.0, 4.0)
>>> col_x0.nullable
False

The possible types are: BOOL, INT, REAL, FACTOR, DATE, TEXT, and DATETIME.

INT, REAL, DATE, and DATETIME have publicly available bounds, representing the lower and upper limits of the data in the column. FACTOR also has bounds, representing the set of strings available in the column. BOOL and TEXT columns have no bounds.

The DataSet Class¶

Once we’ve established a connection to the LeapYear server using the Client class, we can import the DataSet to access and analyze tables.

>>> from leapyear import DataSet

We can access tables, either using the client interface as above:

>>> ds_example1 = DataSet.from_table(example1)

or by directly referencing the table by name:

>>> ds_example1 = DataSet.from_table('tutorial.regression1')

The DataSet class is the primary way of interacting with data in the LeapYear system. A DataSet is associated with collection of Attributes, which can be used to compute statistics. The DataSet class allows the user to manipulate and analyze the attributes of a data source using a variety of relational operations such as column selection, row selection based on conditions, unions, joins, etc.

An instance of the Attribute class represents either an individual named column in the DataSet or a transformation of one or several of such columns via supported operations. Attributes also have types, which can be inspected the same as the types in a DataSet schema. Attributes can be manipulated using most built in Python operations, such as +, *, and abs.

>>> ds_example1.schema  
Schema([('x0', AttributeType(name='REAL', nullable=False, domain=(-4, 4))),
        ('x1', AttributeType(name='REAL', nullable=False, domain=(-4, 4))),
        ('x2', AttributeType(name='REAL', nullable=False, domain=(-4, 4))),
        ('y',  AttributeType(name='REAL', nullable=False, domain=(-400, 400)))])
>>> ds_example1.schema['x0']
AttributeType(name='REAL', nullable=False, domain=(-4, 4))
>>> ds_example1.schema['x0'].name
'REAL'
>>> ds_example1.schema['x0'].nullable
False
>>> ds_example1.schema['x0'].domain
(-4.0, 4.0)
>>> attr_x0 = ds_example1['x0']
>>> attr_x0
<Attribute: x0>
>>> attr_x0 + 4
<Attribute: x0 + 4>
>>> attr_x0.type
AttributeType(name='REAL', nullable=False, domain=(-4, 4))
>>> attr_x0.type.name
'REAL'
>>> attr_x0.type.nullable
False
>>> attr_x0.type.domain
(-4.0, 4.0)

In the following example, we’ll take a few attributes from the table tutorial.regression1, adding one to the x1 attribute and multiplying x2 by three. The bounds are altered to reflect the change.

>>> ds1 = ds_example1.map_attributes(
...     {'x1': lambda att: att + 1.0, 'x2': lambda att: att * 3.0}
... )
>>> ds1.schema  
Schema([('x0', AttributeType(name='REAL', nullable=False, domain=(-4, 4))),
        ('x1', AttributeType(name='REAL', nullable=False, domain=(-3, 5))),
        ('x2', AttributeType(name='REAL', nullable=False, domain=(-12, 12))),
        ('y',  AttributeType(name='REAL', nullable=False, domain=(-400, 400)))])

We can use DataSet to filter the data to examine subsets of the data, e.g. by applying predicates to the data:

>>> ds2 = ds_example1.where(ds_example1['x1'] > 1)
>>> ds2.schema  
Schema([('x0', AttributeType(name='REAL', nullable=False, domain=(-4, 4))),
        ('x1', AttributeType(name='REAL', nullable=False, domain=(1, 4))),
        ('x2', AttributeType(name='REAL', nullable=False, domain=(-4, 4))),
        ('y',  AttributeType(name='REAL', nullable=False, domain=(-400, 400)))])

Data Analysis¶

Statistics¶

The LeapYear system is designed to allow access to various statistical functions and develop machine learning models based on data in DataSet. The analytics function is not executed until the run() method is called on it. This allows inspection of the overall workflow and early reporting of errors. All analysis functions are located in the leapyear.analytics module.

>>> import leapyear.analytics as analytics

Many common statistics functions are available including:

Next is an example of obtaining simple statistics from the dataset:

>>> mean_analysis = analytics.mean('x0', ds_example1)
>>> mean_analysis.run()
0.039159280186637294
>>> variance_analysis = analytics.variance('x0', ds_example1)
>>> variance_analysis.run()
1.0477940098374177
>>> quantile_analysis = analytics.quantile(0.25, 'x0', ds_example1)
>>> quantile_analysis.run()
-0.6575000000000001

By combining statistics with the ability to transform and filter data, we can look at various statistics associated to subsets of the data:

>>> analytics.mean('x0', ds_example1).run()
0.039159280186637294
>>> ds2 = ds_example1.where(ds_example1['x1'] > 1)
>>> analytics.mean('x0', ds2).run()
0.14454229785771325

Machine Learning¶

The leapyear.analytics module also supports various machine learning (ML) models, including

regression-based models (linear, logistic, generalized),
tree-based models (random forests for classification and regression tasks),
unsupervised models (e.g. K-means, PCA),
the ability do optimize model hyperparemeters via search with cross-validation, and
the ability to evaluate model performance based on a variety of common validation metrics.

In this section we will share some examples of the machine learning tools provided by the LeapYear system.

The Effect of L2 Regularization on Model Coefficients¶

The following example code shows a common theoretical result from ML: as the L2 regularization parameter alpha increases, we see the coefficients of the model gradually approach zero. This is depicted in the graph generated below:

>>> n_alphas = 20
>>> alphas = np.logspace(-2,2, n_alphas)
>>>
>>> # example3 has 0 and 1 in the y column. Here, we convert 1 to True and 0 to False
>>> ds_example3 = DataSet\
...    .from_table('tutorial.classification')\
...    .map_attribute('y', lambda att: att.decode({1: True}).coalesce(False))
>>>
>>> models = []
>>> for alpha in alphas:
...     model = analytics.generalized_logreg(
...         ['x0','x1','x2','x3','x4','x5','x6','x7','x8','x9'],
...         'y',
...         ds_example3,
...         affine=False,
...         l1reg=0.001,
...         l2reg=alpha
...     ).run()
...     models.append(model)
>>>
>>> coefs = np.array([np.append(m.coefficients, m.intercept) for m in models]).reshape((n_alphas,11))

Plotting the coefficients with respect to alpha values:

>>> import matplotlib.pyplot as plt
>>> plt.figure()
>>> plt.plot(alphas, coefs)
>>> plt.xscale('log')
>>> plt.xlabel('alpha')
>>> plt.ylabel('weights')
>>> plt.title('coefficients as a function of the regularization')
>>> plt.axis('tight')
>>> plt.show()

Training a Simple Logistic Regression Model¶

This example shows how to compute a logistic regression classifier and evaluate it’s performance using the receiver operating characteristic (ROC) curve.

>>> ds_train = ds_example3.split(0, [80, 20])
>>> ds_test = ds_example3.split(1, [80, 20])
>>> glm = analytics.generalized_logreg(['x1'], 'y', ds_train, affine=True, l1reg=0, l2reg=0.01).run()
>>> cc = analytics.roc(glm, ['x1'], 'y', ds_test, thresholds=32).run()

Plot the ROC and display the area under the ROC:

>>> plt.figure()
>>> plt.plot(cc.fpr, cc.tpr, label='ROC curve (area = %0.2f)' % cc.auc_roc)
>>> plt.plot([0, 1], [0, 1], 'k--')
>>> plt.xlabel('False Positive Rate')
>>> plt.ylabel('True Positive Rate')
>>> plt.title('Receiver operating characteristic example')
>>> plt.legend(loc="lower right")
>>> plt.show()

Training a Random Forest¶

In this example we train a random forest classifier on a binary classification problem associated to two overlapping gaussian distributions centered at (0,0) and (3,3). Points around (0,0) are labeled as in the negative class while points around (3,3) are labeled as in the positive class.

>>> ds_example4 = DataSet.from_table('tutorial.twoclass')
>>> rf = analytics.random_forest(['x1', 'x2'], 'y', ds_example4, 100, 1).run()

>>> plot_colors = "br"
>>> plot_step = 0.1
>>>
>>> x_min, x_max = 1.5-8, 1.5+8
>>> y_min, y_max = 1.5-8, 1.5+8
>>> xx, yy = np.meshgrid(
...     np.arange(x_min, x_max, plot_step),
...     np.arange(y_min, y_max, plot_step)
... )
>>> Z = rf.predict(np.c_[xx.ravel(), yy.ravel()])
>>> Z = Z.reshape(xx.shape)

Plot the decision boundary:

>>> fig, ax = plt.subplots()
>>> plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
>>> # Draw circles centered at the gaussian distributions
>>> ax.add_artist(plt.Circle((0,0), 1.5, color='k', fill=False))
>>> ax.add_artist(plt.Circle((3,3), 1.5, color='k', fill=False))
>>> ax.text(3, 3, '+')
>>> ax.text(0, 0, '-')
>>> plt.xlabel('x')
>>> plt.ylabel('y')
>>> plt.title('Decision Boundary')

This concludes the user tutorial section, so the connection should be closed.

>>> client.close()
>>> client.connected
False

Management and Administration¶

Administration tasks use the Client class from the leapyear module and admin classes from the leapyear.admin. These admin classes include:

These classes provide API’s for various administrator tasks on the LeapYear system. All of the examples in the administrative examples section will require correct permissions.

Managing the LeapYear Server¶

Management requires sufficient privileges. The examples below assume the lyadmin user is an administrator of the LeapYear deployment system.

>>> client = Client(url, 'lyadmin', ROOT_PASSWORD)
>>> client.connected
True

User Management¶

User objects are used as the primary API for managing users. Below is an example of a user being created, their password updated, and finally their account is disabled.

>>> # Create the user
>>> user = User('new_user', password)
>>> client.create(user)
>>> 'new_user' in client.users
True
>>>
>>> # Update the user's password
>>> new_password = '{}100'.format(password)
>>> user.update(password=new_password)
<User new_user>
>>>
>>> # Disable the user
>>> user.enabled
True
>>>
>>> user.enabled = False
>>> user.enabled
False

Database Management¶

Database objects are used to view and manipulate databases on the server.

>>> # create database
>>> client.create(Database('sales'))
>>>
>>> # retrieve a reference to the database
>>> sales_database = client.databases['sales']
>>>
>>> # drop database
>>> client.drop(sales_database)

Table Management¶

Table objects are used to view and manipulate tables in a database on the server. Below is an example of how to define a data source (table) object on the LeapYear server.

>>> credentials = 'hdfs:///path/to/data.parquet'
>>>
>>> # create a table
>>> accounts = Database('accounts')
>>> table = Table('users', credentials=credentials, database=accounts)
>>>
>>> client.create(accounts)
>>> client.create(table)
>>>
>>> # retrieve a reference to the table
>>> users_table = accounts.tables['users']
>>>
>>> # drop a table
>>> client.drop(users_table)