Module leapyear.dataset¶
DataSet and Attribute.
DataSets combine a data source (Table) with data transformations on which subsequent computations can be performed. A data set has a schema describing the types of the attributes (columns) in the data source. Attributes can be manipulated as if they were built-in Python types (int, float, bool, …). DataSet also provides the following lazy methods for transforming data:
project()
: select one or more attributes (columns) from the data source.
where()
: select rows that satisfy a filter condition.
union()
: combine two data sets with matching schemas.
split()
: create subsets whose size is a fraction of the total data size.
splits()
: yield all fractional partitions of the data set.
stratified()
: create subsets of the data with fixed attribute prevalence.
group_by()
: create aggregated views of the data set.
join()
: combine rows of two datasets where certain data elements match.
transform()
: apply a linear transformation to the specified data elements.
Further details for each transformation can be found in its respective documentation below. Transformations are lazy in the sense that they are not evaluated until a computation is executed; however, each transformation requires the DataSet schema to be re-evaluated, which relies on a live connection to the LeapYear server.
Computations and machine learning analytics that process the data set are found in leapyear.analytics.
The examples below rely on a connection to the LeapYear server, which can be established as follows:
>>> from leapyear import Client
>>> client = Client(url='http://ly-server:4401', username='admin', password='password')
Examples
Load a dataset and examine its schema:
>>> pds = DataSet.from_table('db.table')
>>> pds.schema
OrderedDict([
('attr1', Type(tag=BOOL, contents=())),
('attr2', Type(tag=INT, contents=(0, 20))),
('attr3', Type(tag=REAL, contents=(-1, 1))),
])
>>> pds.attributes
['attr1', 'attr2', 'attr3']
Create a new attribute and find its type:
>>> pds['attr2_gt_10'] = pds['attr2'] > 10
>>> pds.attributes
['attr1', 'attr2', 'attr3', 'attr2_gt_10']
>>> pds.schema['attr2_gt_10']
Type(tag=BOOL, contents=())
Select instances using a predicate:
>>> pds_positive_attr3 = pds.where(pds['attr3'] > 0)
>>> pds_positive_attr3.schema['attr3']
Type(tag=REAL, contents=(0, 1))
Calculate the mean of a single attribute:
>>> import leapyear.analytics as analytics
>>> mean_analysis = analytics.mean(pds_positive_attr3['attr3'])
>>> mean_analysis
computation: MEAN(attr3 > 0)
attributes:
attr3: db.table.attr3 (0 <= x <= 1)
>>> mean_analysis.run() # run the computation on the LeapYear server.
0.02
DataSet class¶
-
class leapyear.dataset.DataSet(relation)¶ DataSet object.
-
property
relation
¶ Get the DataSet relation.
- Return type
Relation
-
property
schema
¶ Get the DataSet schema, including data types, values allowed.
- Return type
Schema
-
classmethod
from_view
(view)¶ Create a dataset from a view.
-
classmethod
from_table
(table, *, slices=None, all_slices=False)¶ Create a dataset from a table.
- Parameters
table – The table or the name of the table on the LeapYear server.
slices – A list of ranges of table slices to use.
all_slices – Set to True to use all slices in the Table at the time this is run. Note that if a new slice has been added, this will create a DataSet with the new slice, causing analyses to miss the analysis cache when rerunning.
- Returns
The dataset with a LeapYear table as its source.
- Return type
-
get_attribute
(key)¶ Select one attribute from the data set.
-
drop_attributes
(keys)¶ Drop Attributes.
-
drop_attribute
(key)¶ Drop an attribute.
-
with_attributes
(name_attrs)¶ Return a new DataSet with additional attributes.
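Example
An illustrative sketch (the attribute name and arithmetic expression are hypothetical):
>>> ds2 = ds1.with_attributes({'col1_doubled': ds1['col1'] * 2})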
-
with_attribute
(name, attr)¶ Return a new DataSet with an additional attribute.
-
with_attributes_renamed
(rename)¶ Rename attributes using a mapping.
- Parameters
rename (Mapping[str, str]) – Dictionary mapping old names to new names.
- Returns
New DataSet with renamed attributes.
- Return type
Examples
1. Create a new dataset ds2 by renaming the columns ‘zip_code’ and ‘name’ to ‘ZipCode’ and ‘Name’ respectively. Similarly, rename attributes in another dataset ds3. Finally, these datasets can be merged using union:
>>> ds2 = ds1.with_attributes_renamed({'zip_code':'ZipCode','name':'Name'})
>>> ds4 = ds3.with_attributes_renamed({'zipcode':'ZipCode','name_str':'Name'})
>>> ds4 = ds4.union(ds2)
-
with_attribute_renamed
(old_name, new_name)¶ Rename an attribute.
-
map_attributes
(name_lambdas)¶ Map attributes and create a new DataSet.
-
map_attribute
(name, func)¶ Map an attribute and create a new DataSet.
-
project
(new_keys)¶ Select multiple attributes from the data set.
Projection (π): Creates a new DataSet with only selected attributes from this DataSet.
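Example
An illustrative sketch, assuming a DataSet ds with attributes 'col1', 'col2' and 'col3':
>>> ds_small = ds.project(['col1', 'col2'])
>>> ds_small.attributes
['col1', 'col2']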
-
select
(*attrs)¶ Create a new DataSet by selecting column names or Attributes.
-
select_as
(mapping)¶ Create a new DataSet by mapping from column names or Attributes to new names.
-
where
(clause)¶ Select/filter rows using a filter expression.
Selection (σ): LeapYear’s where clause creates a new DataSet based on the filter condition. Its schema may include a smaller domain of possible values.
- Parameters
clause (Attribute) – A filter Attribute: a single attribute of type BOOL.
- Returns
A filtered DataSet.
- Return type
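Example
An illustrative sketch, assuming an integer attribute 'col1':
>>> ds_filtered = ds.where(ds['col1'] > 5)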
-
union
(other, distinct=False)¶ Union or concatenate datasets.
Union (∪): Concatenates data sets with matching schema.
- Parameters
other (DataSet) – The dataset to union with. Note: the schema of other must match that of self, including the order of the attributes. The order of attributes can be aligned like so:
>>> ds2 = ds2[list(ds1.schema.keys())]
>>> ds_union = ds1.union(ds2)
distinct (bool) – Remove duplicate rows.
- Returns
A combined dataset.
- Return type
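Example
An illustrative sketch, assuming ds1 and ds2 have matching schemas:
>>> ds_all = ds1.union(ds2)
>>> ds_unique = ds1.union(ds2, distinct=True)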
-
join
(other, on, right_on=None, join_type='inner', unique_left_keys=False, unique_right_keys=False, left_suffix='', right_suffix='', **kwargs)¶ Combine two DataSets by joining using a key, as in SQL JOIN statement.
If right_k (or left_k) is specified, the right (or left) data set will include no more than k rows for each matching row in the left (or right) data set.
- Parameters
other (DataSet) – The other DataSet to join with.
on (Union[str, List[str]]) – The key(s) to join on. If using suffixes, the suffix must NOT be appended by the caller.
right_on (Union[str, List[str], None]) – The key(s) on the right table, if different than those in on. None if both tables have the same key names. If using suffixes, the suffix must NOT be appended by the caller.
join_type (str) – LeapYear supports the joins listed in the left column of the table below. Any value of join_type from the right column will use that particular join. If no value is specified, an inner join will be used. A left_outer_public join can be run only on a right public table.

Join                 join_type
Inner                "inner" (default)
Outer                "outer", "full", "full_outer", or "fullouter"
Left                 "left", "left_outer", or "leftouter"
Right                "right", "right_outer", or "rightouter"
Left Antijoin        "left_anti" or "leftanti"
Left Semijoin        "left_semi" or "leftsemi"
Left Outer Public    "left_outer_public"

unique_left_keys (bool) – If the left DataSet is known to have unique keys, setting this to True will run an optimized join algorithm. Warning: setting this to True if the keys are not unique will cause data loss!
unique_right_keys (bool) – If the right DataSet is known to have unique keys, setting this to True will run an optimized join algorithm. Warning: setting this to True if the keys are not unique will cause some data rows to not show up in the output DataSet!
left_suffix (str) – Optional suffix to append to the column names of the left DataSet.
right_suffix (str) – Optional suffix to append to the column names of the right DataSet.
cache (kwargs) – Optional StorageLevel for persisting intermediate datasets for performance enhancement. The meaning of these values is documented in the Spark RDD programming guide.
- Returns
The joined DataSet.
- Return type
Examples
Joining two datasets on a common key:
>>> ds1 = example_ds1.project(['key', 'col1'])
>>> ds2 = example_ds2.project(['key', 'col2'])
>>> ds = ds1.join(ds2, 'key')
>>> list(ds.schema.keys())
['key', 'col1', 'col2']
Joining two datasets on a single key but with a different name on the right:
>>> ds1 = example_ds1.project(['key1', 'col1'])
>>> ds2 = example_ds2.project(['key2', 'col2'])
>>> ds = ds1.join(ds2, 'key1', right_on='key2')
>>> list(ds.schema.keys())
['key1', 'col1', 'key2', 'col2']
Joining when a column is duplicated is an error:
>>> ds1 = example_ds1.project(['key', 'col1', 'col3'])
>>> ds2 = example_ds2.project(['key', 'col2', 'col3'])
>>> ds = ds1.join(ds2, 'key')
APIError: Invalid schema: repeated aliases: col3
Joining when the key is missing from one relation fails, because in this case a right key has to be specified:
>>> ds1 = example_ds1.project(['key1', 'col1'])
>>> ds2 = example_ds2.project(['key2', 'col2'])
>>> ds = ds1.join(ds2, 'key1')
APIError: Error parsing scope, missing variable declaration for `key1`
Joining with multiple keys:
>>> ds1 = example_ds1.project(['key1_1', 'key2_1', 'col1'])
>>> ds2 = example_ds2.project(['key1_2', 'key2_2', 'col2'])
>>> ds = ds1.join(ds2, ['key1_1', 'key2_1'], right_on=['key1_2', 'key2_2'])
>>> list(ds.schema.keys())
['key1_1', 'key2_1', 'col1', 'key1_2', 'key2_2', 'col2']
Joining with a different number of keys results in an error:
>>> ds1 = example_ds1.project(['key1_1', 'key2_1', 'col1'])
>>> ds2 = example_ds2.project(['key1_2', 'key2_2', 'col2'])
>>> ds = ds1.join(ds2, ['key1_1', 'key2_1'], right_on='key1_2')
APIError: Invalid schema: join key length mismatch ...
Joining while specifying that the keys are unique:
>>> ds1 = example_ds1.project(['key1', 'col1'])
>>> ds2 = example_ds2.project(['key1', 'col2'])
>>> ds = ds1.join(ds2, 'key1', unique_left_keys=True, unique_right_keys=True)
>>> list(ds.schema.keys())
['key1', 'col1', 'col2']
Different join types:
>>> ds1 = example_ds1.project(['key1', 'col1'])
>>> ds2 = example_ds2.project(['key1', 'col2'])
>>> ds = ds1.join(ds2, 'key1', join_type='outer')
>>> list(ds.schema.keys())
['key1', 'col1', 'col2']
Left semi join:
>>> ds1 = example_ds1.project(['key1', 'col1'])
>>> ds2 = example_ds2.project(['key1', 'col2'])
>>> ds = ds1.join(ds2, 'key1', join_type='left_semi')
>>> list(ds.schema.keys())
['key1', 'col1']
Joining when the nullability of keys is different:
>>> ds1 = example_ds1.select([col('key1').decode({0: 0}).alias('nkey1'), 'col1'])
>>> ds2 = example_ds2.project(['key2', 'col2'])
>>> ds = ds1.join(ds2, 'nkey1', right_on='key2')
>>> list(ds.schema.keys())
['nkey1', 'col1', 'key2', 'col2']
Joining when keys have different but coercible types:
>>> realKey1 = col('intKey1').as_real().alias('realKey1')
>>> ds1 = example_ds1.select([realKey1, 'col1'])
>>> ds2 = example_ds2.project(['intKey2', 'col2'])
>>> ds = ds1.join(ds2, 'realKey1', right_on='intKey2')
>>> list(ds.schema.keys())
['intKey1', 'col1', 'intKey2', 'col2']
Joining when keys are factors (upcasted to common type):
>>> keyFactor1 = col('key1').decode({k: 'A' for k in range(10)}).as_factor().alias('keyFactor1')
>>> keyFactor2 = col('key2').decode({k: 'B' for k in range(10)}).as_factor().alias('keyFactor2')
>>> ds1 = example_ds1.select([keyFactor1, 'col1'])
>>> ds2 = example_ds2.select([keyFactor2, 'col2'])
>>> ds = ds1.join(ds2, 'keyFactor1', right_on='keyFactor2')
>>> list(ds.schema.keys())
['key1', 'col1', 'key2', 'col2']
Joining when the keys have mismatched types is an error (e.g. factor and bool):
>>> keyFactor1 = col('key1').as_factor().alias('keyFactor1')
>>> ds1 = example_ds1.select([keyFactor1, 'col1'])
>>> ds2 = example_ds2.project(['key2', 'col2'])
>>> ds = ds1.join(ds2, 'keyFactor1', right_on='key2')
APIError: Invalid schema: join column type mismatch ...
Joining with suffixes to disambiguate column names:
>>> ds1 = example_ds1.project(['key', 'col1'])
>>> ds2 = example_ds2.project(['key', 'col2'])
>>> ds = ds1.join(ds2, 'key', left_suffix='_l', right_suffix='_r')
>>> list(ds.schema.keys())
['key_l', 'col1_l', 'key_r', 'col2_r']
Joining with only left suffixes to disambiguate column names:
>>> ds1 = example_ds1.project(['key', 'col1'])
>>> ds2 = example_ds2.project(['key', 'col2'])
>>> ds = ds1.join(ds2, 'key', left_suffix='_l')
>>> list(ds.schema.keys())
['key_l', 'col1_l', 'key', 'col2']
Joining with only right suffixes to disambiguate column names:
>>> ds1 = example_ds1.project(['key', 'col1'])
>>> ds2 = example_ds2.project(['key', 'col2'])
>>> ds = ds1.join(ds2, 'key', right_suffix='_r')
>>> list(ds.schema.keys())
['key', 'col1', 'key_r', 'col2_r']
Joining with a specific cache level for intermediate caches:
>>> ds = ds1.join(ds2, on="key", cache=StorageLevel.DISK_ONLY, n_partitions=1)
- Unsupported Backends
Conditional support for the following LeapYear compute backend(s): snowflake — All functionality available except for the ‘cache’ keyword arg.
-
prepare_join
(join_on, k, n_partitions)¶ Prepare DataSet for join.
- Parameters
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
DataSet
-
unpersist_join_cache
(other, on, right_on=None, join_type='inner', unique_left_keys=False, unique_right_keys=False, left_suffix='', right_suffix='', **kwargs)¶ Unpersist intermediate caches used by .join().
NOTE: use the same parameters as the join call to unpersist.
- Parameters
other (DataSet) – The other DataSet to join with.
on (Union[str, List[str]]) – The key(s) to join on. If using suffixes, the suffix must NOT be appended by the caller.
right_on (Union[str, List[str], None]) – The key(s) on the right table, if different than those in on. None if both tables have the same key names. If using suffixes, the suffix must NOT be appended by the caller.
join_type (str) – LeapYear supports the joins listed in the left column of the table below. Any value of join_type from the right column will use that particular join. If no value is specified, an inner join will be used. A left_outer_public join can be run only on a right public table.

Join                 join_type
Inner                "inner" (default)
Outer                "outer", "full", "full_outer", or "fullouter"
Left                 "left", "left_outer", or "leftouter"
Right                "right", "right_outer", or "rightouter"
Left Antijoin        "left_anti" or "leftanti"
Left Semijoin        "left_semi" or "leftsemi"
Left Outer Public    "left_outer_public"

unique_left_keys (bool) – If the left DataSet is known to have unique keys, setting this to True will run an optimized join algorithm. Warning: setting this to True if the keys are not unique will cause data loss!
unique_right_keys (bool) – If the right DataSet is known to have unique keys, setting this to True will run an optimized join algorithm. Warning: setting this to True if the keys are not unique will cause some data rows to not show up in the output DataSet!
left_suffix (str) – Optional suffix to append to the column names of the left DataSet.
right_suffix (str) – Optional suffix to append to the column names of the right DataSet.
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
- Return type
DataSet
-
group_by
(*grouping)¶ Aggregate data by one or more categorical columns.
- Parameters
grouping (Union[str, Attribute]) – The attribute or attributes to group by, separated by commas.
- Returns
A new GroupedData object with groupings as specified. It can be used with the agg function to create a DataSet with derived aggregate attributes; see examples below.
- Return type
Examples
1. Group by multiple columns (‘col1’ and ‘col2’) in Dataset ds and aggregate ‘col3’ and ‘col4’:
>>> ds_group = ds.group_by('col1', 'col2').agg((['col3'], 'count'), (['col4'], 'count'))
2. Group by a single column ‘col1’ in the Dataset and compute aggregates of ‘col3’ and ‘col4’:
>>> ds_group = ds.group_by('col1').agg((['col3'], 'max'), (['col4'], 'mean'))
-
split
(index, proportions, complement=False)¶ Split and select one partition of the dataset.
Selection (σ): Splits are specified by proportions.
- Parameters
index (int) – The split number.
proportions (List[int]) – The proportions to split the dataset by, represented as a list of integers. For example, [1,1,2] will split into 3 datasets with 1/4, 1/4 and 1/2 of the data in each, respectively.
complement (bool) – If True, returns a DataSet which is the complement of the split (e.g. all rows not in the split). Default: False.
- Returns
The ith-partition of the dataset.
- Return type
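Example
An illustrative sketch: take an 80/20 split of ds1 and its complement (assuming split indices start at 0):
>>> ds_train = ds1.split(0, [8, 2])
>>> ds_holdout = ds1.split(0, [8, 2], complement=True)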
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
splits
(proportions)¶ Split the dataset and return an iterator over the resulting partitions.
Selection (σ): Splits are specified by proportions.
- Parameters
proportions (List[int]) – The proportions to split the dataset by, represented as a list of integers. For example, [1,1,2] will split into 3 datasets with 1/4, 1/4 and 1/2 of the data in each, respectively.
- Returns
Iterator over the DataSet objects representing partitions.
- Return type
Iterator[DataSet]
Examples
Create 80/20 split of a Dataset ds1:
>>> traintest, holdout = ds1.splits((8,2))
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
stratified_split
(index, proportions, stratified, complement=False)¶ Split by stratifying a categorical attribute.
Selection (σ): Each split will contain approximately the same proportion of values from each category.
As an example, for Boolean stratification, each split will contain the same proportion of True/False values.
- Parameters
index (int) – The split number.
proportions (List[int]) – The proportions to split the dataset by, represented as a list of integers. For example, [1,1,2] will split into 3 datasets with 1/4, 1/4 and 1/2 of the data in each, respectively.
stratified (str) – The column to stratify against. Must be Boolean or Factor. Must not be nullable.
complement (bool) – If True, returns a DataSet which is the complement of the split (e.g. all rows not in the split). Default: False.
- Returns
The ith-partition of the dataset.
- Return type
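Example
An illustrative sketch, assuming a non-nullable Boolean attribute 'label' and 0-based split indices:
>>> ds_train = ds1.stratified_split(0, [8, 2], 'label')
>>> ds_test = ds1.stratified_split(1, [8, 2], 'label')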
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
stratified_splits
(proportions, stratified)¶ Split by stratifying a categorical attribute.
Selection (σ). For boolean stratification, each split will maintain the same proportion of True/False values.
- Parameters
proportions (List[int]) – The proportions to split the dataset by, represented as a list of integers. For example, [1,1,2] will split into 3 datasets with 1/4, 1/4 and 1/2 of the data in each, respectively.
stratified (str) – The column to stratify against. Must be Boolean or Factor. Must not be nullable.
- Returns
Iterator over the DataSet objects representing partitions.
- Return type
Iterable[DataSet]
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
kfold
(n_folds=3)¶ Split the dataset into train/test pairs using k-fold strategy.
Each fold can then be used as a validation set once while the k-1 remaining folds form the training set. If the dataset has size N, each (train, test) pair will be sized N*(k-1)/k and N/k respectively.
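Example
An illustrative sketch of 3-fold cross-validation; the analysis run on each fold is arbitrary:
>>> import leapyear.analytics as analytics
>>> for train, test in ds1.kfold(n_folds=3):
...     analytics.count_rows(train).run()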
-
stratified_kfold
(stratified, n_folds=3)¶ Split the dataset into train/test pairs using k-fold stratified splits.
Each fold can then be used as a validation set once while the k-1 remaining folds form the training set. If the dataset has size N, each (train, test) pair will be sized N*(k-1)/k and N/k respectively.
- Parameters
- Returns
The iterator over the k pairs of (train, test) partitions.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
rows_async
(*, limit=None)¶ Retrieve rows.
If the user has permission to do so, the function returns a generator of OrderedDict objects representing the attribute names and values from each row. The generator requires a connection to the server over its entire lifetime.
- Parameters
- Returns
The iterator over the rows of the input dataset, each row being represented as an OrderedDict object mapping attribute names to their values in this row.
- Return type
Iterator[Mapping[str, Value]]
-
rows
(limit=None, max_timeout_sec=None)¶ Retrieve rows.
Same as rows_async(), except waits on the job.
-
rows_pandas
(limit=None)¶ Retrieve rows as a pandas DataFrame.
-
head_pandas
(n=10)¶ Retrieve rows, if the user has permission to do so; see rows_pandas().
- Parameters
n (int) – Maximum number of rows to output.
- Returns
The rows of the input dataset.
- Return type
DataFrame
-
example_rows
()¶ Retrieve 10 rows of example data from the DataSet.
The returned data is based only on the public metadata. The generated data does not depend on the true data at all and should not be used for data inference.
The function requires an active connection to the LeapYear server.
Does not support TEXT attributes - consider dropping them using drop_attributes() before running example_rows().
- Returns
The iterator over the rows of the generated dataset; each row is represented as an OrderedDict object mapping attribute names to their generated values.
- Return type
Iterator[Mapping[str, Any]]
Example
>>> pds = DataSet.from_table('db.table')
>>> pds.example_rows()
To turn the data into a pandas DataFrame, use
>>> import pandas
>>> df = pandas.DataFrame.from_dict(pds.example_rows())
Alternatively, use example_rows_pandas().
-
example_rows_pandas
()¶ Retrieve 10 rows of example data from the DataSet. See example_rows().
- Returns
The example rows.
- Return type
DataFrame
-
transform
(attrs, transformation, name)¶ Apply a linear transformation to a list of attributes based on a matrix.
- Parameters
attrs (List[Attribute]) – Expressions to use as input to the matrix multiplication, in order.
transformation (List[List[float]]) – Matrix to use to define the linear transformation via matrix multiplication - e.g. the output of leapyear.analytics.pca().
name (str) – Common prefix for the name of the attributes to be created.
- Returns
DataSet with transformed attributes appended.
- Return type
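Example
An illustrative sketch using a hand-written 2x2 matrix (attribute names are hypothetical):
>>> matrix = [[1.0, 0.0], [0.0, 2.0]]
>>> ds2 = ds1.transform([ds1['col1'], ds1['col2']], matrix, 'proj')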
-
sample
(fraction, *, with_replacement=False, seed=None)¶ Return a sampled subset of the rows.
- Parameters
- Returns
DataSet containing sampled subset of the rows.
- Return type
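Example
An illustrative sketch: draw a 10% sample without replacement, using a fixed seed:
>>> ds_sample = ds1.sample(0.1, seed=42)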
-
distinct
()¶ Return a DataSet that contains only the unique rows from this Dataset.
- Return type
DataSet
-
drop_duplicates
(subset=None)¶ Return a DataSet with duplicates in the provided columns dropped.
If all columns are named or subset is None, this is equivalent to distinct().
Errors when subset is an empty list.
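Example
An illustrative sketch (attribute names are hypothetical):
>>> ds_dedup = ds1.drop_duplicates(['col1'])  # keep one row per value of 'col1'
>>> ds_distinct = ds1.drop_duplicates()       # equivalent to distinct()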
-
except_
(ds)¶ Return a DataSet of rows in this DataSet but not in another DataSet.
- Parameters
ds (DataSet) – DataSet to compare with.
- Returns
DataSet containing rows in this DataSet but not in another DataSet.
- Return type
-
difference
(ds)¶ Return a DataSet of rows in this DataSet but not in another DataSet.
- Parameters
ds (DataSet) – DataSet to compare with.
- Returns
DataSet containing rows in this DataSet but not in another DataSet.
- Return type
-
symmetric_difference
(ds)¶ Return the symmetric difference of the two DataSets.
- Parameters
ds (DataSet) – DataSet to compare with.
- Returns
DataSet containing rows that are not in the intersection of the two DataSets.
- Return type
-
intersect
(ds)¶ Return intersection of the two DataSets.
- Parameters
ds (DataSet) – DataSet to compare with.
- Returns
DataSet containing rows that are in the intersection of the two DataSets.
- Return type
-
order_by
(*attrs)¶ Order DataSet by the given expressions.
-
limit
(n)¶ Limit the number of rows.
-
cache
(storageLevel=StorageLevel.MEMORY_AND_DISK)¶ Cache a Dataset on disk on the server-side.
- Parameters
storageLevel (StorageLevel) – StorageLevel for persisting datasets. The meaning of these values is documented in the Spark RDD programming guide.
- Returns
DataSet, which indicates to the system to lazily cache the DataSet
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
unpersist
()¶ Immediately begins unpersisting the DataSet on the server-side.
- Returns
- Return type
None
Example
Build a cache and unpersist it
>>> la.count_rows(ds.cache()).run()
>>> la.mean(ds["foo"]).run()  # this will hit the cache
>>> ds.unpersist()
>>> la.sum(ds["foo"]).run()  # this will no longer hit the cache
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
repartition
(numPartitions, *attrs)¶ Repartition a Dataset by hashing the columns.
- Returns
DataSet partitioned according to the specified parameters.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
sortWithinPartitions
(*attrs)¶ Sort Dataset rows by the provided columns within partitions, not globally.
- Returns
DataSet sorted within partitions.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
replace
(col, replacement)¶ Replace values matching keys in replacement map.
- Parameters
- Returns
New Dataset where specified values in a given column are replaced according to the replacement map.
- Return type
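Example
An illustrative sketch: replace the value 0 with 1 in column 'col1' (the column name and the form of the replacement map are hypothetical):
>>> ds2 = ds1.replace('col1', {0: 1})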
-
fill
(col, value)¶ Fill missing values in a given column.
- Parameters
- Returns
A new DataSet where NULL values in a given column are replaced with the specified expression.
- Return type
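Example
An illustrative sketch: fill missing values in 'col1' with 0 (the column name is hypothetical):
>>> ds2 = ds1.fill('col1', 0)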
-
drop
(*cols, how='any')¶ Drop rows where specified columns contain NULL values.
- Parameters
- Returns
A new DataSet with rows dropped where the specified columns contain NULL values (any or all, depending on the value of the how parameter).
- Return type
-
join_pandas
(from_col, dataframe, key_col, value_col)¶ Join a column of pandas data with the data set.
See join_data for extended details.
- Parameters
from_col (Union[Attribute, str]) – An expression or attribute name in the data set to join the data to. The type of this column should match the keys in the mapping.
dataframe (DataFrame) – The pandas DataFrame that contains values to map to in this data set. The DataFrame cannot be empty and has a limit of 100,000 rows.
key_col (str) – The name of the pandas column to obtain the keys to match from_col. If duplicated keys occur, only one key from the data set is used in the join.
value_col (str) – The name of the pandas column to obtain the values of the mapping.
- Returns
The original data set with a new column containing analyst-supplied values.
- Return type
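Example
An illustrative sketch, mirroring the join_data example below: map 'Sex' values to their first letters via a small pandas DataFrame:
>>> import pandas as pd
>>> mapping_df = pd.DataFrame({'Sex': ['male', 'female'], 'first_letter': ['m', 'f']})
>>> ds2 = ds.join_pandas('Sex', mapping_df, 'Sex', 'first_letter')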
-
join_data
(from_col, new_col, mapping)¶ Join key-value data to the data set.
Analyst-supplied data can be added to an existing sensitive data set without a loss to privacy. This is a replacement for the decode expression when there are many keys (>100).
Values matching the keys of the dictionary mapping are replaced by the associated values. Values that do not match any of the keys are replaced by NULL. Keys and values may be python literals supported by the LeapYear client, or other Attributes.
Note
If the combination of keys assures there should be no NULL values, the client will not automatically convert the result of join_data to a non-nullable type. The user must use coalesce to remove NULL from the domain of possible values.
- Parameters
from_col (Union[Attribute, str]) – An expression or attribute name in the data set to join the data to. The type of this column should match the keys in the mapping.
new_col (str) – The name of the new attribute that is added to the returned data set with the same type as the mapping values.
mapping (Mapping[Union[float, bool, int, str, datetime, date], Union[float, bool, int, str, datetime, date]]) – Python dictionary of key-value pairs to add to the data set. The keys should be unique. The mapping cannot be empty and has a limit of 100,000 keys.
- Returns
The original data set with a new column containing analyst-supplied values.
- Return type
Example
Suppose that we have a table with a Sex column containing the values male and female:

Sex     Age
male    22
female  38
female  26
female  35
male    35
We’d like to encode the abbreviations (coming from the first letter) as a new column, coming from a pandas DataFrame of the following form:

Sex     first_letter
male    m
female  f

To do so, we call this function, assuming the LeapYear DataSet is called ds, and we wish to call the new column sex_first_letter:

new_ds = ds.join_data("Sex", "sex_first_letter", {"female": "f", "male": "m"})
This will produce a DataSet which looks like:

Sex     Age   sex_first_letter
male    22    m
female  38    f
female  26    f
female  35    f
male    35    m
-
predict
(model, attrs=None, name='predict')¶ Return a DataSet with a prediction column for the model.
- Parameters
model (Union[ClusterModel, GLM, RandomForestClassifier, RandomForestRegressor, GradientBoostedTreeClassifier]) – The model to predict outcomes with.
attrs (Optional[List[Union[Attribute, str]]]) – The attributes to use in the transformation, or None if all the attributes should be used.
name (str) – Name or common prefix of the attribute(s) to be created.
- Returns
The original dataset with the prediction column(s) appended.
- Return type
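Example
An illustrative sketch, assuming model is a previously trained model of one of the supported types (e.g. a GLM):
>>> ds_scored = ds1.predict(model, name='score')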
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
-
predict_proba
(model, attrs=None, name='proba')¶ Return a DataSet with prediction probability columns for the model.
- Parameters
model (Union[GLM, RandomForestClassifier, GradientBoostedTreeClassifier]) – The model to predict outcomes with.
attrs (Optional[List[Union[Attribute, str]]]) – The attributes to use in the transformation, or None if all the attributes should be used.
name (str) – Name or common prefix of the attribute(s) to be created.
- Returns
The original dataset with the prediction probability column(s) appended.
- Return type
- Unsupported Backends
Not supported for the following LeapYear compute backend(s): snowflake.
Attribute class¶
-
class leapyear.dataset.Attribute(expr, relation, *, ordering=OrderSpec(osDirection=SortDirection.Asc, osNullOrdering=None), name=None)¶ An attribute of a Dataset.
This exists so that transformations can be performed on single attributes of the dataset. For example, the attribute height might be measured in meters; however, centimeters might be more appropriate, so an intermediate Attribute can be extracted from the DataSet and manipulated.
All attribute manipulations are lazy; they are not evaluated until they are needed to perform an analysis on the LeapYear server.
-
property
type
¶ Get the data type of this attribute.
- Return type
AttributeType
-
property
expression
¶ Get the Attribute’s expression.
- Return type
Expression
-
property
ordering
¶ Get the ordering of this attribute.
- Return type
OrderSpec
-
alias
(name)¶ Associate an alias with an attribute.
- Parameters
name (str) – The name to associate with an attribute.
- Return type
Attribute
-
is_not
()¶ Return the boolean inverse of the attribute.
- Return type
Attribute
-
sign
()¶ Return the sign of each element (1, 0 or -1).
- Return type
Attribute
-
floor
()¶ Apply the floor transformation to the attribute.
Each attribute value is transformed to the largest integer less than or equal to the attribute value.
- Return type
Attribute
-
ceil
()¶ Apply the ceiling transformation to the attribute.
Each attribute value is transformed to the smallest integer greater than or equal to the attribute value.
- Return type
Attribute
-
exp
()¶ Apply exponent transformation to an attribute.
- Return type
Attribute
-
expm1
()¶ Apply exponent transformation to an attribute and subtract one.
- Return type
Attribute
-
sqrt
()¶ Apply square root transformation to an attribute.
- Return type
Attribute
-
log
()¶ Apply natural logarithm transformation to an attribute.
- Return type
Attribute
-
log1p
()¶ Apply natural logarithm transformation to an attribute and add 1.
- Return type
Attribute
-
sigmoid
()¶ Apply sigmoid transformation to an attribute.
- Return type
Attribute
-
replace
(mapping)¶ Replace specified values with the new values or attribute expressions.
Values matching the keys of the mapping are replaced by the associated values. Values that do not match any of the keys are kept.
Keys and values may be python literals supported by the LeapYear client, or other Attributes.
- Parameters
mapping (Mapping[Union[float, bool, int, str, datetime, date], Union[Attribute, Expression, float, bool, int, str, datetime, date]]) – A mapping from values of this attribute’s type (T) to another set of values and a different type (U). U may be a python literal type (int, bool, datetime, …) or another Attribute.
- Returns
The converted attribute.
- Return type
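Example
An illustrative sketch: replace the value 'N/A' with 'Unknown', keeping all other values (attribute names are hypothetical):
>>> ds2 = ds1.with_attributes({'col1_clean': ds1['col1'].replace({'N/A': 'Unknown'})})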
-
property
microsecond
¶ Return the microseconds part of a temporal type.
- Return type
Attribute
-
property
second
¶ Return the seconds part of a temporal type.
- Return type
Attribute
-
property
minute
¶ Return the minutes part of a temporal type.
- Return type
Attribute
-
property
hour
¶ Return the hours part of a temporal type.
- Return type
Attribute
-
property
day
¶ Return the days part of a temporal type.
- Return type
Attribute
-
property
month
¶ Return the months part of a temporal type.
- Return type
Attribute
-
property
year
¶ Return the years part of a temporal type.
- Return type
Attribute
-
greatest
(attrs)¶ Take the elementwise maximum.
Return NULL if and only if all attribute values are NULL.
- Return type
Attribute
-
least
(attrs)¶ Take the elementwise minimum.
Return NULL if and only if all entries are NULL.
- Return type
Attribute
-
coalesce
(fallthrough, attrs=None)¶ Convert all NULL values of an attribute to a value.
This function will extend the type of the attribute if necessary and drop the nullable tag from the attribute data type.
- Parameters
- Returns
The non-nullable attribute with extended type range (if necessary).
- Return type
Examples
1. Use coalesce to replace missing values with ‘1’ and create a new attribute ‘col1_trn’:
>>> ds2 = ds1.with_attributes({'col1_trn':ds1['col1'].coalesce('1')})
-
isnull
()¶ Boolean check whether an attribute value is NULL.
Return a boolean attribute which resolves to True whenever the underlying attribute value is NULL.
- Return type
Attribute
-
decode
(mapping)¶ Map specified values to the new values or attribute expressions.
Values matching the keys of the mapping are replaced by the associated values.
Values that do not match any of the keys are replaced by NULL.
Keys and values may be python literals supported by the LeapYear client, or other Attributes.
If the combination of keys assures there should be no NULL values, the client will not automatically convert the result of decode to a non-nullable type. The user must use coalesce to remove NULL from the domain of possible values.
- Parameters
mapping (Mapping[Any, Any]) – A mapping from values of this attribute’s type (T) to another set of values and a different type (U). U may be a python literal type (int, bool, datetime, …) or another Attribute.
- Returns
The converted attribute.
- Return type
Examples
1. Create a new column ‘col1_trn’ that is based on values in ‘col1’. If the value matches ‘pattern’, we assign the same string; otherwise we assign ‘Other’ using the decode function:
>>> import leapyear.functions as f
>>> some_func = lambda x: (x != 'pattern').decode({True:'Other', False:'pattern'})
>>> ds2 = ds1.with_attributes({'col1_trn': some_func(f.col('col1'))})
2. Create a new column ‘col1_trn’ that is based on values in ‘col1’. If the value == 0 then ‘col1_trn’ takes the value 0. Otherwise, it takes the value of ‘col2’:
>>> import leapyear.functions as f
>>> some_func = lambda x: (x==0).decode({True:0,False:f.col('col2')})
>>> ds2 = ds1.with_attributes({'col1_trn': some_func(f.col('col1'))})
3. Create a new column ‘col1_trn’ that is based on values in ‘col1’ and ‘col2’. Based on an expression involving ‘col1’ and ‘col2’, ‘col1_trn’ takes the value of ‘col1’ or ‘col2’:
>>> import leapyear.functions as f
>>> def some_func(x,y): return ((x==0)&(y!=0)).decode({True:f.col('col1'),False:f.col('col2')})
>>> ds2 = ds1.with_attributes({'col1_trn': some_func(f.col('col1'),f.col('col2'))})
-
is_in
(options)¶ Boolean check whether an attribute value is in a list.
Return a boolean attribute which resolves to True whenever the underlying attribute matches one of the list entries.
- Return type
Attribute
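Example
An illustrative sketch: filter to rows whose 'state' value is one of the listed options (the attribute name is hypothetical):
>>> ds2 = ds1.where(ds1['state'].is_in(['CA', 'NY']))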
-
text_to_bool
()¶ Convert the attribute to a boolean.
Strings that match (case insensitively) “1”, “y”, “yes”, “t”, or “true” are converted to True; strings that match (case insensitively) “0”, “n”, “no”, “f”, or “false” are converted to False. Other strings are treated as NULL.
- Return type
Attribute
-
text_to_real
(lb, ub)¶ Convert a text attribute to a real-valued attribute.
- Parameters
- Return type
Attribute
-
text_to_factor
(distinct_vals_list)¶ Convert a text attribute to a factor attribute.
-
text_to_int
(lb, ub)¶ Convert a text attribute to an integer-valued attribute.
NOTE: integers close to MAX_INT64 or MIN_INT64 are not represented precisely.
-
as_real
()¶ Convert the attribute to a real.
- Return type
Attribute
-
as_factor
()¶ Represent this attribute as a factor attribute.
Converts the attribute of type INT or BOOL to FACTOR.
Note: For more complex conversions, consider decode().
Examples
Convert an attribute ‘col1’ in ds1 to factor:
>>> ds2 = ds1.with_attributes({'col1_fac':ds1['col1'].as_factor()})
- Return type
Attribute
-
asc
()¶ Sort ascending.
Causes the attribute to be sorted in ascending order when the sorting order is applied to the dataset.
Examples
Order a dataset by multiple cols and drop duplicates:
>>> ds2 = ds1.order_by(ds1['col1'].asc(),ds1['col2'].asc(),ds1['col3'].asc())
>>> ds2 = ds2.drop_duplicates(['col2'])
- Return type
Attribute
-
asc_nulls_first
()¶ Sort ascending, missing values first.
Causes the attribute to be sorted in ascending order when the sorting order is applied to the dataset, with
NULL
values first.- Return type
Attribute
-
asc_nulls_last
()¶ Sort ascending, missing values last.
Causes the attribute to be sorted in ascending order when the sorting order is applied to the dataset, with
NULL
values last.- Return type
Attribute
-
desc
()¶ Sort descending.
Causes the attribute to be sorted in descending order when the sorting order is applied to the dataset.
- Return type
Attribute
-
desc_nulls_first
()¶ Sort descending, missing values first.
Causes the attribute to be sorted in descending order when the sorting order is applied to the dataset, with
NULL
values first.- Return type
Attribute
-
desc_nulls_last
()¶ Sort descending, missing values last.
Causes the attribute to be sorted in descending order when the sorting order is applied to the dataset, with
NULL
values last.- Return type
Attribute
-
class leapyear.dataset.AttributeType(*args, **kwargs)¶ The type of an Attribute.
An AttributeType is a read-only object returned by the server, and should never be constructed directly.
Aliases¶
-
leapyear.dataset.attribute.AttributeLike(*args, **kwargs)¶
Attribute-like objects are those that can be readily converted to an attribute.
These include existing attributes, dates, strings, integers, floats and expressions based on these objects.
Grouping and Windowing classes¶
-
class leapyear.dataset.GroupedData(grouping, rel)¶ The result of a DataSet.group_by().
Run agg() to get a DataSet.
-
agg
(*attr_aggs, max_count=None, **kwargs)¶ Specify the aggregations to perform on the GroupedData.
- Parameters
attr_aggs (Tuple[List[Union[Attribute, str]], Union[str, Aggregation]]) – A list of pairs. The second element of the pair is the aggregation to perform; the first is a list of columns on which to perform it. This is a list and not just a single attribute to support nullary and binary aggregations.
Available aggregations:
min
max
and
or
count_distinct
approx_count_distinct
count
mean
stddev
stddev_samp
stddev_pop
variance
var_samp
var_pop
skewness
kurtosis
sum
sum_distinct
covar_samp
covar_pop
corr
max_count – If a count aggregate is requested, this parameter will be used as an upper bound in the schema of the derived aggregate count attribute(s) of the grouped DataSet. If not supplied, the upper bound will be inferred from the data before returning the resulting DataSet object.
Note
When this parameter is set too high, differentially private computations on the derived aggregate attribute will include a higher randomization effect. When it is set too low, all counts higher than max_count will be replaced by max_count.
kwargs (Any) – All keyword arguments are passed to the function run when the parameter max_count has to be inferred from the data. This includes max_timeout_sec, which defaults to 300 seconds.
- Returns
A DataSet object, containing aggregation results.
- Return type
Example
To compute correlation of height and weight as well as the count, run:
>>> gd.agg((['height','weight'], 'corr'), ([], 'count'))
-
-
class leapyear.dataset.Window¶ Utility functions for defining WindowSpec.
Examples
>>> # ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
>>> window = Window.order_by("date").rows_between(
>>>     Window.unbounded_preceding, Window.current_row,
>>> )
>>> # PARTITION BY country ORDER BY date ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
>>> window = Window.order_by("date").partition_by("country").rows_between(-3, 3)
-
classmethod
partition_by
(*cols)¶ Create a WindowSpec with the partitioning defined.
-
classmethod
order_by
(*cols)¶ Create a WindowSpec with the ordering defined.
-
classmethod
rows_between
(start, end)¶ Create a WindowSpec with the frame boundaries defined.
Create a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive). Both start and end are relative positions from the current row. For example, “0” means “current row”, while “-1” means the row before the current row, and “5” means the fifth row after the current row. We recommend users use Window.unboundedPreceding, Window.unboundedFollowing, and Window.currentRow to specify special boundary values, rather than using integral values directly.
Note: windows of 1000 rows or more are not currently supported for expressions other than lead and lag.
- Parameters
start (int) – boundary start, inclusive. The frame is unbounded if this is Window.unboundedPreceding, or any value less than or equal to -9223372036854775808.
end (int) – boundary end, inclusive. The frame is unbounded if this is Window.unboundedFollowing, or any value greater than or equal to 9223372036854775807.
- Return type
WindowSpec
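Example
An illustrative sketch: a trailing window over the current row and the two preceding rows, ordered by "date":
>>> window = Window.order_by("date").rows_between(-2, Window.current_row)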
-
classmethod