Dataset

The Dataset entity represents a feature labeling transformation applied to a Datasource before model training begins. Put differently, a Dataset is a translated version of the raw data, presented in a format that Firefly.ai can read: the same data as in the raw CSV file, expressed as a list of numerical, categorical and date/time features, alongside their respective feature roles (target, block-id, etc.).

The ‘Dataset’ API includes creating a Dataset from a Datasource and querying existing Datasets (Get, List, Preview and Delete).
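
As a quick orientation, a minimal sketch of the entry point: authenticate once, then call the Dataset classmethods directly. The credentials are placeholders, and the exact signature of fireflyai.authenticate() is assumed here rather than documented on this page.

    import fireflyai

    # Assumed authentication call; replace the placeholder credentials with your own.
    fireflyai.authenticate(username="user@example.com", password="secret")

    # List existing Datasets; they appear as nested dictionaries under 'hits'.
    datasets = fireflyai.Dataset.list()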

class fireflyai.resources.dataset.Dataset[source]
classmethod create(datasource_id: int, dataset_name: str, target: str, problem_type: fireflyai.enums.ProblemType, header: bool = True, na_values: List[str] = None, retype_columns: Dict[str, fireflyai.enums.FeatureType] = None, rename_columns: List[str] = None, datetime_format: str = None, time_axis: str = None, block_id: List[str] = None, sample_id: List[str] = None, subdataset_id: List[str] = None, sample_weight: List[str] = None, not_used: List[str] = None, hidden: List[str] = False, wait: bool = False, skip_if_exists: bool = False, api_key: str = None) → fireflyai.firefly_response.FireflyResponse[source]

Creates and prepares a Dataset.

When a Dataset is created, the feature roles are labeled and the feature types can be set by the user. The data is then analyzed in order to optimize model training and the search process.

Parameters:
  • datasource_id (int) – Datasource ID.
  • dataset_name (str) – The name of the Dataset.
  • target (str) – The name of the target feature, or its column index if header=False.
  • problem_type (ProblemType) – The problem type.
  • header (bool) – Whether the file includes a header row.
  • na_values (Optional[List[str]]) – List of user-specified null values.
  • retype_columns (Dict[str, FeatureType]) – Change the types of certain columns.
  • rename_columns (Optional[List[str]]) – ??? #TODO
  • datetime_format (Optional[str]) – The datetime format used in the data.
  • time_axis (Optional[str]) – In timeseries problems, the feature that is the time axis.
  • block_id (Optional[List[str]]) – To avoid data leakage, the data can be split into blocks. Rows with the same block_id must all be in either the train set or the test set. Requires at least 50 unique values in the data.
  • sample_id (Optional[List[str]]) – Row identifier.
  • subdataset_id (Optional[List[str]]) – Features which specify a subdataset ID in the data.
  • sample_weight (Optional[List[str]]) – ??? #TODO
  • not_used (Optional[List[str]]) – List of features to ignore.
  • hidden (Optional[List[str]]) – List of features to mark as hidden.
  • wait (Optional[bool]) – Whether the call should block until the Dataset is ready.
  • skip_if_exists (Optional[bool]) – If True, skip creation when a Dataset with the same name already exists.
  • api_key (Optional[str]) – Explicit api_key, not required if fireflyai.authenticate() was run prior.
Returns:

Dataset ID if successful and wait=False, or the full Dataset if successful and wait=True; raises FireflyError otherwise.

Return type:

FireflyResponse
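
A minimal sketch of a create call, assuming a Datasource with ID 42 already exists; the column names, null markers and enum members below are illustrative placeholders, not values taken from this page.

    import fireflyai
    from fireflyai.enums import ProblemType, FeatureType

    dataset = fireflyai.Dataset.create(
        datasource_id=42,                         # placeholder Datasource ID
        dataset_name="churn_dataset",
        target="churned",                         # target column by name (header=True)
        problem_type=ProblemType.CLASSIFICATION,  # assumed enum member
        na_values=["N/A", "missing"],             # strings to treat as nulls
        retype_columns={"signup_date": FeatureType.DATETIME},  # assumed enum member
        not_used=["internal_id"],                 # column to ignore
        wait=True,                                # block until the Dataset is prepared
        skip_if_exists=True,
    )

With wait=True the response holds the prepared Dataset; with the default wait=False it holds only the Dataset ID, so a follow-up get() is needed to check the Dataset's state.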

classmethod delete(id: int, api_key: str = None) → fireflyai.firefly_response.FireflyResponse[source]

Deletes a specific Dataset.

Parameters:
  • id (int) – Dataset ID.
  • api_key (Optional[str]) – Explicit api_key, not required if fireflyai.authenticate() was run prior.
Returns:

“true” if deleted successfully; raises FireflyClientError otherwise.

Return type:

FireflyResponse
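
For example, a one-off deletion with a placeholder ID:

    import fireflyai

    # Deletes Dataset 101; the response wraps "true" on success.
    fireflyai.Dataset.delete(id=101)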

classmethod get(id: int, api_key: str = None) → fireflyai.firefly_response.FireflyResponse[source]

Gets information on a specific Dataset.

Information includes the state of the Dataset and other attributes.

Parameters:
  • id (int) – Dataset ID.
  • api_key (Optional[str]) – Explicit api_key, not required if fireflyai.authenticate was run prior.
Returns:

Information about the Dataset.

Return type:

FireflyResponse
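
A sketch of fetching a Dataset and inspecting its state; the dict-style access and the 'state' field name are assumptions about the FireflyResponse contents, not guarantees.

    import fireflyai

    resp = fireflyai.Dataset.get(id=101)  # placeholder Dataset ID
    print(resp["state"])                  # assumed field holding the Dataset's state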

classmethod get_available_estimators(id: int, inter_level: fireflyai.enums.InterpretabilityLevel = None, api_key: str = None) → fireflyai.firefly_response.FireflyResponse[source]

Gets possible Estimators for a specific Dataset.

Parameters:
  • id (int) – Dataset ID.
  • inter_level (Optional[InterpretabilityLevel]) – Interpretability level.
  • api_key (Optional[str]) – Explicit api_key, not required if fireflyai.authenticate was run prior.
Returns:

List of possible values for estimators.

Return type:

FireflyResponse
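
For instance, narrowing the estimators to a given interpretability level; the enum member name is an assumption for illustration.

    import fireflyai
    from fireflyai.enums import InterpretabilityLevel

    estimators = fireflyai.Dataset.get_available_estimators(
        id=101,                                         # placeholder Dataset ID
        inter_level=InterpretabilityLevel.EXPLAINABLE,  # assumed enum member
    )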

classmethod get_available_pipeline(id: int, inter_level: fireflyai.enums.InterpretabilityLevel = None, api_key: str = None) → fireflyai.firefly_response.FireflyResponse[source]

Gets possible pipeline steps for a specific Dataset.

Parameters:
  • id (int) – Dataset ID to get possible pipeline.
  • inter_level (Optional[InterpretabilityLevel]) – Interpretability level.
  • api_key (Optional[str]) – Explicit api_key, not required if fireflyai.authenticate was run prior.
Returns:

List of possible values for pipeline.

Return type:

FireflyResponse

classmethod get_available_splitting_strategy(id: int, inter_level: fireflyai.enums.InterpretabilityLevel = None, api_key: str = None) → fireflyai.firefly_response.FireflyResponse[source]

Gets possible splitting strategies for a specific Dataset.

Parameters:
  • id (int) – Dataset ID to get possible splitting strategies.
  • inter_level (Optional[InterpretabilityLevel]) – Interpretability level.
  • api_key (Optional[str]) – Explicit api_key, not required if fireflyai.authenticate was run prior.
Returns:

List of possible values for splitting strategies.

Return type:

FireflyResponse

classmethod get_available_target_metric(id: int, inter_level: fireflyai.enums.InterpretabilityLevel = None, api_key: str = None) → fireflyai.firefly_response.FireflyResponse[source]

Gets possible target metrics for a specific Dataset.

Parameters:
  • id (int) – Dataset ID to get possible target metrics.
  • inter_level (Optional[InterpretabilityLevel]) – Interpretability level.
  • api_key (Optional[str]) – Explicit api_key, not required if fireflyai.authenticate was run prior.
Returns:

List of possible values for target metrics.

Return type:

FireflyResponse
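
The three queries above share the same shape as get_available_estimators, and their results are the natural inputs to train(); a sketch with a placeholder Dataset ID:

    import fireflyai

    dataset_id = 101  # placeholder
    pipeline = fireflyai.Dataset.get_available_pipeline(id=dataset_id)
    splitting = fireflyai.Dataset.get_available_splitting_strategy(id=dataset_id)
    metrics = fireflyai.Dataset.get_available_target_metric(id=dataset_id)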

classmethod get_by_name(name: str, api_key: str = None) → fireflyai.firefly_response.FireflyResponse[source]

Gets information on a specific Dataset identified by its name.

Information includes the state of the Dataset and other attributes. Similar to calling fireflyai.Dataset.list(filter_={'name': [NAME]}).

Parameters:
  • name (str) – Dataset name.
  • api_key (Optional[str]) – Explicit api_key, not required if fireflyai.authenticate was run prior.
Returns:

Information about the Dataset.

Return type:

FireflyResponse
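
For example, with a placeholder name:

    import fireflyai

    resp = fireflyai.Dataset.get_by_name(name="churn_dataset")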

classmethod list(search_term: str = None, page: int = None, page_size: int = None, sort: Dict[str, Union[str, int]] = None, filter_: Dict[str, Union[str, int]] = None, api_key: str = None) → fireflyai.firefly_response.FireflyResponse[source]

Lists the existing Datasets; supports filtering, sorting and pagination.

Parameters:
  • search_term (Optional[str]) – Return only records that contain the search_term in any field.
  • page (Optional[int]) – For pagination, which page to return.
  • page_size (Optional[int]) – For pagination, how many records will appear in a single page.
  • sort (Optional[Dict[str, Union[str, int]]]) – Dictionary of rules to sort the results by.
  • filter_ (Optional[Dict[str, Union[str, int]]]) – Dictionary of rules to filter the results by.
  • api_key (Optional[str]) – Explicit api_key, not required if fireflyai.authenticate() was run prior.
Returns:

Datasets are represented as nested dictionaries under the ‘hits’ key.

Return type:

FireflyResponse
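
A sketch of a filtered, paginated listing; the sortable field name 'creation_date' is an assumption, and the filter value follows the list form shown under get_by_name.

    import fireflyai

    resp = fireflyai.Dataset.list(
        page=1,
        page_size=20,
        sort={"creation_date": "desc"},       # assumed sortable field
        filter_={"name": ["churn_dataset"]},
    )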

classmethod train(task_name: str, dataset_id: int, estimators: List[fireflyai.enums.Estimator] = None, target_metric: fireflyai.enums.TargetMetric = None, splitting_strategy: fireflyai.enums.SplittingStrategy = None, notes: str = None, ensemble_size: int = None, max_models_num: int = None, single_model_timeout: int = None, pipeline: List[fireflyai.enums.Pipeline] = None, prediction_latency: int = None, interpretability_level: fireflyai.enums.InterpretabilityLevel = None, timeout: int = 7200, cost_matrix_weights: List[List[str]] = None, train_size: float = None, test_size: float = None, validation_size: float = None, fold_size: int = None, n_folds: int = None, horizon: int = None, validation_strategy: fireflyai.enums.ValidationStrategy = None, cv_strategy: fireflyai.enums.CVStrategy = None, forecast_horizon: int = None, model_life_time: int = None, refit_on_all: bool = None, wait: bool = False, skip_if_exists: bool = False, api_key: str = None) → fireflyai.firefly_response.FireflyResponse[source]

Creates and runs a training task.

A task is responsible for searching the hyperparameters that maximize model scores. The task constructs ensembles made of selected models; combining different models allows better decision making. Similar to calling fireflyai.Task.create(…).

Parameters:
  • task_name (str) – Task name.
  • dataset_id (int) – Dataset ID.
  • estimators (List[Estimator]) – Estimators to use in the train task.
  • target_metric (TargetMetric) – The metric that the search process attempts to optimize.
  • splitting_strategy (SplittingStrategy) – Splitting strategy for the data.
  • notes (Optional[str]) – Notes of the task.
  • ensemble_size (Optional[int]) – Maximum number of models in an ensemble.
  • max_models_num (Optional[int]) – Maximum number of models to train during search process.
  • single_model_timeout (Optional[int]) – Maximum time for training one model.
  • pipeline (Optional[List[Pipeline]]) – Pipeline steps to use in the train task.
  • prediction_latency (Optional[int]) – Maximum number of seconds ensemble prediction should take.
  • interpretability_level (Optional[InterpretabilityLevel]) – Determines how interpretable your ensemble is.
  • timeout (Optional[int]) – Timeout, in seconds, for the search process (default: 2 hours).
  • cost_matrix_weights (Optional[List[List[str]]]) – For classification and anomaly detection problems, the weights define a custom cost metric by assigning different weights to the entries of the confusion matrix.
  • train_size (Optional[float]) – The ratio of data taken for the train set of the model.
  • test_size (Optional[float]) – The ratio of data taken for the test set of the model.
  • validation_size (Optional[float]) – The ratio of data taken for the validation set of the model.
  • fold_size (Optional[int]) – Fold size when performing cross-validation splitting.
  • n_folds (Optional[int]) – Number of folds when performing cross-validation splitting.
  • validation_strategy (Optional[ValidationStrategy]) – Validation strategy used for the train task.
  • cv_strategy (Optional[CVStrategy]) – Cross-validation strategy to use for the train task.
  • horizon (Optional[int]) – Something related to time-series models. #TODO
  • forecast_horizon (Optional[int]) – Something related to time-series models. #TODO
  • model_life_time (Optional[int]) – Something related to time-series models. #TODO
  • refit_on_all (Optional[bool]) – Determines whether the final ensemble is refit on all data after the search process is done.
  • wait (Optional[bool]) – Whether the call should block until the task finishes.
  • skip_if_exists (Optional[bool]) – If True, skip creation when a Task with the same name already exists.
  • api_key (Optional[str]) – Explicit api_key, not required if fireflyai.authenticate() was run prior.
Returns:

Task ID if successful and wait=False, or the full Task if successful and wait=True; raises FireflyError otherwise.

Return type:

FireflyResponse
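
A sketch of launching a training task on a prepared Dataset; the enum members, split ratios and budgets below are illustrative assumptions, not recommended defaults.

    import fireflyai
    from fireflyai.enums import TargetMetric, InterpretabilityLevel

    task = fireflyai.Dataset.train(
        task_name="churn_task",
        dataset_id=101,                           # placeholder Dataset ID
        target_metric=TargetMetric.AUC,           # assumed enum member
        interpretability_level=InterpretabilityLevel.EXPLAINABLE,  # assumed enum member
        max_models_num=50,                        # cap the search at 50 models
        train_size=0.7, test_size=0.2, validation_size=0.1,
        timeout=3600,                             # one-hour search budget
        wait=False,                               # return the Task ID immediately
    )

With wait=False the response carries the Task ID, which can then be tracked through the Task API.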