Dataset Module#

Module with all classes and methods to manage an MLOps dataset.

MLOpsDataset#

class mlops_codex.dataset.MLOpsDataset(*, login: str, password: str, url: str, dataset_hash: str, dataset_name: str, group: str, origin: str)[source]#

Bases: BaseModel

Dataset class representing an MLOps dataset.

Parameters:
  • login (str) – Login for authenticating with the client.

  • password (str) – Password for authenticating with the client.

  • url (str) – URL of the MLOps server. The default value is https://neomaril.datarisk.net; use it to test your deployment before switching to production. You can also set this via the MLOPS_URL environment variable.

  • dataset_hash (str) – Dataset hash to download.

  • dataset_name (str) – Name of the dataset.

  • group (str) – Name of the group in which to search for the dataset.

  • origin (str) – Origin of the dataset. It can be “Training”, “Preprocessing”, “Datasource” or “Model”
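Example

A minimal construction sketch; the credentials, hash and names below are placeholders, and in practice MLOpsDataset instances are usually obtained through MLOpsDatasetClient rather than built by hand:

>>> from mlops_codex.dataset import MLOpsDataset
>>> dataset = MLOpsDataset(
...     login='user@example.com',
...     password='<password>',
...     url='https://neomaril.datarisk.net',
...     dataset_hash='<dataset_hash>',
...     dataset_name='my_dataset',
...     group='my_group',
...     origin='Training',
... )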

dataset_hash: str#
dataset_name: str#
download(*, path: str = './', filename: str = 'dataset') → None[source]#

Download a dataset from MLOps. The dataset will be a CSV or Parquet file.

Parameters:
  • path (str, optional) – Path to the downloaded dataset. Defaults to ‘./’.

  • filename (str, optional) – Name of the downloaded dataset file. Defaults to ‘dataset’; the file is saved as ‘dataset.csv’ or ‘dataset.parquet’ depending on its format.
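Example

A usage sketch (the path and filename are placeholders):

>>> dataset.download(path='./data', filename='my_dataset')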

group: str#
host_preprocessing(*, name: str, group: str, script_path: str, entrypoint_function_name: str, requirements_path: str, python_version: str | None = '3.9')[source]#

Host a preprocessing script via the dataset module. By default, this call hosts the script and waits for the hosting to finish. It returns an MLOpsPreprocessingAsyncV2 instance, which you can then run.

Parameters:
  • name (str) – Name of the new preprocessing script

  • group (str) – Group of the new preprocessing script

  • script_path (str) – Path to the python script

  • entrypoint_function_name (str) – Name of the entrypoint function in the python script

  • requirements_path (str) – Path to the requirements file

  • python_version (str) – Python version for the model environment. Available versions are 3.8, 3.9, 3.10. Defaults to ‘3.9’

Returns:

Async preprocessing object for the new preprocessing script.

Return type:

MLOpsPreprocessingAsyncV2
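Example

A hosting sketch; every name and path below is a placeholder:

>>> preprocessing = dataset.host_preprocessing(
...     name='my_preprocessing',
...     group='my_group',
...     script_path='./preprocess.py',
...     entrypoint_function_name='process',
...     requirements_path='./requirements.txt',
...     python_version='3.9',
... )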

login: str#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model, should be a dictionary conforming to pydantic's ConfigDict.

origin: str#
password: str#
run_preprocess(*, preprocessing_script_hash: str, execution_id: int)[source]#

Run a preprocessing script execution on a dataset. By default, this call runs the preprocessing script and waits until it completes.

Parameters:
  • preprocessing_script_hash (str) – Hash of the preprocessing script

  • execution_id (int) – Preprocessing Execution ID
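Example

A usage sketch (the script hash and execution ID are placeholders):

>>> dataset.run_preprocess(
...     preprocessing_script_hash='<preprocessing_script_hash>',
...     execution_id=1,
... )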

url: str#

MLOpsDatasetClient#

class mlops_codex.dataset.MLOpsDatasetClient(login: str, password: str, url: str)[source]#

Bases: BaseMLOpsClient

Class for performing actions on a dataset.

Parameters:
  • login (str) – Login for authenticating with the client. You can also set this via the MLOPS_USER environment variable.

  • password (str) – Password for authenticating with the client. You can also set this via the MLOPS_PASSWORD environment variable.

  • url (str) – URL of the MLOps server. The default value is https://neomaril.datarisk.net; use it to test your deployment before switching to production. You can also set this via the MLOPS_URL environment variable.
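Example

An instantiation sketch (the credentials are placeholders; the environment variables mentioned above may be used instead):

>>> from mlops_codex.dataset import MLOpsDatasetClient
>>> client = MLOpsDatasetClient(
...     login='user@example.com',
...     password='<password>',
...     url='https://neomaril.datarisk.net',
... )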

delete(group: str, dataset_hash: str) → None[source]#

Delete the dataset on MLOps. Use this action with care: it is irreversible!

Parameters:
  • group (str) – Group of the dataset to delete.

  • dataset_hash (str) – Dataset hash to delete.

Example

>>> client.delete(group='<group>', dataset_hash='<dataset_hash>')
download(group: str, dataset_hash: str, path: str | None = './', filename: str | None = 'dataset') → None[source]#

Download a dataset from MLOps. The dataset will be a CSV or Parquet file.

Parameters:
  • group (str) – Name of the group

  • dataset_hash (str) – Dataset hash

  • path (str, optional) – Path to the downloaded dataset. Defaults to ‘./’.

  • filename (str, optional) – Name of the downloaded dataset file. Defaults to ‘dataset’.

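Example

A usage sketch (group, hash, path and filename are placeholders):

>>> client.download(
...     group='my_group',
...     dataset_hash='<dataset_hash>',
...     path='./data',
...     filename='my_dataset',
... )
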
list_datasets(*, origin: str | None = None, origin_id: int | None = None, datasource_name: str | None = None, group: str | None = None) → List[source]#

List datasets, optionally filtering by origin, origin ID, datasource name or group.

Parameters:
  • origin (Optional[str]) – Origin of a dataset. It can be “Training”, “Preprocessing”, “Datasource” or “Model”

  • origin_id (Optional[int]) – Integer that represents the ID of a dataset, given an origin

  • datasource_name (Optional[str]) – Name of the datasource

  • group (Optional[str]) – Name of the group in which to search for the dataset

Returns:

A list with information about the datasets.

Return type:

list

Example

>>> client.list_datasets()
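You can also filter the results; the values below are placeholders:

>>> client.list_datasets(origin='Training', group='my_group')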