Dataset Module#
Module with all classes and methods to manage an MLOps Dataset
MLOpsDataset#
- class mlops_codex.dataset.MLOpsDataset(*, login: str, password: str, url: str, dataset_hash: str, dataset_name: str, group: str, origin: str)[source]#
Bases:
BaseModel
Dataset class that represents an MLOps dataset.
- Parameters:
login (str) – Login for authenticating with the client.
password (str) – Password for authenticating with the client.
url (str) – URL to the MLOps server. The default value is https://neomaril.datarisk.net; use it to test your deployment before switching to production. You can also set this via the MLOPS_URL environment variable.
dataset_hash (str) – Dataset hash to download.
dataset_name (str) – Name of the dataset.
group (str) – Name of the group in which to search for the dataset
origin (str) – Origin of the dataset. It can be “Training”, “Preprocessing”, “Datasource” or “Model”
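Example
A minimal construction sketch; the credentials, hash, and names below are placeholder values:
>>> from mlops_codex.dataset import MLOpsDataset
>>> dataset = MLOpsDataset(
...     login="user@example.com",  # placeholder credentials
...     password="********",
...     url="https://neomaril.datarisk.net",
...     dataset_hash="D0123456789",  # hypothetical dataset hash
...     dataset_name="my_dataset",
...     group="my_group",
...     origin="Training",
... )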
- dataset_hash: str#
- dataset_name: str#
- download(*, path: str = './', filename: str = 'dataset') None [source]#
Download a dataset from MLOps. The dataset will be saved as a CSV or Parquet file.
- Parameters:
path (str, optional) – Path to the downloaded dataset. Defaults to ‘./’.
filename (str, optional) – Name of the downloaded dataset file. Defaults to ‘dataset’; the ‘.csv’ or ‘.parquet’ extension is added according to the file type.
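Example
A usage sketch with placeholder arguments:
>>> dataset.download(path="./data", filename="my_dataset")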
- group: str#
- host_preprocessing(*, name: str, group: str, script_path: str, entrypoint_function_name: str, requirements_path: str, python_version: str | None = '3.9')[source]#
Host a preprocessing script via the dataset module. By default, the client hosts the script and waits for hosting to complete. It returns an MLOpsPreprocessingAsyncV2 instance, which you can then run.
- Parameters:
name (str) – Name of the new preprocessing script
group (str) – Group of the new preprocessing script
script_path (str) – Path to the python script
entrypoint_function_name (str) – Name of the entrypoint function in the python script
python_version (str) – Python version for the preprocessing script environment. Available versions are 3.8, 3.9 and 3.10. Defaults to ‘3.9’
requirements_path (str) – Path to the requirements file
- Returns:
Preprocessing async version of the new preprocessing script.
- Return type:
MLOpsPreprocessingAsyncV2
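Example
A sketch assuming a local script preprocess.py whose entrypoint function is named process; all names and paths are placeholders:
>>> preprocessing = dataset.host_preprocessing(
...     name="my_preprocessing",
...     group="my_group",
...     script_path="./preprocess.py",  # hypothetical script
...     entrypoint_function_name="process",
...     requirements_path="./requirements.txt",
...     python_version="3.9",
... )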
- login: str#
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- origin: str#
- password: str#
- run_preprocess(*, preprocessing_script_hash: str, execution_id: int)[source]#
Run a preprocessing script execution on a dataset. By default, the client runs the preprocessing script and waits until it completes.
- Parameters:
preprocessing_script_hash (str) – Hash of the preprocessing script
execution_id (int) – Preprocessing Execution ID
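Example
A usage sketch; the script hash and execution ID are placeholders:
>>> dataset.run_preprocess(
...     preprocessing_script_hash="S0123456789",  # hypothetical hash
...     execution_id=1,
... )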
- url: str#
MLOpsDatasetClient#
- class mlops_codex.dataset.MLOpsDatasetClient(login: str, password: str, url: str)[source]#
Bases:
BaseMLOpsClient
Class for performing actions on datasets.
- Parameters:
login (str) – Login for authenticating with the client. You can also set this via the MLOPS_USER environment variable.
password (str) – Password for authenticating with the client. You can also set this via the MLOPS_PASSWORD environment variable.
url (str) – URL to the MLOps server. The default value is https://neomaril.datarisk.net; use it to test your deployment before switching to production. You can also set this via the MLOPS_URL environment variable.
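Example
A construction sketch with placeholder credentials; alternatively, set MLOPS_USER, MLOPS_PASSWORD, and MLOPS_URL in the environment:
>>> from mlops_codex.dataset import MLOpsDatasetClient
>>> client = MLOpsDatasetClient(
...     login="user@example.com",  # placeholder credentials
...     password="********",
...     url="https://neomaril.datarisk.net",
... )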
- delete(group: str, dataset_hash: str) None [source]#
Delete a dataset on MLOps. Use caution: this action is irreversible!
- Parameters:
group (str) – Name of the group that contains the dataset.
dataset_hash (str) – Dataset hash to delete.
Example
>>> client.delete(group="my_group", dataset_hash="D0123456789")  # placeholder values
- download(group: str, dataset_hash: str, path: str | None = './', filename: str | None = 'dataset') None [source]#
Download a dataset from MLOps. The dataset will be saved as a CSV or Parquet file.
- Parameters:
group (str) – Name of the group
dataset_hash (str) – Dataset hash
path (str, optional) – Path to the downloaded dataset. Defaults to ‘./’.
filename (str, optional) – Name of the downloaded dataset file. Defaults to ‘dataset’; the ‘.csv’ or ‘.parquet’ extension is added according to the file type.
- Raises:
AuthenticationError – Raised if there is an authentication issue.
DatasetNotFoundError – Raised if there is no dataset with the given name.
ServerError – Raised if the server encounters an issue.
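Example
A usage sketch; the group and dataset hash are placeholders:
>>> client.download(
...     group="my_group",
...     dataset_hash="D0123456789",  # hypothetical dataset hash
...     path="./data",
...     filename="my_dataset",
... )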
- list_datasets(*, origin: str | None = None, origin_id: int | None = None, datasource_name: str | None = None, group: str | None = None) List [source]#
List datasets, optionally filtered by origin, datasource name, or group.
- Parameters:
origin (Optional[str]) – Origin of a dataset. It can be “Training”, “Preprocessing”, “Datasource” or “Model”
origin_id (Optional[int]) – Integer that represents the ID of a dataset, given an origin
datasource_name (Optional[str]) – Name of the datasource
group (Optional[str]) – Name of the group in which to search for datasets
- Returns:
A list with information about the datasets.
- Return type:
list
Example
>>> client.list_datasets()