DataSource module#

Module containing the classes and methods for managing data sources and datasets

MLOpsDataSourceClient#

class mlops_codex.datasources.MLOpsDataSourceClient(*, login: str | None = None, password: str | None = None, url: str | None = None)[source]#

Bases: BaseMLOpsClient

Client class for managing datasources

Parameters:
  • login (str) – Login for authenticating with the client. You can also use the env variable MLOPS_USER to set this

  • password (str) – Password for authenticating with the client. You can also use the env variable MLOPS_PASSWORD to set this

  • url (str) – URL of the MLOps server. Defaults to https://neomaril.datarisk.net/; use the default to test your deployment before switching to production. You can also use the env variable MLOPS_URL to set this

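Example

A minimal construction sketch; the login, password, and URL below are placeholder values. You can also set them through the MLOPS_USER, MLOPS_PASSWORD, and MLOPS_URL environment variables instead.

>>> from mlops_codex.datasources import MLOpsDataSourceClient
>>> client = MLOpsDataSourceClient(
...     login='user@example.com',
...     password='my_password',
...     url='https://neomaril.datarisk.net/'
... )
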
credentials_to_json(input_data: dict) → str[source]#

Save a credentials dictionary to a JSON file.

Parameters:

input_data (dict) – A dictionary to save.

Returns:

Path to the credentials file.

Return type:

str
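
Example

A minimal sketch; the dictionary keys below are placeholders, not a schema required by the method.

>>> path = client.credentials_to_json({'project_id': 'my-project', 'private_key': 'my-key'})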

get_datasource(*, datasource_name: str, provider: str, group: str)[source]#

Get an MLOpsDataSource object to perform datasource operations.

Parameters:
  • datasource_name (str) – Name given previously to the datasource.

  • provider (str) – It can be “Azure”, “AWS” or “GCP”.

  • group (str) – Name of the group in which to search for the datasource.

Returns:

A MLOpsDataSource object

Return type:

MLOpsDataSource

Example

>>> client.get_datasource(datasource_name='MyDataSourceName', provider='GCP', group='my_group')
list_datasources(*, provider: str, group: str)[source]#

List all datasources in the group for the given provider.

Parameters:
  • group (str) – Name of the group in which to search for datasources.

  • provider (str) – It can be “Azure”, “AWS” or “GCP”.

Returns:

A list of datasources information.

Return type:

list

Example

>>> client.list_datasources(provider='GCP', group='my_group')
register_datasource(*, datasource_name: str, provider: str, cloud_credentials: dict | str, group: str)[source]#

Register the user's cloud credentials so that MLOps can use the provider to download the datasource.

Parameters:
  • group (str) – Name of the group where the datasource will be registered.

  • datasource_name (str) – Name to identify the datasource.

  • provider (str) – It can be “Azure”, “AWS” or “GCP”.

  • cloud_credentials (dict | str) – Dict with the credentials to access the provider, or path to a JSON file containing them.

Returns:

A MLOpsDataSource object

Return type:

MLOpsDataSource

Example

>>> client.register_datasource(
...     datasource_name='MyDataSourceName',
...     provider='GCP',
...     cloud_credentials='./gcp_credentials.json',
...     group='my_group'
... )

MLOpsDataSource#


MLOpsDataset#

class mlops_codex.datasources.MLOpsDataset(*, login: str, password: str, url: str, dataset_hash: str, dataset_name: str, group: str, origin: str)[source]#

Bases: BaseModel

Class representing an MLOps dataset.

Parameters:
  • login (str) – Login for authenticating with the client.

  • password (str) – Password for authenticating with the client.

  • url (str) – URL of the MLOps server. Defaults to https://neomaril.datarisk.net; use the default to test your deployment before switching to production. You can also use the env variable MLOPS_URL to set this

  • dataset_hash (str) – Dataset hash to download.

  • dataset_name (str) – Name of the dataset.

  • group (str) – Name of the group in which to search for the dataset

  • origin (str) – Origin of the dataset. It can be “Training”, “Preprocessing”, “Datasource” or “Model”
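
Example

A minimal construction sketch with placeholder values; in practice the dataset hash, name, group, and origin come from an existing dataset on the server.

>>> from mlops_codex.datasources import MLOpsDataset
>>> dataset = MLOpsDataset(
...     login='user@example.com',
...     password='my_password',
...     url='https://neomaril.datarisk.net',
...     dataset_hash='my_dataset_hash',
...     dataset_name='my_dataset',
...     group='my_group',
...     origin='Datasource'
... )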

class Config[source]#

Bases: object

arbitrary_types_allowed = True#

dataset_hash: str#

dataset_name: str#

download(*, path: str = './', filename: str = 'dataset') → None[source]#

Download the dataset from MLOps. The downloaded file will be in CSV or parquet format.

Parameters:
  • path (str, optional) – Directory where the dataset will be saved. Defaults to './'.

  • filename (str, optional) – Name of the downloaded file. Defaults to 'dataset'; the '.csv' or '.parquet' extension depends on the dataset format.
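
Example

A minimal sketch, assuming dataset is an MLOpsDataset instance and the target directory already exists:

>>> dataset.download(path='./data', filename='my_dataset')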

group: str#

host_preprocessing(*, name: str, group: str, script_path: str, entrypoint_function_name: str, requirements_path: str, python_version: str | None = '3.9')[source]#

Host a preprocessing script via the dataset module. By default, the call hosts the script and waits for hosting to complete. It returns a MLOpsPreprocessingAsyncV2 object, which you can then run (see the sketch below).

Parameters:
  • name (str) – Name of the new preprocessing script

  • group (str) – Group of the new preprocessing script

  • script_path (str) – Path to the python script

  • entrypoint_function_name (str) – Name of the entrypoint function in the python script

  • python_version (str) – Python version for the model environment. Available versions are 3.8, 3.9, 3.10. Defaults to ‘3.9’

  • requirements_path (str) – Path to the requirements file

Returns:

Preprocessing async version of the new preprocessing script.

Return type:

MLOpsPreprocessingAsyncV2
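
Example

A minimal sketch with placeholder paths and names; 'process' below is assumed to be the name of the entrypoint function defined in the script.

>>> preprocessing = dataset.host_preprocessing(
...     name='my_preprocessing',
...     group='my_group',
...     script_path='./preprocess.py',
...     entrypoint_function_name='process',
...     requirements_path='./requirements.txt',
...     python_version='3.9'
... )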

login: str#

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

model_post_init(context: Any, /) → None#

We need to both initialize private attributes and call the user-defined model_post_init method.

origin: str#

password: str#

run_model()[source]#

run_preprocess(*, preprocessing_script_hash: str, execution_id: int)[source]#

Run a preprocessing script execution on the dataset. By default, the call runs the preprocessing script and waits until it completes.

Parameters:
  • preprocessing_script_hash (str) – Hash of the preprocessing script

  • execution_id (int) – Preprocessing Execution ID
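
Example

A minimal sketch; the hash and execution ID below are placeholders for an already hosted preprocessing script.

>>> dataset.run_preprocess(
...     preprocessing_script_hash='my_preprocessing_hash',
...     execution_id=1
... )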

train()[source]#

url: str#