Training your model
===================

Training your model in Neomaril logs everything in a model registry and saves time by running experiments in parallel. The training module is also connected to the deployment and monitoring modules, so training in Neomaril speeds up the next steps as well.

Training types
--------------

First we need to create a training experiment. A training experiment aggregates multiple training runs, and each run may or may not eventually become a deployed model.

**Custom:** the user writes the whole training code and defines the training environment.

**AutoML:** the training is predefined by the AutoML module. The required files are the training data and a set of configuration parameters for the package.

**External:** the model is trained on your own machine or anywhere else outside Neomaril. You only need to upload it if you want to monitor the model in our environment.

Creating the training experiment
--------------------------------

We can create the experiment using the :py:meth:`neomaril_codex.training.NeomarilTrainingClient.create_training_experiment` method.

.. code:: python

    from neomaril_codex.training import NeomarilTrainingClient

    # Start the client
    training_client = NeomarilTrainingClient()

    # Creating a new training experiment
    training = training_client.create_training_experiment(
        experiment_name='Teste notebook',  # Experiment name, this is how you find your model in MLflow
        model_type='Classification',       # Model type. Can be Classification, Regression or Unsupervised
        group='datarisk'                    # This is the default group. Create a new one when starting a new project
    )

The method returns a :py:class:`neomaril_codex.training.NeomarilTrainingExperiment`, where you can create multiple executions that work like versions of the main experiment. Then you need to actually upload a dataset and the training code for your execution (if it is a Custom experiment) or the configuration (for an AutoML experiment).
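For a Custom experiment, it helps to keep everything a run needs in a single folder before uploading. The layout below is only an illustration using the file names from the examples later in this page; the data, the source file with the entrypoint and the requirements file are the ones actually required, the others are optional.

.. code:: text

    samples/train/
    ├── app.py            # source file containing the train_model entrypoint
    ├── dados.csv         # training data sent in 'train_data'
    ├── requirements.txt  # package requirements for the execution
    ├── utils.py          # optional extra file uploaded via 'extra_files'
    └── .env              # optional environment variables sent in 'env'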
Running a training execution
----------------------------

For the Custom experiment you need an entrypoint function like this:

.. code:: python

    import os

    import pandas as pd
    from lightgbm import LGBMClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline


    # Function that provides the steps to train the model based on a given dataset.
    # Its name (train_model) must be passed in the 'training_reference' field.
    # Its parameter (base_path) will contain the folder path for the files that will be used,
    # for example: "/path/to/custom/training/experiment1"
    def train_model(base_path):

        # Environment variables loaded from a user-supplied file in the 'env' field
        # my_var = os.getenv('MY_VAR')
        # if my_var is None:
        #     raise Exception("Could not find `env` variable")

        # Environment variable provided by Neomaril with the dataset file name
        # (an alternative to hard-coding the file name below)
        # df = pd.read_csv(base_path + '/' + os.getenv('inputFileName'))

        # Load the dataset (data.csv), which must have the same name as the file
        # sent to Neomaril in the 'train_data' field. The data sent must be the complete
        # data for training and validation (excluding the final validation sample);
        # how to treat it here is at the user's discretion.
        df = pd.read_csv(base_path + "/data.csv")

        X = df.drop(columns=['target'])  # Separate the features from the target column
        y = df[["target"]]               # Save the target in a dataframe

        pipe = make_pipeline(SimpleImputer(), LGBMClassifier())  # Define the steps to train the model

        # In this example we use cross-validation, but this is at the user's discretion
        auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")  # Validate the results using the AUC metric
        f_score = cross_val_score(pipe, X, y, cv=5, scoring="f1")   # Validate the results using the F1 metric

        pipe.fit(X, y)  # Train the model

        # Build the DataFrame with the results
        results = pd.DataFrame({"pred": pipe.predict(X), "proba": pipe.predict_proba(X)[:, 1]})

        # Return the training results according to the keys expected by Neomaril
        return {"X_train": X, "y_train": y, "model_output": results, "pipeline": pipe,
                "metrics": {"auc": auc.mean(), "f1_score": f_score.mean()}}

The only parameter of the function is the path to the data files, so Neomaril can execute it once the files are uploaded. Inside a custom training experiment you can do whatever you want: test multiple algorithms, optimize hyperparameters, validate on multiple segments of the data. What matters is the return of the function, where we collect information about the final model of this version so we can log it.

The return must be a dictionary with the following keys:

- `X_train`: The dataframe that will be used to fit the model.
- `y_train`: The target dataframe/array/series that will be used to fit the model.
- `model_output`: A dataframe/array/series with the outputs of the model. This can be the predicted values/probabilities, classes or any other useful information. This information needs to be in the output of the future deployed model to be used in monitoring.
- `pipeline`: The final fitted model instance. Ideally it should be a `Scikit-Learn Pipeline <https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html>`_, but any other algorithm class that implements the *get_params* method works. This will be saved as `model.pkl` with `cloudpickle <https://github.com/cloudpipe/cloudpickle>`_, or with the `save_model` method if the algorithm class has one.
- `extra`: An optional list of filenames for extra files generated during training, such as plots or validation datasets. They need to be saved in the same path that is provided as the function parameter (see the sketch after this list).
- `metrics`: A dictionary with one key per metric. You can use any name for the metric key and save as many as you want, but the value must be numeric. E.g.: `{"auc_train": 0.7, "auc_test": 0.65}`
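As an illustration of the `extra` key, the entrypoint above could also persist an artifact next to the data and list its file name in the return. This is only a sketch of the last part of `train_model`; the artifact name and content are made up for the example.

.. code:: python

    # ...inside train_model(base_path), after pipe.fit(X, y):

    # Save an extra artifact in the same folder that Neomaril passed as base_path
    feature_importance = pd.DataFrame(
        {"feature": X.columns, "importance": pipe[-1].feature_importances_}
    )
    feature_importance.to_csv(base_path + "/feature_importance.csv", index=False)

    return {"X_train": X, "y_train": y, "model_output": results, "pipeline": pipe,
            "extra": ["feature_importance.csv"],  # file names only; the files live in base_path
            "metrics": {"auc": auc.mean(), "f1_score": f_score.mean()}}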
Besides that, we also need the information for the environment (Python version and package requirements). Then we can call the :py:meth:`neomaril_codex.training.NeomarilTrainingExperiment.run_training` method.

.. code:: python

    # With the experiment class we can create multiple model runs
    PATH = './samples/train/'

    run1 = training.run_training(
        run_name='First test',                      # Run name
        train_data=PATH+'dados.csv',                # Path to the file with the training data
        training_type='Custom',                     # Training type. Can be External, Custom or AutoML
        source_file=PATH+'app.py',                  # Path of the source file
        requirements_file=PATH+'requirements.txt',  # Path of the requirements file
        # env=PATH+'.env',                          # File with env variables (this will be encrypted in the server)
        # extra_files=[PATH+'utils.py'],            # List of extra file paths to upload along (they will all be in the same folder)
        training_reference='train_model',           # Name of the entrypoint function to be called inside the source file
        python_version='3.9',                       # Can be 3.8 to 3.10
        wait_complete=True
    )

For AutoML we just need the data and the configuration parameters. Check :doc:`automl_parameters` for more details.

.. code:: python

    PATH = './samples/autoML/'

    run2 = training.run_training(
        run_name='First test',        # Run name
        training_type='AutoML',       # Training type. Can be External, Custom or AutoML
        train_data=PATH+'dados.csv',  # Path to the file with the training data
        conf_dict=PATH+'conf.json',   # Path of the configuration file
        wait_complete=True
    )

See the example below, which uses a Python script to perform and save an External training:

.. code:: python

    import matplotlib.pyplot as plt
    import pandas as pd
    from lightgbm import LGBMClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    from neomaril_codex.training import NeomarilTrainingClient

    # Start the training client
    client = NeomarilTrainingClient()

    # Create an experiment
    training = client.create_training_experiment('Teste', 'Classification', group='datarisk')

    # Your variables
    base_path = './samples/train/'
    df = pd.read_csv(base_path+"/dados.csv")
    X = df.drop(columns=['target'])
    y = df[["target"]]

    # Build a scatter plot of two features
    plt.scatter(df["mean_radius"], df["mean_texture"])
    plt.title("Relationship between mean_radius and mean_texture")  # Graph title
    plt.xlabel("mean_radius")  # Configure the axes
    plt.ylabel("mean_texture")
    fig = plt.gcf()
    plt.show()

    # Build a pipeline
    pipe = make_pipeline(SimpleImputer(), LGBMClassifier(force_col_wise=True))

    # Log the model and save the metrics and model output
    with training.log_train('Teste 1', X, y) as logger:
        pipe.fit(X, y)
        logger.save_model(pipe)

        logger.add_extra('./extra.txt')
        logger.save_and_add_plot(fig, 'graphic1')

        model_output = pd.DataFrame({"pred": pipe.predict(X), "proba": pipe.predict_proba(X)[:, 1]})
        logger.save_model_output(model_output)

        auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
        f_score = cross_val_score(pipe, X, y, cv=5, scoring="f1")
        logger.save_metric(name='auc', value=auc.mean())
        logger.save_metric(name='f1_score', value=f_score.mean())

        logger.set_python_version('3.10')
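If you track many metrics, a small loop keeps the logging block above short. This is only a convenience sketch that would replace the two `save_metric` calls inside the same `with` block:

.. code:: python

    # Inside the `with training.log_train(...) as logger:` block
    metrics = {"auc": auc.mean(), "f1_score": f_score.mean()}
    for name, value in metrics.items():
        logger.save_metric(name=name, value=value)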
Checking the execution results
------------------------------

The return of :py:meth:`neomaril_codex.training.NeomarilTrainingExperiment.run_training` is a :py:class:`neomaril_codex.training.NeomarilTrainingExecution` instance. With this class we can follow the asynchronous execution of that experiment version and check information about it.

.. code:: python

    run1.get_status()
    #>>> {'trainingExecutionId': '3', 'Status': 'Running', 'Message': None}

    run1.execution_data
    #>>> {'TrainingHash': 'T48c2371e453418f9859aba957de85cbcf84928d62a048b48f0259b49054a639',
    #     'ExperimentName': 'Teste notebook Training custom',
    #     'GroupName': 'datarisk',
    #     'ModelType': 'Classification',
    #     'TrainingType': 'Custom',
    #     'ExecutionId': 3,
    #     'ExecutionState': 'Running',
    #     'RunData': {},
    #     'RunAt': '2023-05-25T17:37:07.8850840Z',
    #     'Status': 'Requested'}

We can also download the results (the model file and the files saved under the `extra` key):

.. code:: python

    run1.download_result()
    #>>> 2023-05-26 10:02:13.441 | INFO | neomaril_codex.base:download_result:376 - Output saved in ./output_2.zip

If the model is good enough we can start the deploying process.
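If you prefer to launch a run with `wait_complete=False` and poll it yourself before downloading, a minimal sketch is shown below. The only status value documented above is `'Running'`; the terminal status strings your server returns are an assumption here, so inspect the output of `get_status()` in your environment before relying on them.

.. code:: python

    import time

    # Poll the asynchronous execution until it leaves the 'Requested'/'Running' states
    status = run1.get_status()
    while status.get('Status') in ('Requested', 'Running'):
        time.sleep(30)
        status = run1.get_status()

    print(status)           # e.g. a success/failure status (labels assumed, check your server)
    run1.download_result()  # grab the model file and extra artifacts once the run is done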