AutoML configuration

{
  "train_data": {
    "file_type": "csv",        <string, Mandatory> Uploaded dataset file type. Can be csv or parquet.
    "file_name": "dados.csv",  <string, Mandatory> Uploaded dataset file name.
    "sep": ","                 <string, Mandatory for CSV> Separator for the csv file.
  },
  "model_flow": "classification",  <string, Mandatory> Model class; for now the only option is classification.
  "target": "TARGET",          <string, Mandatory> Name of the target column in the uploaded dataset.
  "cat_cols": ["ID"],          <list[string], Optional> Names of the columns that must be encoded as categorical. Default is an empty list (we will try to detect categorical columns).
  "iterations": 10,            <int, Optional> How many pipeline combinations will be tested. Default is 1.
  "metric": "ks",              <string, Optional> Metric used to pick the best model. For classification the options are auc, precision, recall, f1, gini, ks. Default is auc.
  "split_type": "random",      <string, Optional> How the training, validation and test datasets are split. Options are random, stratified (random, but trying to keep the same proportion of data between splits) and oot (validation is random, but the test set is split by date). Default is random.
  "val_size": 0.2,             <float, Optional> Proportion of the validation dataset relative to the full dataset. Default is 0.2.
  "holdout_size": 0.1,         <float, Optional> Proportion of the test dataset relative to the full dataset. Only used when split_type is random or stratified. Default is 0.1.
  "stratify_col": "TARGET",    <string, Optional> Column used to stratify the split (keeping the same proportion between splits). Only used when split_type is stratified. Default is the target column.
  "date_col": "DATE",          <string, Optional> Column used to find the most recent records. Only used when split_type is oot.
  "oot_split_size": 0.1,       <float, Optional> Fraction of the most recent data to use as the test dataset. When split_type is oot, either this or split_date must be informed. Default is 0.2.
  "split_date": "2020-01-01",  <string, Optional> Date used to filter the test dataset. When split_type is oot, either this or oot_split_size must be informed.
  "stages": {
    "models": ["lightgbm"],    <list[string], Optional> Algorithms to test. Options are logeg, catboost, xgboost, lightgbm, rf, dt. Default is to use all.
    "missing": ["mean"],       <list[string], Optional> Missing-value imputation methods to test. Options are mean, median, tail (replace missing data with a value at the left tail of the distribution), random and none (only works if the algorithm already handles missing data). Only used if the data has missing values. Default is to use all.
    "cleaner": ["iqr"],        <list[string], Optional> Outlier removal methods to test. Options are iqr, rare and none. Default is to use all.
    "encoding": ["catboost"],  <list[string], Optional> Categorical encoding methods to test. Options are rankcount, catboost, count and dt. Only used if the data has categorical columns. Default is to use all.
    "preprocess": ["none"],    <list[string], Optional> Scaler methods to test. Options are norm, robust, minmax, binarizer and none. Default is to use all.
    "unbalance": ["none"]      <list[string], Optional> Target balancing methods to test. Options are smote, random_under and none. Only used if the minority class of the target is less than 10% of the data. Default is to use all.
  }
}
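
As a point of reference, a minimal configuration only needs the fields marked Mandatory above; everything else falls back to the listed defaults (1 iteration, auc metric, random split with val_size 0.2 and holdout_size 0.1, and all stage options tested). The file name and column name below are illustrative:

{
  "train_data": {
    "file_type": "csv",
    "file_name": "dados.csv",
    "sep": ","
  },
  "model_flow": "classification",
  "target": "TARGET"
}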
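
For an out-of-time evaluation, split_type must be oot together with date_col and either split_date or oot_split_size, and the stages block can be used to narrow the search space. A sketch of such a configuration, again with illustrative values:

{
  "train_data": {
    "file_type": "parquet",
    "file_name": "dados.parquet"
  },
  "model_flow": "classification",
  "target": "TARGET",
  "iterations": 10,
  "metric": "ks",
  "split_type": "oot",
  "date_col": "DATE",
  "split_date": "2020-01-01",
  "stages": {
    "models": ["lightgbm", "xgboost"],
    "missing": ["median"],
    "encoding": ["catboost"]
  }
}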