SK-Learn Training / Testing 

All models in the SK-Learn folder inherit from the same training class, which gives every model the following training and evaluation methods:
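The exact class names are internal to the library, but the sketch below illustrates the shape of that inheritance. TrainingBase and RandomForestModel are stand-in names for illustration, not the library's real identifiers; the signatures simply mirror the parameters documented in this section.

```python
class TrainingBase:
    """Hypothetical shared training class; every model inherits these methods."""

    def plot_learning_curve(self, batch_size=100, scale=False,
                            scaling_tool="standard", resample=False,
                            resample_ratio=1, early_stopping=False,
                            cut_off=30):
        """Batched training that records a learning curve."""

    def quick_train(self, scale=False, scaling_tool="standard",
                    resample=False, resample_ratio=1):
        """Single-cycle training; no curve, no early stopping."""

    def show_learning_curve(self, save=False):
        """Display (and optionally save) the recorded learning curve."""

    def show_roc_curve(self, save=False):
        """Display (and optionally save) the ROC curve."""

    def evaluate_outcome(self):
        """Evaluate on the test split; caches the report for deployment."""

    def evaluate_cross_validation(self, n_splits=10):
        """Cross-validate and print the average score."""


class RandomForestModel(TrainingBase):
    """Stand-in for any model class defined in the SK-Learn folder."""
```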

plot_learning_curve

This trains the model. It is slower than quick_train, but it produces a learning curve and lets you stop training early. The data defined in the model is split into training and test sets. A usage sketch follows the parameter list below.

Parameters

  • batch_size = default is 100. This is the number of datapoints used for each training step. For large datasets, a larger batch_size is recommended.
  • scale = default is False. If set to True, a data scaler is fitted on the training data and stored. The training data is then scaled, and the test data is scaled with the same scaler that was fitted on the training data. The saved scaler is packaged with the model when deployed so it can be used in production.
  • scaling_tool = default is "standard". The scaling tool can also be "min max" or "normalize". Be careful with normalizing, though: it simply preprocesses the data, and because it is not fitted to any data it is not stored like the other scalers.
  • resample = default is False. If set to True, the training data is resampled using the SMOTE algorithm to address unbalanced data. This happens after the data is split into training and test sets, so that resampled datapoints cannot bleed into the test data and give a falsely high accuracy in testing.
  • resample_ratio = default is 1. This is the ratio of one outcome class to the other. Left at 1, half of the training data belongs to one category and half to the other (a 1:1 ratio).
  • early_stopping = default is False. If set to True, training stops after a defined number of iterations. Only set this to True once you have seen a full learning curve and know where the model starts to overfit.
  • cut_off = default is 30. This is the number of iterations after which the model stops training; the right value varies with the batch_size. Once a full learning curve has been produced, if you can see where the algorithm starts to overfit, set cut_off to the x-value on the graph where that happens.
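A minimal usage sketch, continuing the hypothetical RandomForestModel from above (the class name and the data wiring are assumptions; any model from the SK-Learn folder exposes the same call):

```python
model = RandomForestModel()  # hypothetical model class; its data is defined on the model

# Batched training with a learning curve. early_stopping and cut_off are only
# set because a previous full curve showed overfitting from ~30 iterations on.
model.plot_learning_curve(
    batch_size=200,           # more datapoints per step for a larger dataset
    scale=True,               # fit and store a scaler on the training split
    scaling_tool="standard",
    resample=True,            # SMOTE on the training split only
    resample_ratio=1,         # rebalance to a 1:1 class ratio
    early_stopping=True,
    cut_off=30,               # stop at the iteration where overfitting began
)
```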

quick_train

This is much the same as plot_learning_curve, except that all the training is done in one cycle. The upside is that it is much quicker; the downside is that no learning curve can be displayed and early stopping is not supported. A usage sketch follows the parameter list below.

Parameters

  • scale = default is False. If set to True, a data scaler is fitted on the training data and stored. The training data is then scaled, and the test data is scaled with the same scaler that was fitted on the training data. The saved scaler is packaged with the model when deployed so it can be used in production.
  • scaling_tool = default is "standard". The scaling tool can also be "min max" or "normalize". Be careful with normalizing, though: it simply preprocesses the data, and because it is not fitted to any data it is not stored like the other scalers.
  • resample = default is False. If set to True, the training data is resampled using the SMOTE algorithm to address unbalanced data. This happens after the data is split into training and test sets, so that resampled datapoints cannot bleed into the test data and give a falsely high accuracy in testing.
  • resample_ratio = default is 1. This is the ratio of one outcome class to the other. Left at 1, half of the training data belongs to one category and half to the other (a 1:1 ratio).
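A corresponding sketch for quick_train, under the same assumptions as above:

```python
model = RandomForestModel()  # hypothetical model class from the SK-Learn folder

# One training cycle: fastest option, but no learning curve or early stopping.
model.quick_train(
    scale=True,
    scaling_tool="min max",  # fitted and stored, like "standard"; "normalize" is not
    resample=True,
    resample_ratio=1,
)
```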

show_learning_curve

This function displays the learning curve. It must only be called after plot_learning_curve has been run; see the sketch after the parameter list below.

Parameters

  • save = default is False. If set to True, the learning curve is saved as a file in the folder where the script is running.
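A short sketch of the expected call order, again using the hypothetical model from above:

```python
model = RandomForestModel()
model.plot_learning_curve(batch_size=100)  # must run first to record the curve
model.show_learning_curve(save=True)       # display and write the figure to disk
```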

show_roc_curve

This function displays the ROC curve, showing the trade-off between true positives and false positives; a short sketch follows the parameter list.

Parameters

  • save = default is False. If set to True, the ROC curve is saved as a file in the folder where the script is running.
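For example, continuing the hypothetical sketch above after training has run:

```python
model.show_roc_curve(save=True)  # display the ROC curve and save it to disk
```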

evaluate_outcome

This function takes no arguments. It fetches the test data and evaluates the model, reporting accuracy and recall. The report is also cached so it can be packaged when the model is deployed.
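A minimal sketch, assuming a model trained as in the earlier examples:

```python
model.evaluate_outcome()  # prints accuracy and recall; the report is cached for deployment
```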

evaluate_cross_validation

This function runs a cross-validation test and prints the average score. To access the list of all the individual scores from the cross-validation, read the cross_val attribute, as shown in the sketch after the parameter list.

Parameters

  • n_splits = default is 10. This is the number of folds used for cross-validation.
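A short sketch under the same assumptions as the earlier examples, including reading the per-fold scores:

```python
model.evaluate_cross_validation(n_splits=5)  # prints the average score
print(model.cross_val)                       # list of the individual fold scores
```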