by Zeineb Ghrib
In this post, I will show you how the automated machine learning capabilities of Prevision.io can boost your chances of moving up the leaderboard of a machine learning competition. Some would consider it cheating, but who says the winners did not use the same means to end up ahead of the competition :) .
For this post, we chose the most well-known machine learning competition platform worldwide: Kaggle. This platform offers many advantages: a large array of machine learning competitions and datasets, discussion forums, a Jupyter notebook environment, and notebooks shared by other Kagglers.
The competition we chose is House Prices - Advanced Regression Techniques: the goal is to predict house prices from a large variety of house features.
We will use Prevision.io to quickly build a baseline model, and explore some analysis elements that we can use for data exploration and feature engineering ideas.
Once you are connected to your Prevision.io instance (go to the Try it now button on www.prevision.io for a free trial if you need access), click the button at the top right of the home page to create a new project. You can set the name of your project and add a short description (optional):
To import the competition dataset, download it beforehand from the Kaggle platform, then upload it to your project: click the Datasets tab on the left vertical bar, then click the Create Dataset button.
Then select the Import Dataset option and upload your dataset from your machine:
Import Dataset View
Here we don’t have much choice: we have to respect the metric that will be used to evaluate submissions:
Here it is mentioned that:
```Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price.```
So the metric we will choose is RMSLE - Root Mean Squared Logarithmic Error.
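To make the metric concrete: RMSE on log-transformed prices is exactly the RMSLE. A minimal sketch in Python (the function name `rmsle` and the toy prices are mine, for illustration only):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmsle(y_true, y_pred):
    """RMSE computed on the logarithm of the prices -- the Kaggle metric."""
    return np.sqrt(mean_squared_error(np.log(y_true), np.log(y_pred)))

# Toy sanity check with made-up sale prices
print(rmsle(np.array([200000.0, 150000.0]), np.array([210000.0, 140000.0])))
```

Because the error is taken on the log scale, being off by 10% costs roughly the same whether the house is worth $100k or $1M, which is why this metric suits house prices.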
We will select the “Quick” profile and the default model selection config.
The configuration that has to be set is:
the corresponding metric (RMSLE)
the dataset (imported in the previous section)
the target and the Id column
Please note that Prevision.io integrates a large array of built-in feature engineering steps that can be selected or unselected:
Frequency encoding: modalities are converted to their respective frequencies in the dataset
Target encoding: modalities are replaced by the average of the target, grouped by modality
Polynomial features: features based on products of existing features are created.
PCA: main components of the PCA
K-means: K-means cluster numbers are added as new features
Row statistics: features based on row-by-row counts are added as new features (number of zeros, number of missing values, …)
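The first two encodings in this list are easy to reproduce by hand. Here is a minimal pandas sketch of frequency and target encoding (the column names `Neighborhood` and `SalePrice` echo the competition data, but the values are made up):

```python
import pandas as pd

# Toy frame standing in for one categorical house feature and the target
df = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "C", "B", "A"],
    "SalePrice":    [200, 210, 150, 300, 160, 205],
})

# Frequency encoding: each modality is replaced by its frequency in the dataset
freq = df["Neighborhood"].value_counts(normalize=True)
df["Neighborhood_freq"] = df["Neighborhood"].map(freq)

# Target encoding: each modality is replaced by the mean target of that modality
target_mean = df.groupby("Neighborhood")["SalePrice"].mean()
df["Neighborhood_te"] = df["Neighborhood"].map(target_mean)
```

Note that naive target encoding like this leaks the target; this is why, as mentioned below, such encodings should be computed inside the cross-validation folds rather than on the full training set.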
For more information check the official documentation here
Note: Some feature engineering steps have to be done manually in the context of a competition, such as
Feature extraction - creating new features from existing ones (addition, subtraction, division, aggregation, ...)
Advanced missing value imputation depending on the feature type & distribution
Discarding out-of-range data that appear in the train set but not in the test set, in order to force the data distributions to match
In the first experiment, I got a cross-validation performance of 0.134. You can then make a first submission:
1- Upload the test dataset to your project workspace
2- Go to the “Predictions” tab and launch new predictions:
Select the best model
Select the test dataset
Launch the predictions
Then download the predictions and submit them on the Kaggle platform to get an idea of your rank with this baseline model.
My first submission ranked 1248/4719, which is not too bad for a first try.
In the first iteration, I selected Linear Regression and XGBoost. You can create new “versions” of your experiment using other types of models: for example, in the second version I selected CatBoost and XGBoost, and CatBoost performed better. In the third version it was interesting to compare CatBoost with LightGBM, etc.
Another simple yet efficient technique to slightly increase model performance consists in increasing the number of folds.
This can be done directly in the Prevision.io UI by changing the training profile from Quick (which uses 3 folds) to Normal (4 folds) or Advanced (5 folds):
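Outside the platform, the same idea is a one-parameter change in scikit-learn. A sketch on synthetic data (the `Ridge` model and the data are placeholders, not what Prevision.io trains internally):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# More folds -> each model trains on more data, and the CV estimate is steadier
for n_folds in (3, 4, 5):  # mirrors the Quick / Normal / Advanced profiles
    scores = cross_val_score(Ridge(), X, y, cv=n_folds,
                             scoring="neg_root_mean_squared_error")
    print(n_folds, "folds -> mean RMSE:", -scores.mean())
```

The gain is usually small but nearly free; the cost is simply training one or two more fold models.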
Please take into account that some built-in feature engineering transformations (statistics-based encodings such as frequency/target encoding, or PCA/k-means based features) are computed within the cross-validation folds.
Usually, the most badly predicted samples are likely mislabeled or noisy. You can get a cleaner dataset by dropping them, which will increase model performance:
First, download the cross-validation predictions of your model, extract the top 5% worst-predicted samples, drop them from the training dataset, and re-launch the experiment.
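The filtering step might look like the following sketch. The CV file layout here (`y_true`/`y_pred` columns) is an assumption for illustration, and the "truth" is simulated; with the real downloaded file you would only keep the quantile-threshold logic:

```python
import numpy as np
import pandas as pd

# Simulated stand-in for the downloaded cross-validation file:
# one row per training sample with its true and out-of-fold predicted price
rng_true = np.random.default_rng(0)
rng_noise = np.random.default_rng(1)
cv = pd.DataFrame({"Id": range(100)})
cv["y_true"] = rng_true.lognormal(mean=12, sigma=0.4, size=100)
cv["y_pred"] = cv["y_true"] * rng_noise.normal(1.0, 0.1, size=100)

# Per-sample error on the log scale, to stay consistent with the RMSLE metric
cv["log_error"] = (np.log(cv["y_true"]) - np.log(cv["y_pred"])).abs()

# Keep everything except the 5% worst-predicted samples
threshold = cv["log_error"].quantile(0.95)
clean_ids = cv.loc[cv["log_error"] <= threshold, "Id"]
```

`clean_ids` is then used to filter the training dataset before re-uploading it and re-launching the experiment.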
Model blending is a type of model stacking that many Kagglers use to increase performance. The technique consists in:
training a diverse set of models on the original features of your training dataset (1-level models),
training 2-level models on the cross-validation predictions of the 1-level models,
obtaining a 3-level model by averaging the 2-level models.
⇒ It usually ends up with a more efficient overall model than a single one.
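The first two levels of this scheme correspond to what scikit-learn calls stacking. A minimal sketch on synthetic data (the choice of `Ridge` and `GradientBoostingRegressor` as base models is mine, not Prevision.io's actual model pool):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=300, n_features=15, noise=15.0, random_state=0)

# 1-level models: diverse learners trained on the original features
level1 = [
    ("ridge", Ridge()),
    ("gbm", GradientBoostingRegressor(random_state=0)),
]

# 2-level model: trained on the out-of-fold (cross-validation) predictions
# of the 1-level models, so it learns how to weight them without leakage
stack = StackingRegressor(estimators=level1, final_estimator=Ridge(), cv=5)
stack.fit(X, y)
```

A 3-level model would then simply average several such stacks built with different base-model pools.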
3-Level Model Stacking used in Prevision.io
Remark: keep in mind that this type of model is rarely used in real business projects. Stacked models are slow, which is impractical when the model runs in production in a latency-sensitive system, and above all they are not easily explainable (not ideal for non-specialists who want to understand the results provided by the model).
This step can be very tedious when done manually. Fortunately, with Prevision.io all you have to do is switch on the “Blend” option in the model configuration:
Models settings in Experiment Configuration Tab
Prevision.io Execution Graph
The execution graph allows you to see the progress of the tasks in Prevision.io. It is also on this graph that you can find the 3-level models described above.
A very common technique that Kagglers use to move up the leaderboard is called pseudo-labelling: it consists in adding confidently predicted test data to your training data. Check out this excellent post for more information about pseudo-labelling.
1- Select your best model found by Prevision.io
2- Predict your test submission using the confidence option: it adds confidence interval columns
3- Add confident predicted test observations to the initial training data
4- Re-launch a new experiment on the combined data
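Step 3 might be filtered like this. The column names (`pred`, `pred_lower`, `pred_upper`) and the 20% width threshold are assumptions for illustration; adapt them to the actual columns of your downloaded prediction file:

```python
import pandas as pd

# Toy stand-in for a prediction file with confidence interval columns
preds = pd.DataFrame({
    "Id": range(5),
    "pred":       [200000, 150000, 310000,  95000, 180000],
    "pred_lower": [190000, 100000, 300000,  40000, 175000],
    "pred_upper": [210000, 200000, 320000, 150000, 185000],
})

# Treat a prediction as "confident" when its interval is narrow
# relative to the predicted price (threshold chosen arbitrarily here)
preds["rel_width"] = (preds["pred_upper"] - preds["pred_lower"]) / preds["pred"]
confident = preds[preds["rel_width"] < 0.2]

# Use the prediction as the label and append these rows to the training set
pseudo = confident.rename(columns={"pred": "SalePrice"})[["Id", "SalePrice"]]
```

The risk to watch for is confirmation bias: if too many low-confidence rows slip in, the model reinforces its own mistakes, so it pays to keep the threshold strict.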
Submission Score using Prevision.io
Top 3% on House Prices Competition
By combining the advice described above with the variety of features offered by Prevision.io, you can achieve a score of 0.11511, ranking 123 out of 4731 (top 3%).
To repeat the YouTuber Khabane Lame’s famous gesture:
Khabane Lame Image from Deep Dream Generator
I hope you enjoyed the post, and that you’ll try out Prevision.io to participate in data science competitions. It will save you a lot of time and automate many operations that can be very painful to implement manually! Sometimes it takes me a whole weekend to barely send two submissions; with the Prevision.io platform it is straightforward, and you can test different experiment configurations with just a few clicks.
Prevision.io brings powerful AI management capabilities to data science users so more AI projects make it into production and stay in production. Our purpose-built AI Management platform was designed by data scientists for data scientists and citizen data scientists to scale their value, domain expertise, and impact. The platform manages the hidden complexities and burdensome tasks that get in the way of realizing the tremendous productivity and performance gains AI can deliver across your business.