Model selection, how to choose the one that fits my need

Regression Analysis

by Arnold Zephir

This is part 2 of a two-part series on model selection. Part 2 focuses on regression analysis. You can access part 1, on classifier models, here.

For a regression model, finding the fittest model, the one that best solves the business problem, can be trickier than for classification.

In most cases, regression models are used to predict some amount of something and build stockpiles. For example :

• you predict some treasury needs and keep some cash

• you predict a sale and make some stocks for it

So in regression models, there are two kinds of errors :

• underestimation: your model doesn’t predict enough and you miss opportunities because you did not make enough provisions

• overestimation: you spend too much time provisioning

Let me introduce you to three ways to estimate how much gain you're going to get from your model when put in production, from simplest to best.

Option 1: One metric to rule them all

When training a regression model, you set an objective to fit. This objective is one number computed on the whole dataset that reflects the error of your model.

For example it could be :

MAE: Mean Absolute Error. The average of the absolute value of your errors

| True Value | Your Prediction | Absolute Error |
|---|---|---|
| 150 | 160 | 10 |
| 236 | 230 | 6 |
| 845 | 837 | 8 |
| 145 | 138 | 7 |

Mean Absolute Error: (10 + 6 + 8 + 7) / 4 = 7.75

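MAPE: Mean Absolute Percentage Error. The average of the absolute errors relative to the true values

| True Value | Your Prediction | Absolute Relative Error (%) |
|---|---|---|
| 150 | 160 | 6.67 |
| 236 | 230 | 2.54 |
| 845 | 837 | 0.95 |
| 145 | 138 | 4.83 |

Mean Absolute Percentage Error: (6.67 + 2.54 + 0.95 + 4.83) / 4 = 3.75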

RMSE: Root Mean Squared Error. The square root of the average of the squared errors

etc.
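As a minimal sketch, the three metrics can be computed on the four rows of the example tables with plain Python. The function names are illustrative, not part of any specific library.

```python
import math

# Values taken from the example tables above.
y_true = [150, 236, 845, 145]
y_pred = [160, 230, 837, 138]


def mae(y_true, y_pred):
    """Mean Absolute Error: average of |prediction - truth|."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent (undefined if a true value is 0)."""
    return 100 * sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the average squared error."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


print(mae(y_true, y_pred))  # 7.75
print(round(mape(y_true, y_pred), 2))
print(round(rmse(y_true, y_pred), 2))
```

Note that MAPE divides by the true value, so it cannot be computed when a true value is 0.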

So the fastest and easiest way to select a model is to take the one with the best metric (in most cases, the lowest one), possibly also taking a look at inference speed and stability.

Figure: a list of models with their performance, speed and stability. The top chart shows performance by model (bar height) with its variance (whiskers). The bottom chart is a scatter plot of model response time vs. performance.

Yet this is not always suitable for real life, because:

• most metrics are computed assuming that a positive error and a negative error have the same impact

• metrics are computed over the whole dataset, but an error of 1 on an inventory of books may not have the same consequences as an error of 1 on an inventory of fresh meat

• metrics are computed on the train set, from the cross-validation file or a holdout file, so they assume you will see the same distribution of inputs when predicting. Yet you sometimes build your model on a 5-year dataset, so your metrics evaluate what you would have won (or lost) over the past 5 years. Be careful when using those metrics to judge your model for next week, as you cannot be sure that next week will be distributed exactly like the past 5 years

Let’s see some examples where the one metric to fit them all does not work.

Inventory

You build a model to predict the volume of items to stockpile, because the storage costs you money.

You fit your model on the Root Mean Squared Error (RMSE). That looks like a good choice, because RMSE is based on a sum of squares and will therefore lessen the errors on the largest volumes. Yet:

• an underestimation of 10 m³ (error = -10) weighs as much in your model as an overestimation of 10 m³ (error = 10), but only the second one will cost you money, because you are going to provision 10 m³ too much

• squared errors are summed, so 10 errors of 3.16 m³ (10 * 3.16² ≈ 99.9) weigh approximately as much as one error of 10 m³ (1 * 10² = 100). Yet a single error of 10 m³ may cost you far more than 10 errors of 3.16 m³, because you need to provide a higher ceiling.
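The two points above can be checked numerically. This is a small sketch: `overstock_cost` is a hypothetical business cost, assumed here to charge a flat price per m³ of overestimation and nothing for underestimation.

```python
import math


def rmse(errors):
    """Root Mean Squared Error of a list of signed errors."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))


def overstock_cost(errors, cost_per_m3=1.0):
    """Hypothetical business cost: only overestimation (error > 0) is paid for."""
    return sum(max(e, 0.0) for e in errors) * cost_per_m3


under = [-10.0]  # one underestimation of 10 m3
over = [10.0]    # one overestimation of 10 m3

print(rmse(under), rmse(over))                      # identical RMSE
print(overstock_cost(under), overstock_cost(over))  # different business cost

# Sum of squared errors: ten errors of 3.16 vs one error of 10.
sse_small = sum(e ** 2 for e in [3.16] * 10)
sse_big = sum(e ** 2 for e in [10.0])
print(round(sse_small, 2), sse_big)
```

The metric cannot tell the two situations apart, while the assumed storage cost does.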

Cash On Hand

You built a model to predict the amount of cash needed for the next period and fit it on one of the standard metrics (MAE, MAPE, RMSE or any other).

Cash-on-hand models are one of the rare cases where negative and positive errors compensate for each other: if you stockpile too much on one day (positive error) but not enough on the next day (negative error), the two offset.

So what really matters here is the sum of your signed errors, which metrics like MAE or RMSE don’t take into account.
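A short sketch of the difference, using made-up daily errors (prediction minus actual need, in dollars):

```python
# Hypothetical daily cash errors (prediction - actual need), in dollars.
errors = [500, -450, 300, -320, 80]

# MAE ignores the sign, so every error counts against the model.
mae = sum(abs(e) for e in errors) / len(errors)

# The signed sum lets surpluses and shortfalls offset each other.
net_error = sum(errors)

print(mae)        # 330.0
print(net_error)  # 110
```

Here the MAE suggests an error of \$330 a day, while the net position over the five days is only \$110 off.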

Sales

You built a model to predict the unit sales of food in a food truck. Making an error on meat is not the same as making an error on buns as buns can be stocked for 2 weeks while meat has a shorter shelf life.
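One possible way to express this is to weight errors by product. The sketch below assumes a made-up cost per unsold unit for each product (higher for meat, which spoils quickly); these numbers are illustrative only and would have to come from the stakeholder.

```python
# Hypothetical cost of one unsold unit, per product (assumed values).
UNSOLD_UNIT_COST = {"meat": 4.0, "buns": 0.2}

# Made-up rows for illustration: (product, true sales, predicted sales).
rows = [("meat", 40, 45), ("buns", 100, 105)]


def overstock_loss(rows, unit_cost):
    """Sum the cost of unsold units per product; underestimation is ignored here."""
    loss = {}
    for product, true, predicted in rows:
        unsold = max(predicted - true, 0)
        loss[product] = loss.get(product, 0.0) + unsold * unit_cost[product]
    return loss


loss_by_product = overstock_loss(rows, UNSOLD_UNIT_COST)
print(loss_by_product)
```

Both products have the same absolute error of 5 units, yet the assumed cost of that error is very different.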

Evaluating the gain (or loss) of your model just from the objective metric is neither reliable nor easy. So, in most cases, selecting your regression model just from the metric used to fit it is not enough. You need to build a custom loss function with your stakeholder and apply it to your cross-validation predictions.

Option 2: Custom detailed metrics

Let us look at a better way. Each time you build a model, you probably have a cross-validation file generated (if that’s not the case, you should). In the Prevision.io platform, each trained model comes with a cross-validation file, which you can get from the web UI or the SDK.

A cross-validation file has at least 3 columns:

• an ID to join on original dataset

• the expected value ( real value to predict )

• the value predicted by the model

Figure: a typical cross-validation file
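As a sketch, such a file can be read with the Python standard library. The column names `id`, `target` and `prediction` are assumptions made for this example, and the two rows are made up; a real file may use different names.

```python
import csv
import io

# Stand-in for a real cross-validation file; column names are assumptions.
cv_file = io.StringIO(
    "id,target,prediction\n"
    "a,8,10\n"
    "b,12,9\n"
)

rows = [
    {"id": r["id"], "target": float(r["target"]), "prediction": float(r["prediction"])}
    for r in csv.DictReader(cv_file)
]

# Signed error per row: positive means overestimation.
errors = [r["prediction"] - r["target"] for r in rows]
print(errors)  # [2.0, -3.0]
```

Keeping the ID column makes it possible to join the errors back to the original dataset and inspect where the model fails.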

With this file, you should build a custom loss function in order to properly evaluate your model. A loss function takes all your predicted values and real values and returns one number that reflects the model performance according to your business objectives.

Let’s say your model predicts a number of sales and your (very simplified) supply chain works like this:

• each predicted item costs you \$10 (transport and storage)

• each sale earns you \$13

• and of course you can only sell an item if you have it in stock, so each day you sell:

    • only as much as people asked for, even if you had provisioned more (overestimation)

    • only what you had provisioned, if the model underestimated

So your total gain is:

$$\text{loss} = \sum_{i=0}^{n} \left( \min(\text{predicted}_i, \text{true}_i) \times 13 - \text{predicted}_i \times 10 \right)$$

For example, if your model predicted 10, logistics cost 10 * \$10 = \$100. If you make 8 sales, you earn 8 * \$13 = \$104. That is a \$4 benefit. A perfect model would have earned you \$24 instead.

The same computation on one row of the cross-validation file: 68802 = min(24925, 22934) * \$13 - 22934 * \$10
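The gain described above can be written as a small function. This is a sketch of the simplified supply chain only; the constant and function names are illustrative, and the last call assumes that 22934 is the predicted value, since it is the one multiplied by the \$10 cost.

```python
UNIT_COST = 10   # dollars per provisioned item (transport and storage)
UNIT_PRICE = 13  # dollars earned per sale


def gain(true, predicted):
    """Gain for one period: you sell at most what was provisioned."""
    return min(predicted, true) * UNIT_PRICE - predicted * UNIT_COST


def total_gain(y_true, y_pred):
    """Sum of the per-period gains over a whole cross-validation file."""
    return sum(gain(t, p) for t, p in zip(y_true, y_pred))


print(gain(true=8, predicted=10))          # 4
print(gain(true=8, predicted=8))           # 24, the perfect model
print(gain(true=24925, predicted=22934))   # 68802
```

Comparing `total_gain` of a model with `total_gain(y_true, y_true)` gives the shortfall relative to a perfect model.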

From the Cross Validation data above and your custom loss you can compute these KPIs:

• MAE: 5337 items of error a day

• MAPE: 333.9% (coming from days with very few sales and a lot predicted. MAPE is rarely a good metric, especially when the range of values to predict is wide, and even worse when actual values are close to 0)

• RMSE: 7923

• Custom metric: the model “gain” (a loss, in fact) is about -\$1.7M, that is, \$1.7M less than a perfect model

From the same data, you could build another model with the following metrics:

• MAE: 5337

• RMSE: 7923

• Custom metric: -\$5.7M

We see that RMSE stays the same (7923) but the custom loss function evaluates to -\$5.7M, a far worse loss than the previous model.

So with the same data science metrics, we can get a very different business result.
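A toy illustration of how this can happen, with made-up numbers (not the article's dataset): one model always overestimates by 20 items, the other always underestimates by 20. The \$10 cost and \$13 price are those of the simplified supply chain.

```python
import math

UNIT_COST, UNIT_PRICE = 10, 13


def total_gain(y_true, y_pred):
    """Custom business gain: sell at most what was provisioned."""
    return sum(min(p, t) * UNIT_PRICE - p * UNIT_COST for t, p in zip(y_true, y_pred))


def mae(y_true, y_pred):
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [100, 200, 300]                 # made-up daily sales
model_over = [t + 20 for t in y_true]    # always provisions 20 too many
model_under = [t - 20 for t in y_true]   # always provisions 20 too few
perfect = total_gain(y_true, y_true)

for name, y_pred in [("over", model_over), ("under", model_under)]:
    # Same MAE and RMSE for both models, different gap to the perfect model.
    print(name, mae(y_true, y_pred), rmse(y_true, y_pred), total_gain(y_true, y_pred) - perfect)
```

Both models score an MAE and RMSE of 20, but under this cost structure overestimating loses \$600 against the perfect model while underestimating loses \$180.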

(Remember this is not a true loss but a comparison with a perfect model, if you had one. To evaluate your true gain, you should compare it with your current method. In the case above, using a basic statistical model would have yielded a loss of -\$10M, so using a good model at -\$1.7M is in fact a gain of roughly \$8M.)

Data science metrics  for regression are built for mathematical and statistical reasons but should not replace your business loss function. It’s your duty as a data scientist to define a loss function with the stakeholder that reflects the real expected gain from using your models and then apply it on correctly built cross validation files in order to select the fittest model.

Yet, even with a custom function applied to a cross-validation predictions file or a holdout file, one problem remains: the metrics hold only if the distribution seen on the train set, which often spans many years, is the same next week, which is rarely true.

Option 3: Continuous error-weighted prediction

In most cases, your train set is several million rows of data acquired over several years of transactions.

When you compute a loss or a metric, its value is computed on the whole dataset. For example, the MAE is the mean of absolute error on each prediction.

You could build a chart like this one, which shows the error (or your loss function value) for each value of your prediction:

Figure: residuals vs. predicted value on the cross-validation file

Yet, as metrics are computed on the whole set, their value depends a lot on the distribution of your predictions.

In most cases, here is what happens: on the train set, which encompasses many years of sales, your predicted values may range from 0 to 13,000 items sold a day, for example, but over the next seven days they may only span from 2,700 to 6,550 (because there are no holidays, or you no longer sell some kind of article, or for many other reasons that could have shifted the distribution).

So you should not expect your loss function value over the next weeks to match its average over the last 5 years, because your distribution has probably shifted.
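To make the effect concrete, here is a sketch that compares the MAE over the whole historical range with the MAE restricted to the range expected next week (2,700 to 6,550). The five (true, predicted) pairs are made up for illustration.

```python
def mae_in_range(rows, low, high):
    """MAE restricted to rows whose prediction falls in [low, high]."""
    selected = [abs(p - t) for t, p in rows if low <= p <= high]
    if not selected:
        raise ValueError("no prediction falls in the requested range")
    return sum(selected) / len(selected)


# Made-up (true, predicted) pairs spanning the full historical range.
rows = [(100, 400), (3000, 3100), (5000, 4900), (6000, 6200), (12000, 9000)]

overall = mae_in_range(rows, 0, 13000)       # whole dataset
next_week = mae_in_range(rows, 2700, 6550)   # range expected next week

print(overall)
print(round(next_week, 2))
```

In this toy case the overall MAE is dominated by one large error at the top of the range, which is irrelevant if next week's predictions stay in the middle of the range.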

There are two ways to tackle this problem.

Solution 1 : the correct mathematical way

Fit a distribution on your train set target, using a Gaussian mixture model for example, then fit another one on the error distribution vs. target, compute the convolution product of the two and generate expected values for next week with a Monte Carlo simulation.
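The sketch below shows only the Monte Carlo part of this idea, under strong simplifying assumptions: a single Gaussian stands in for the fitted mixture, the error is drawn from a Gaussian whose spread is proportional to the target, and drawing target plus error by simulation replaces the explicit convolution. All parameter values are made up, and the gain uses the \$10 cost and \$13 price of the simplified supply chain.

```python
import random

UNIT_COST, UNIT_PRICE = 10, 13


def simulate_expected_gain(n_draws, target_mean, target_std, error_std_ratio, seed=0):
    """Average gain per period over simulated (true, predicted) pairs.

    Assumptions: next week's target ~ Normal(target_mean, target_std), and the
    model error ~ Normal(0, error_std_ratio * true). Both are stand-ins for
    distributions that would be fitted on the train set.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        true = max(rng.gauss(target_mean, target_std), 0.0)
        error = rng.gauss(0.0, error_std_ratio * true)
        predicted = max(true + error, 0.0)
        total += min(predicted, true) * UNIT_PRICE - predicted * UNIT_COST
    return total / n_draws


# Hypothetical next-week distribution centred on 4,500 items a day.
perfect = simulate_expected_gain(10_000, 4500, 900, error_std_ratio=0.0)
noisy = simulate_expected_gain(10_000, 4500, 900, error_std_ratio=0.2)
print(round(perfect), round(noisy))
```

Running the same simulation with the error distribution of each candidate model gives an expected gain per model for next week's conditions.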

OR

Solution 2 : the tricky and clever way

Build your models, then compute your loss function on each row of your cross-validation predictions for each of your models. Then build other models that predict your loss function value from the cross-validation file, and apply them to your predictions for next week.

Use the model that has the best expectation for next week. Do that each week.

By doing so, you are sure to use the fittest model given the expected event of the next week (or next month or whatever) and based on your loss function.

This respects both your loss function definition and the current distribution of your events (and of course, as soon as you start to use a model in production, start monitoring your distribution drift to prevent model collapse).
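Here is a minimal sketch of that weekly selection loop. As a stand-in for the second-level models, it uses the average gain per bin of predicted value, which is a deliberately crude choice; any regressor could take its place. The model names, bin width and all data are made up, and the gain is the one from the simplified supply chain (\$10 cost, \$13 price).

```python
UNIT_COST, UNIT_PRICE = 10, 13
BIN_WIDTH = 1000  # hypothetical bin size for the stand-in gain model


def gain(true, predicted):
    return min(predicted, true) * UNIT_PRICE - predicted * UNIT_COST


def fit_gain_model(cv_rows):
    """Stand-in meta-model: average per-row gain for each bin of predicted value."""
    sums, counts = {}, {}
    for true, predicted in cv_rows:
        b = int(predicted // BIN_WIDTH)
        sums[b] = sums.get(b, 0.0) + gain(true, predicted)
        counts[b] = counts.get(b, 0) + 1
    return {b: sums[b] / counts[b] for b in sums}


def expected_gain(gain_model, next_predictions):
    """Expected total gain for next week; bins never seen in CV count as 0."""
    return sum(gain_model.get(int(p // BIN_WIDTH), 0.0) for p in next_predictions)


# Made-up cross-validation rows (true, predicted) for two candidate models.
cv = {
    "model_a": [(500, 600), (800, 700), (5000, 5500), (5200, 5600)],
    "model_b": [(500, 900), (800, 950), (5000, 5000), (5200, 5100)],
}
# Made-up predictions of each model for next week (a high-volume week).
next_week = {"model_a": [5400, 5700], "model_b": [5050, 5300]}

expected = {
    name: expected_gain(fit_gain_model(cv[name]), next_week[name]) for name in cv
}
best = max(expected, key=expected.get)
print(expected, best)
```

With these numbers `model_b` is selected for a high-volume week, even though `model_a` has the better gain on low-volume days; rerunning the selection each week lets the choice follow the expected distribution.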

Conclusion

Machine Learning and AI should always be used to solve a true problem. In order to select the fittest model, rather than the best one from a data science metrics point of view, you should compute your custom loss function, as it relates to the expected gain (or loss reduction).

This must be done with your stakeholders, so that they get the model that best solves their problem.