Model Evaluation Metrics
How many True Positives, True Negatives, False Positives, and False Negatives are in the model above? Please enter your answer in that order, as four numbers separated by a comma and a space. For example, if your answers are 1, 2, 3, and 4, enter the string 1, 2, 3, 4.
Remember, in the image above the blue points are considered positives and the red points are considered negatives.
6, 5, 2, 1
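Here's a minimal sketch of counting these with scikit-learn. The labels below are made up so that they reproduce the counts from the quiz (6 TP, 5 TN, 2 FP, 1 FN); they are not the actual points from the image.

```python
from sklearn.metrics import confusion_matrix

# 1 = positive (blue), 0 = negative (red)
y_true = [1]*6 + [0]*5 + [0]*2 + [1]*1   # what the points really are
y_pred = [1]*6 + [0]*5 + [1]*2 + [0]*1   # what the model predicts

# .ravel() unpacks the 2x2 matrix as (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)   # 6 5 2 1
```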
Sometimes in the literature, you'll see False Positives and False Negatives as Type 1 and Type 2 errors. Here is the correspondence:
Type 1 Error (Error of the first kind, or False Positive): In the medical example, this is when we misdiagnose a healthy patient as sick.
Type 2 Error (Error of the second kind, or False Negative): In the medical example, this is when we misdiagnose a sick patient as healthy.
accuracy_score function
Accuracy is not always the best metric to use. Example: a fraud-detection model that says every transaction is not fraud can still score very high accuracy (because fraud is rare) while catching zero fraudulent transactions.
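A minimal sketch with made-up numbers: 1,000 transactions of which only 8 are fraud. A "model" that predicts "not fraud" for everything still gets 99.2% accuracy.

```python
from sklearn.metrics import accuracy_score

y_true = [1] * 8 + [0] * 992   # 1 = fraud, 0 = not fraud
y_pred = [0] * 1000            # predict "not fraud" for every transaction

print(accuracy_score(y_true, y_pred))   # 0.992, yet no fraud is caught
```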
What other metrics can we use?
It's important to first decide what matters more: avoiding false positives or avoiding false negatives. For email spam, sending one of your family's emails to the junk folder (a false positive) is worse than letting a spam email into your inbox (a false negative). So we want to put more emphasis on avoiding false positives: false positives are not OK, false negatives are tolerable.
For a medical example, it's the other way around: sending a sick patient home (a false negative) is the serious mistake, while telling a healthy person they're sick (a false positive) is more acceptable.
Ex:
For the medical model, precision is low, but that's OK given that what we really want is to avoid false negatives; false positives are acceptable.
For the spam model, recall is low, but that's OK since we want to avoid false positives.
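A minimal sketch computing precision and recall with scikit-learn, reusing the made-up labels from the confusion-matrix example above.

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1]*6 + [0]*5 + [0]*2 + [1]*1
y_pred = [1]*6 + [0]*5 + [1]*2 + [0]*1

# precision = TP / (TP + FP) = 6 / 8; recall = TP / (TP + FN) = 6 / 7
print(precision_score(y_true, y_pred))   # 0.75
print(recall_score(y_true, y_pred))      # ~0.857
```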
The idea is that we want one number that combines precision and recall to keep things simple. However, if we just take the average of the two, a lousy model with, say, 1% precision and 99% recall would get an average of 50%, which looks far better than that model deserves.
So instead we take the harmonic mean, which is always less than or equal to the regular mean and stays close to the smaller of the two numbers. This is the F1 score: F1 = 2 * (precision * recall) / (precision + recall).
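A quick sketch contrasting the plain average with the harmonic mean for the lopsided example above (precision = 1%, recall = 99%):

```python
precision, recall = 0.01, 0.99

average = (precision + recall) / 2                   # arithmetic mean
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean (F1 score)

print(average)   # 0.5     -> looks deceptively decent
print(f1)        # ~0.0198 -> stays close to the weaker number
```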
We want to calculate two ratios: the true positive rate (true positives divided by all actual positives) and the false positive rate (false positives divided by all actual negatives).
Now let's move the boundary around and record these two ratios at each position:
If we push the boundary all the way to one side, every point is classified as positive and we get the point (1, 1); at the opposite boundary, every point is classified as negative and we get (0, 0).
Now let's do this for all the possible splits, record the (false positive rate, true positive rate) pairs, and plot them; the area under that curve tells us how good the split is. For the good split, the area is somewhere between 0.5 and 1.
For a perfect split (hypothetical), the area is exactly 1.
For a random split, the area is around 0.5.
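A minimal sketch of the same idea with scikit-learn: sweep the boundary over a model's scores, collect the (false positive rate, true positive rate) pairs, and measure the area under the curve. The scores below are made up.

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 0, 0, 1, 1, 1, 1]
y_scores = [0.1, 0.3, 0.4, 0.8, 0.35, 0.6, 0.7, 0.9]   # model's confidence that each point is positive

fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # one point per boundary position
print(list(zip(fpr, tpr)))
print(roc_auc_score(y_true, y_scores))   # 0.75 here; 1.0 = perfect, ~0.5 = random
```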
What if you want to measure how well your algorithm performs at predicting numeric values? In these cases, there are three main metrics that are frequently used: mean absolute error (MAE), mean squared error (MSE), and the R2 value.
As an important note, optimizing on the mean absolute error may lead to a different 'best model' than optimizing on the mean squared error. However, optimizing on the mean squared error will always lead to the same 'best' model as optimizing on the R2 value.
Again, the model with the best (highest) R2 value will also be the model with the lowest MSE. Choosing one versus the other comes down to which one you feel most comfortable explaining to someone else.
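A minimal sketch, with made-up predictions from two hypothetical models, showing that the model with the lower MSE is also the one with the higher R2:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true  = np.array([2.0, 4.0, 6.0, 8.0])
model_a = np.array([2.1, 3.8, 6.3, 7.9])   # closer fit
model_b = np.array([3.0, 3.0, 7.5, 6.0])   # sloppier fit

print(mean_squared_error(y_true, model_a), r2_score(y_true, model_a))  # low MSE, high R2
print(mean_squared_error(y_true, model_b), r2_score(y_true, model_b))  # higher MSE, lower R2
```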
The mean absolute error is a useful metric to optimize on when the value you are trying to predict follows a skewed distribution. Optimizing on an absolute value is particularly helpful in these cases because outliers do not influence a model optimized on this metric as much as they do with the mean squared error. The optimal value for this metric is the median, whereas when you optimize for the mean squared error (or the R2 value), the optimal value is the mean.
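A minimal sketch of that median-versus-mean behavior, using made-up skewed data: predicting the median gives the lower MAE, while predicting the mean gives the lower MSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y = np.array([1, 2, 2, 3, 3, 4, 50])                       # skewed by one outlier
median_pred = np.full(y.shape, np.median(y))                # constant prediction = 3
mean_pred   = np.full(y.shape, np.mean(y))                  # constant prediction ~= 9.29

print(mean_absolute_error(y, median_pred), mean_absolute_error(y, mean_pred))  # median wins on MAE
print(mean_squared_error(y, median_pred), mean_squared_error(y, mean_pred))    # mean wins on MSE
```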
The mean squared error is by far the most common metric to optimize on in regression problems. As with MAE, you want to find a model that minimizes this value. This metric can be greatly impacted by skewed distributions and outliers, which is useful to keep in mind when a model looks optimal under MAE but not under MSE. In many cases it is easier to actually optimize on MSE, because the quadratic term is differentiable while the absolute value is not; this makes MSE better suited to gradient-based optimization algorithms.
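A quick sketch of MSE on made-up predictions, computed both with scikit-learn and by hand, to show it is just the average squared miss:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

print(mean_squared_error(y_true, y_pred))   # 0.375
print(np.mean((y_true - y_pred) ** 2))      # same value, by hand
```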
Finally, the R2 value is another common metric when looking at regression problems. Optimizing a model to have the lowest MSE will also optimize it to have the highest R2 value, which is a convenient feature of this metric. The R2 value is frequently interpreted as the 'amount of variability' captured by a model. Therefore, you can think of MSE as the average amount you miss by across all the points, and the R2 value as the fraction of the variability in the points that your model captures.
If the model is bad, its predictions shouldn't be very different from just guessing the average of the values of the points, so R2 = 1 - (the model's sum of squared errors) / (the sum of squared errors of predicting the mean) will be close to 0: the numerator and denominator of that ratio will be similar, so the ratio is close to 1.
If the model is good, then R2 will be close to 1.
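A minimal sketch, with made-up data, showing both ends of the R2 scale: a model that just predicts the mean scores about 0, while a close fit scores near 1.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

mean_pred = np.full_like(y_true, y_true.mean())   # "bad" model: always guess the mean
good_pred = np.array([1.1, 1.9, 3.2, 3.9, 5.1])   # "good" model: close to the truth

print(r2_score(y_true, mean_pred))   # 0.0
print(r2_score(y_true, good_pred))   # ~0.99
```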