Model Assessment


Lecture note for lecture 17 - 28/10/2020

Konthalapalli Hradini


Performance metrics are used to determine whether an algorithm has performed well. In general, an algorithm has done well if its accuracy on the test data is high, which implies that it generalises well.


Regression

The main performance metric for regression is the error on the test data, typically the Mean Squared Error ($MSE$) or the Mean Absolute Error ($MAE$). Suppose the predicted output of a regression model is $ \hat{y_{i}} $ and the actual output is $ y_{i} $.

Mean Squared Error $$ MSE = \frac{1}{N} \sum^N_{i=1} (y_{i} - \hat{y_{i}})^{2}$$

Mean Absolute Error $$ MAE = \frac{1}{N} \sum^N_{i=1} \vert y_{i} - \hat{y_{i}} \vert $$

$MSE$ gives more weight to outliers than $MAE$ because the difference between the predicted and actual output is squared. A high error on the test data means the model generalises poorly, for example because it has overfit the training data (which can be addressed with regularisation or more data) or because it is too simple to capture the underlying pattern (underfitting).
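As a quick sketch of these two formulas, here is a small Python example with made-up predictions (NumPy is assumed to be available):

```python
import numpy as np

# Hypothetical actual and predicted outputs for a small test set
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

# Mean Squared Error: average of squared residuals (penalises outliers more)
mse = np.mean((y_true - y_pred) ** 2)

# Mean Absolute Error: average of absolute residuals
mae = np.mean(np.abs(y_true - y_pred))

print(f"MSE = {mse:.3f}, MAE = {mae:.3f}")
```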


Classification

For any classification problem, we can create a confusion matrix to assess the performance of the model. In a confusion matrix, rows correspond to the ground truth (actual classes) and columns correspond to the predicted classes. Here is a confusion matrix for a Spam - Not Spam classification model.


Confusion Matrix

                          Predicted: Spam           Predicted: Not Spam
Ground truth: Spam        True Positive ($TP$)      False Negative ($FN$)
Ground truth: Not Spam    False Positive ($FP$)     True Negative ($TN$)

From the confusion matrix, we can calculate precision, recall and accuracy.

Precision tells us the proportion of actual spam emails among all the emails classified as spam. $$ Precision =\frac{TP}{TP + FP} $$

Recall tells us the proportion of actual spam emails that were classified as spam among all the actual spam emails. Recall is also called the True Positive Rate ($TPR$). $$ Recall =\frac{TP}{TP + FN} $$

Accuracy tells us the proportion of correctly classified emails among all the classified emails. $$ Accuracy =\frac{TP + TN}{TP + TN + FP + FN} $$ In some cases, we might require a higher accuracy for one class than for the others, or a false negative may be much worse than a false positive (as in disease diagnosis). Such cost-sensitive accuracy can be achieved by assigning different weights to different misclassifications.

Ideally, we would want both precision and recall to be high, but there is usually a trade-off between them. Suppose we want to increase recall: we need to decrease false negatives, which typically means predicting the positive class more often (for example, by lowering the decision threshold). This increases false positives, so precision decreases. We therefore need to decide which metric is more important for the problem at hand. In spam detection, precision is more important than recall (we do not want to lose important emails to the spam filter). In airport security, recall is more important than precision (it is acceptable to manually check a few extra bags rather than miss a threat).
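The three formulas can be sanity-checked with a small Python sketch; the confusion-matrix counts below are made up for illustration:

```python
# Hypothetical confusion-matrix counts for a spam classifier
TP, FN = 40, 10    # actual spam: correctly caught vs. missed
FP, TN = 5, 945    # actual non-spam: wrongly flagged vs. correctly passed

precision = TP / (TP + FP)                    # of emails flagged as spam, how many really are
recall    = TP / (TP + FN)                    # of actual spam, how much was caught (TPR)
accuracy  = (TP + TN) / (TP + TN + FP + FN)   # overall fraction classified correctly

print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"accuracy  = {accuracy:.3f}")
```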

The Receiver Operating Characteristic (ROC) curve is a plot of the True Positive Rate (Recall) against the False Positive Rate ($FPR$) for varying threshold values. As the threshold is varied, the number of misclassified labels changes. The red diagonal line corresponds to random guessing (a 50% chance); if the ROC curve goes below this diagonal, the model is worse than random chance. $$ TPR = \frac{TP}{TP + FN} \quad FPR = \frac{FP}{FP +TN} $$

Sample ROC curve

The best classification will have $FN = 0$ and $FP = 0$. This means that $TPR = 1$ and $FPR = 0$.

The Area Under the ROC Curve (AUC) is equal to one for the best classifier. For $AUC = 1$, the point $(FPR, TPR) = (0, 1)$ must lie on the curve, i.e. $TPR = 1$ and $FPR = 0$ for the best classification.
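As a sketch of how AUC is obtained in practice, the area can be approximated numerically from a set of $(FPR, TPR)$ points using the trapezoidal rule; the points below are hypothetical:

```python
import numpy as np

# Hypothetical (FPR, TPR) points on an ROC curve, sorted by increasing FPR;
# (0, 0) and (1, 1) are always part of the curve.
fpr = np.array([0.0, 0.1, 0.4, 1.0])
tpr = np.array([0.0, 0.6, 0.9, 1.0])

# Trapezoidal approximation of the area under the curve
auc = np.trapz(tpr, fpr)
print(f"AUC = {auc:.3f}")

# A perfect classifier passes through (FPR, TPR) = (0, 1), giving AUC = 1
perfect_auc = np.trapz([0.0, 1.0, 1.0], [0.0, 0.0, 1.0])
print(f"perfect classifier AUC = {perfect_auc:.1f}")
```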


If $FPR$ is 1, why is $TPR$ also 1?

Different thresholds

In this diagram, the black circles represent the negative class and the white circles the positive class. The vertical lines represent three different thresholds: everything to the right of a line is predicted positive, and everything to the left is predicted negative.

Consider threshold 1: $ TP = 3$, $TN =5$, $FP =1$ and $FN = 3$. $$ TPR = \frac{3}{6} \quad FPR = \frac{1}{6}$$

Consider threshold 2: $ TP = 6$, $TN =2$, $FP =4$ and $FN = 0$. $$ TPR = \frac{6}{6} \quad FPR = \frac{4}{6}$$

Consider threshold 3: $ TP = 6$, $TN =0$, $FP =6$ and $FN = 0$. $$ TPR = \frac{6}{6} \quad FPR = \frac{6}{6}$$

As the threshold is moved to push $FPR$ up to 1, all the circles are classified as positive, so $TPR$ also increases to 1.
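The same effect can be reproduced with a small Python sketch. The scores and labels below are hypothetical, chosen only so that the three threshold positions give the counts listed above:

```python
import numpy as np

# Hypothetical classifier scores (higher = more likely positive) and true labels
# (1 = positive / white circle, 0 = negative / black circle); 6 of each class.
scores = np.array([0.95, 0.90, 0.85, 0.80, 0.70, 0.65, 0.60, 0.55, 0.50, 0.45, 0.30, 0.20])
labels = np.array([1,    1,    1,    0,    1,    0,    1,    0,    1,    0,    0,    0])

for threshold in [0.75, 0.40, 0.10]:
    pred = (scores >= threshold).astype(int)   # everything above the threshold is predicted positive
    tp = np.sum((pred == 1) & (labels == 1))
    fn = np.sum((pred == 0) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    tn = np.sum((pred == 0) & (labels == 0))
    print(f"threshold={threshold:.2f}: TPR={tp/(tp+fn):.2f}, FPR={fp/(fp+tn):.2f}")
```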


k-fold Cross Validation

k-fold cross validation can be used to tune hyperparameters, especially when the dataset is small. Suppose $ k = 5$: the dataset is divided into 5 folds, $f1, f2, f3, f4$ and $f5$. For each candidate hyperparameter value, 5 models are trained. For Model 1, $f1$ is the validation set and $f2+f3+f4+f5$ is the training set; for Model 2, $f2$ is the validation set and $f1+f3+f4+f5$ is the training set; and so on. The validation scores of the 5 models are averaged, and the hyperparameter value with the best average score is chosen.
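As a sketch of how this looks in code, the example below uses scikit-learn's cross_val_score with a ridge regression model; the dataset and the candidate alpha values are placeholders:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Placeholder regression data; in practice this is your (small) dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)

# Candidate hyperparameter values (regularisation strength for ridge regression)
for alpha in [0.01, 0.1, 1.0, 10.0]:
    # cv=5: each fold serves once as the validation set, the rest as training data
    scores = cross_val_score(Ridge(alpha=alpha), X, y,
                             cv=5, scoring="neg_mean_squared_error")
    print(f"alpha={alpha}: mean validation MSE = {-scores.mean():.4f}")
```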


Do you recall Recall?

Consider the problem of searching for documents in a library. What are recall and precision in this context?

Recall = Number of relevant documents retrieved : Total number of relevant documents in the library

Precision = Number of relevant documents retrieved : Total number of documents retrieved
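A tiny worked example with made-up numbers makes the two ratios concrete:

```python
# Hypothetical search: the library holds 20 relevant documents,
# the query returns 10 documents, 8 of which are relevant.
relevant_in_library = 20
retrieved = 10
relevant_retrieved = 8

recall = relevant_retrieved / relevant_in_library   # 8 / 20 = 0.4
precision = relevant_retrieved / retrieved          # 8 / 10 = 0.8
print(f"recall = {recall}, precision = {precision}")
```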


References

  1. https://towardsdatascience.com/20-popular-machine-learning-metrics-part-1-classification-regression-evaluation-metrics-1ca3e282a2ce
  2. https://www.dataschool.io/roc-curves-and-auc-explained/
  3. http://fourier.eng.hmc.edu/e161/lectures/classification/node5.html
  4. Sample ROC curve taken from A. C. Müller and S. Guido, 'Introduction to Machine Learning with Python'.