Model Assessment


Lecture note for lecture 17 - 28/10/2020

Konthalapalli Hradini


Performance metrics are used to determine whether an algorithm has performed well. In general, an algorithm has done well if its accuracy on the test data is high, which implies that it generalises well.


Regression

The main performance metric for regression is the error on the test data, typically the Mean Squared Error ($MSE$) or the Mean Absolute Error ($MAE$). Suppose the predicted output of a regression model is $ \hat{y_{i}} $ and the actual output is $ y_{i} $.

Mean Squared Error $$ MSE = \frac{1}{N} \sum^N_{i=1} (y_{i} - \hat{y_{i}})^{2}$$

Mean Absolute Error $$ MAE = \frac{1}{N} \sum^N_{i=1} \vert y_{i} - \hat{y_{i}} \vert $$

$MSE$ gives more weight to outliers than $MAE$ because the difference between the predicted and actual output is squared. A high error on the test data means the model generalises poorly, for example because it has overfit the training data (which can be addressed with regularisation or more data) or because it is too simple to capture the underlying pattern (underfitting).
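As a quick sketch of these two formulas, here is a small Python example with made-up predictions (NumPy is assumed to be available):

```python
import numpy as np

# Hypothetical actual and predicted outputs for a small test set
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

# Mean Squared Error: average of squared residuals (penalises outliers more)
mse = np.mean((y_true - y_pred) ** 2)

# Mean Absolute Error: average of absolute residuals
mae = np.mean(np.abs(y_true - y_pred))

print(f"MSE = {mse:.3f}, MAE = {mae:.3f}")
```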


Classification

For any classification problem, we can create a confusion matrix to assess the performance of the model. In a confusion matrix, rows correspond to the ground truth (actual classes) and columns correspond to the predicted classes. Here is a confusion matrix for a Spam - Not Spam classification model.


Confusion Matrix

                          Predicted: Spam           Predicted: Not Spam
Ground truth: Spam        True Positive ($TP$)      False Negative ($FN$)
Ground truth: Not Spam    False Positive ($FP$)     True Negative ($TN$)

From the confusion matrix, we can calculate precision, recall and accuracy.

Precision tells us the proportion of actual spam emails among all the emails classified as spam. $$ Precision =\frac{TP}{TP + FP} $$

Recall tells us the proportion of actual spam emails that were classified as spam among all the actual spam emails. Recall is also called the True Positive Rate ($TPR$). $$ Recall =\frac{TP}{TP + FN} $$

Accuracy tells us the proportion of correctly classified emails among all the classified emails. $$ Accuracy =\frac{TP + TN}{TP + TN + FP + FN} $$ In some cases, we might require a higher accuracy for one class than for the others, or a false negative may be much worse than a false positive (as in disease diagnosis). Such cost-sensitive accuracy can be achieved by assigning different weights to different misclassifications.

Ideally, we would want both precision and recall to be high, but there is usually a trade-off between them. Suppose we want to increase recall: we need to decrease false negatives, which typically means predicting the positive class more often (for example, by lowering the decision threshold). This increases false positives, so precision decreases. We therefore need to decide which metric is more important for the problem at hand. In spam detection, precision is more important than recall (we do not want to lose important emails to the spam filter). In airport security, recall is more important than precision (it is acceptable to manually check a few extra bags rather than miss a threat).
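The three formulas can be sanity-checked with a small Python sketch; the confusion-matrix counts below are made up for illustration:

```python
# Hypothetical confusion-matrix counts for a spam classifier
TP, FN = 40, 10    # actual spam: correctly caught vs. missed
FP, TN = 5, 945    # actual non-spam: wrongly flagged vs. correctly passed

precision = TP / (TP + FP)                    # of emails flagged as spam, how many really are
recall    = TP / (TP + FN)                    # of actual spam, how much was caught (TPR)
accuracy  = (TP + TN) / (TP + TN + FP + FN)   # overall fraction classified correctly

print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"accuracy  = {accuracy:.3f}")
```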

The Receiver Operating Characteristic (ROC) curve is a plot of the True Positive Rate (Recall) against the False Positive Rate ($FPR$) for varying threshold values. As the threshold is varied, the number of misclassified labels changes. The red diagonal line corresponds to random guessing (a 50% chance); if the ROC curve goes below this diagonal, the model is worse than random chance. $$ TPR = \frac{TP}{TP + FN} \quad FPR = \frac{FP}{FP +TN} $$

Sample ROC curve

The best classification will have $FN = 0$ and $FP = 0$. This means that $TPR = 1$ and $FPR = 0$.

The Area Under the ROC Curve (AUC) is equal to one for the best classifier. For $AUC = 1$, the point $(FPR, TPR) = (0, 1)$ must lie on the curve, i.e. $TPR = 1$ and $FPR = 0$ for the best classification.
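As a sketch of how AUC is obtained in practice, the area can be approximated numerically from a set of $(FPR, TPR)$ points using the trapezoidal rule; the points below are hypothetical:

```python
import numpy as np

# Hypothetical (FPR, TPR) points on an ROC curve, sorted by increasing FPR;
# (0, 0) and (1, 1) are always part of the curve.
fpr = np.array([0.0, 0.1, 0.4, 1.0])
tpr = np.array([0.0, 0.6, 0.9, 1.0])

# Trapezoidal approximation of the area under the curve
auc = np.trapz(tpr, fpr)
print(f"AUC = {auc:.3f}")

# A perfect classifier passes through (FPR, TPR) = (0, 1), giving AUC = 1
perfect_auc = np.trapz([0.0, 1.0, 1.0], [0.0, 0.0, 1.0])
print(f"perfect classifier AUC = {perfect_auc:.1f}")
```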


If $FPR$ is 1, why is $TPR$ also 1?

Different thresholds

In this diagram, the black circles represent the negative class and the white circles the positive class. The vertical lines represent three different thresholds: everything to the right of a line is predicted positive, and everything to the left is predicted negative.

Consider threshold 1: $ TP = 3$, $TN =5$, $FP =1$ and $FN = 3$. $$ TPR = \frac{3}{6} \quad FPR = \frac{1}{6}$$

Consider threshold 2: $ TP = 6$, $TN =2$, $FP =4$ and $FN = 0$. $$ TPR = \frac{6}{6} \quad FPR = \frac{4}{6}$$

Consider threshold 3: $ TP = 6$, $TN =0$, $FP =6$ and $FN = 0$. $$ TPR = \frac{6}{6} \quad FPR = \frac{6}{6}$$

As the threshold is moved to push $FPR$ up to 1, all the circles are classified as positive, so $TPR$ also increases to 1.
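The same effect can be reproduced with a small Python sketch. The scores and labels below are hypothetical, chosen only so that the three threshold positions give the counts listed above:

```python
import numpy as np

# Hypothetical classifier scores (higher = more likely positive) and true labels
# (1 = positive / white circle, 0 = negative / black circle); 6 of each class.
scores = np.array([0.95, 0.90, 0.85, 0.80, 0.70, 0.65, 0.60, 0.55, 0.50, 0.45, 0.30, 0.20])
labels = np.array([1,    1,    1,    0,    1,    0,    1,    0,    1,    0,    0,    0])

for threshold in [0.75, 0.40, 0.10]:
    pred = (scores >= threshold).astype(int)   # everything above the threshold is predicted positive
    tp = np.sum((pred == 1) & (labels == 1))
    fn = np.sum((pred == 0) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    tn = np.sum((pred == 0) & (labels == 0))
    print(f"threshold={threshold:.2f}: TPR={tp/(tp+fn):.2f}, FPR={fp/(fp+tn):.2f}")
```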


k-fold Cross Validation

k-fold cross validation can be used to tune hyperparameters, especially when the dataset is small. Suppose $ k = 5$: the dataset is divided into 5 folds, $f1, f2, f3, f4$ and $f5$. For each candidate hyperparameter value, 5 models are trained. For Model 1, $f1$ is the validation set and $f2+f3+f4+f5$ is the training set; for Model 2, $f2$ is the validation set and $f1+f3+f4+f5$ is the training set; and so on. The validation scores of the 5 models are averaged, and the hyperparameter value with the best average score is chosen.
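As a sketch of how this looks in code, the example below uses scikit-learn's cross_val_score with a ridge regression model; the dataset and the candidate alpha values are placeholders:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Placeholder regression data; in practice this is your (small) dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)

# Candidate hyperparameter values (regularisation strength for ridge regression)
for alpha in [0.01, 0.1, 1.0, 10.0]:
    # cv=5: each fold serves once as the validation set, the rest as training data
    scores = cross_val_score(Ridge(alpha=alpha), X, y,
                             cv=5, scoring="neg_mean_squared_error")
    print(f"alpha={alpha}: mean validation MSE = {-scores.mean():.4f}")
```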


Do you recall Recall?

Consider the problem of searching for documents in a library. What are recall and precision in this context?

Recall = Number of relevant documents retrieved : Total number of relevant documents in the library

Precision = Number of relevant documents retrieved : Total number of documents retrieved
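A tiny worked example with made-up numbers makes the two ratios concrete:

```python
# Hypothetical search: the library holds 20 relevant documents,
# the query returns 10 documents, 8 of which are relevant.
relevant_in_library = 20
retrieved = 10
relevant_retrieved = 8

recall = relevant_retrieved / relevant_in_library   # 8 / 20 = 0.4
precision = relevant_retrieved / retrieved          # 8 / 10 = 0.8
print(f"recall = {recall}, precision = {precision}")
```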


References

  1. https://towardsdatascience.com/20-popular-machine-learning-metrics-part-1-classification-regression-evaluation-metrics-1ca3e282a2ce
  2. https://www.dataschool.io/roc-curves-and-auc-explained/
  3. http://fourier.eng.hmc.edu/e161/lectures/classification/node5.html
  4. Sample ROC curve taken from A. C. Müller and S. Guido, 'Introduction to Machine Learning with Python'.