Lecture 15: Bias - Variance Tradeoff and Regularisation

Lecture date : 27 October, 2020 - Karan


Feature Engineering

Feature engineering[1] is used to modify data so as to extract the most information from the dataset efficiently, and it can improve the performance of the algorithm on unseen testing examples.
This includes preprocessing and transforming raw data into more sensible and usable features.

Missing Features

Augmenting incomplete or missing data is one type of feature engineering.

Index    A       B       C
1        0.25    0.73    0.12
2        0.19    0.48    0.34
3        0.72    -       0.21
4        0.91    -       -
5        0.62    0.51    -

When feature values are missing from the training data, there are several ways of completing it:

  1. Remove the incomplete data - removing the incomplete examples eliminates the problem entirely, but this may not be feasible if the amount of available data is already small.
  2. Data imputation - estimate the missing values using, for example, the mean or the median of the feature. E.g. for the 5th example above, the missing C value could be filled with 0.22 (see the sketch after this list).
  3. Train a model on the complete data points and use it to predict the missing feature values.
  4. Complete with a neutral value from the data range. If the data lies in the range [-1, 1], the value 0 could be used for all the missing features. This can avoid inducing a bias during training (assuming that the rest of the data is distributed roughly evenly around this neutral value). E.g. 0.5 could be used for the table above, as the data range is [0, 1].
  5. Complete with an extreme value that lies entirely outside the range of the data, so that examples with missing features are treated differently. E.g. 2 could be used for the table above, as all the data lies in [0, 1].
  6. Along with assigning approximate values, we can also add an extra feature to every example indicating whether the original data point was complete. This gives the algorithm an additional signal with which to compensate for the missing data (also shown in the sketch below).
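
As a concrete illustration, here is a minimal sketch of options 2 and 6 using pandas, assuming the table above is stored as a DataFrame; the column names and the choice of mean imputation are illustrative rather than prescribed.

    import numpy as np
    import pandas as pd

    # The toy table from above, with np.nan marking the missing entries.
    df = pd.DataFrame({
        "A": [0.25, 0.19, 0.72, 0.91, 0.62],
        "B": [0.73, 0.48, np.nan, np.nan, 0.51],
        "C": [0.12, 0.34, 0.21, np.nan, np.nan],
    })

    # Option 6: indicator features recording which entries were originally missing.
    missing_flags = df.isna().astype(int).add_suffix("_missing")

    # Option 2: mean imputation; e.g. C for the 5th example becomes
    # (0.12 + 0.34 + 0.21) / 3, which is approximately 0.22.
    imputed = df.fillna(df.mean())

    df_full = pd.concat([imputed, missing_flags], axis=1)
    print(df_full)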

Dataset Parts

The available data is usually divided into three parts: a training set used to fit the model, a validation set used to tune hyperparameters and compare models, and a testing set used only for the final evaluation on unseen examples.

Usually, as a rule of thumb, we can use an (80, 10, 10)% distribution for these sets; if we have more data, we can shift the proportions towards training, e.g. a (90, 5, 5)% distribution, since the validation and testing sets remain large enough in absolute size.
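
As an example, here is a minimal sketch of such a split using scikit-learn; the array names, the random dataset and the exact 80/10/10 ratios follow the rule of thumb above and are not fixed requirements.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Hypothetical dataset: 1000 examples with 5 features each.
    X = np.random.rand(1000, 5)
    y = np.random.rand(1000)

    # First hold out 20% of the data, then split that portion in half,
    # giving an 80/10/10 training/validation/testing split overall.
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)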

Bias - Variance Tradeoff

The bias and the variance are two very important characteristics of a machine learning algorithm.
Bias - the set of assumptions that the model makes about the target function. For example, a model that assumes the target function is linear is making such an assumption; if the assumption is wrong, the result is a poor model that underfits the data.
Variance - how much a change in the training data affects the learned parameters of the model. High variance is usually undesirable, as it means the model has overfit the training data.
Often a high-bias model is very simple and has low variance, and similarly a model with high variance tends to have low bias. Since both high bias and high variance are undesirable, the goal is to come up with a model that balances and minimises the two[2].
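This trade-off can be made precise by the standard decomposition of the expected squared error of a learned predictor \(\hat{f}\) at a point \(x\), assuming the data is generated as \(y = f(x) + \varepsilon\) with noise variance \(\sigma^2\) and the expectation is taken over random training sets:
\[ \mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}} + \sigma^2 \]
Reducing one of the first two terms typically increases the other, while the noise term \(\sigma^2\) cannot be reduced by any model.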

Fixing high bias models

If a model has high bias it will underfit the data. To address this we can make the model more flexible, for example by adding more features, using a more complex model class, or reducing the amount of regularisation.

The same solutions apply to high-variance models, but in the opposite direction (making the model less flexible). Apart from this, there are a few specialised techniques that can help with high-variance models.

Regularisation

Regularisation is a technique that helps reduce variance by reducing the flexibility of the model. This allows the model to ignore noisy data more effectively and focus on the more important patterns. It works by adding an extra penalty term to the loss function, sometimes called a "shrinkage term". The shrinkage term can be built from any parameters of the model that we wish to penalise.
Now, let's consider the following linear model:
\[ \hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n \] with the corresponding loss function, the residual sum of squares:
\[ \mathrm{RSS} = \sum_{i = 1}^{N} (y_i - \hat{y}_i)^2 \] where \(\hat{y}_i\) is the model's prediction for the \(i\)-th training example. We can look at two standard forms of regularisation.

L2 - Ridge regularisation

After adding the shrinkage term the loss function becomes:
\[\mathrm{RSS} + \lambda\sum_{j = 0}^{n}\beta_j^2\]
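
As an aside, the ridge objective still has a closed-form minimiser; below is a minimal NumPy sketch, assuming the features are collected in a matrix X, the targets in a vector y, and, for simplicity, that every coefficient is penalised.

    import numpy as np

    def ridge_fit(X, y, lam):
        # Minimises RSS + lam * sum(beta_j^2); the solution is
        # beta = (X^T X + lam * I)^{-1} X^T y.
        n_features = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)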

L1 - Lasso regularisation

After adding the shrinkage term the loss function becomes:
\[\mathrm{RSS} + \lambda\sum_{j = 0}^{n}|\beta_j|\]
For larger and larger values of \(\lambda\) the algorithm is penalised more heavily for large parameter values, forcing them closer to 0 and thus reducing the overall flexibility of the model. Other regularisation methods include dropout, batch normalisation and early stopping. Between the two penalties above there are key differences in their effectiveness at reducing noise and in their interpretability: the L1 penalty tends to set some coefficients exactly to zero, which effectively performs feature selection and makes the resulting model easier to interpret[3].
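
To see this difference in practice, here is a small sketch comparing the two penalties with scikit-learn on synthetic data; the data-generating coefficients and the penalty strengths (alpha) are arbitrary choices for illustration.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    # Only the first three features influence the target; the rest are pure noise.
    true_beta = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
    y = X @ true_beta + rng.normal(scale=0.5, size=100)

    for name, model in [("Ridge (L2)", Ridge(alpha=1.0)), ("Lasso (L1)", Lasso(alpha=0.1))]:
        model.fit(X, y)
        print(name, np.round(model.coef_, 2))

The printed coefficients illustrate the point above: ridge shrinks every coefficient but keeps them non-zero, while lasso sets the irrelevant ones exactly to zero.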

References

[1] Discover Feature Engineering, machinelearningmastery.com
[2] Derumigny, Alexis; Schmidt-Hieber, Johannes. "On lower bounds for the bias-variance trade-off". arXiv. (link)
[3] Regularization in Machine Learning, towardsdatascience.com