Feature engineering[1] is the process of modifying data to extract the most information from the dataset efficiently. Good feature engineering can improve the performance of the algorithm on unseen testing examples.
It includes preprocessing and transforming raw data into more sensible, usable features.
Completing (imputing) incomplete or missing data is one type of feature engineering, as in the example below.
| Index | A | B | C |
|---|---|---|---|
| 1 | 0.25 | 0.73 | 0.12 |
| 2 | 0.19 | 0.48 | 0.34 |
| 3 | 0.72 | - | 0.21 |
| 4 | 0.91 | - | - |
| 5 | 0.62 | 0.51 | - |
When we are missing feature values in the training data, as in columns B and C above, there are several ways of completing (imputing) this data, for example by replacing each missing value with the mean or median of its column.
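As a minimal sketch of mean imputation, assuming pandas and scikit-learn are available (the DataFrame below simply mirrors the example table):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# The example table above, with missing values encoded as NaN.
df = pd.DataFrame({
    "A": [0.25, 0.19, 0.72, 0.91, 0.62],
    "B": [0.73, 0.48, None, None, 0.51],
    "C": [0.12, 0.34, 0.21, None, None],
})

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_filled)
```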
When splitting data into training, validation, and test sets, a common rule of thumb is an (80, 10, 10)% split; if we have more data we can move to a (90, 5, 5)% split, since the smaller validation and test fractions will still contain enough examples in absolute terms.
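As a sketch, such a split can be produced with two calls to scikit-learn's train_test_split (the data here is synthetic and the 80/10/10 proportions are just the rule of thumb above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real feature matrix and targets.
X = np.random.rand(1000, 5)
y = np.random.rand(1000)

# First hold out 20% of the data, then split that hold-out set in half,
# giving an 80/10/10 train/validation/test split overall.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```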
The bias and the variance are two very important characteristics of a machine learning algorithm.
Bias describes the assumptions that the model makes about the target function. For example, a model that assumes the target function is linear is making such an assumption; if this assumption is incorrect it will lead to a poor model that underfits the data.
Variance describes how much a change in the training data affects the parameters of the model. High variance is usually undesirable, as it means that the model has overfit to the training data.
Often a high bias model will be very simple and will have a low variance; similarly, if a model has high variance its bias will usually be low.
Since both high bias and high variance are undesirable qualities, the goal is to come up with a model that keeps both as low as possible[2].
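As a rough illustration (not from the original notes), the sketch below fits a degree-1 and a degree-12 polynomial to the same noisy data, assuming numpy is available; the low-degree fit shows the high-bias/underfitting behaviour and the high-degree fit shows the high-variance/overfitting behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a smooth underlying function.
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 200)

for degree in (1, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # Degree 1 underfits (high bias): both errors stay high.
    # Degree 12 overfits (high variance): training error is much lower than test error.
    print(f"degree={degree}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")
```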
If a model has a high bias it will underfit the data; to address this we can increase the complexity of the model, add more or better features, or reduce the amount of regularisation.
The same solutions apply for high variance models but in the opposite direction. Apart from this, there are a few specialised techniques that can help with high variance models.
Regularisation is a technique that helps reduce variance by reducing the flexibility of the model. This allows the model to more effectively ignore noisy data and
focus on the more important patterns. It works by adding an extra penalty term, sometimes called a "shrinkage term", to the loss function. The shrinkage term can be composed of any parameters
that we wish to penalise in the model.
Now, let's consider the following model.
\[ Y=\beta_0+\beta_1x_1+\beta_2x_2 + \dots +\beta_nx_n\]
with the corresponding loss function:
\[\mathrm{RSS} = \sum_{i = 1}^{N} (y_i - \hat{y}_i)^2\]
where \(\hat{y}_i\) is the model's prediction \(Y\) evaluated on the \(i\)-th example.
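As a small worked sketch (the data and coefficients are illustrative, not from the notes), the RSS can be computed directly from the predictions of such a linear model:

```python
import numpy as np

# Toy data: 5 examples, 2 features, plus illustrative coefficients.
X = np.array([[0.25, 0.73], [0.19, 0.48], [0.72, 0.10], [0.91, 0.35], [0.62, 0.51]])
y = np.array([1.2, 0.9, 1.1, 1.6, 1.4])
beta_0, beta = 0.5, np.array([0.8, 0.6])

# Predictions Y = beta_0 + beta_1 * x_1 + beta_2 * x_2 for each example.
y_hat = beta_0 + X @ beta

# Residual sum of squares: sum over examples of (y_i - y_hat_i)^2.
rss = np.sum((y - y_hat) ** 2)
print(rss)
```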
We can look at two standard forms of regularisation.
For ridge regression (L2 regularisation), after adding the shrinkage term the loss function becomes:
\[\mathrm{RSS} + \lambda\sum_{j = 1}^{n}\beta_j^2\]
For lasso regression (L1 regularisation), after adding the shrinkage term the loss function becomes:
\[\mathrm{RSS} + \lambda\sum_{j = 1}^{n}|\beta_j|\]
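As a sketch of how the two penalties modify the loss (function and variable names here are illustrative):

```python
import numpy as np

def rss(y, y_hat):
    # Residual sum of squares.
    return np.sum((y - y_hat) ** 2)

def ridge_loss(y, y_hat, beta, lam):
    # RSS plus the L2 shrinkage term: lambda * sum of squared coefficients.
    return rss(y, y_hat) + lam * np.sum(beta ** 2)

def lasso_loss(y, y_hat, beta, lam):
    # RSS plus the L1 shrinkage term: lambda * sum of absolute coefficients.
    return rss(y, y_hat) + lam * np.sum(np.abs(beta))

# Example usage with arbitrary values.
y = np.array([1.2, 0.9, 1.1])
y_hat = np.array([1.0, 1.0, 1.0])
beta = np.array([0.8, -0.6])
print(ridge_loss(y, y_hat, beta, lam=0.1), lasso_loss(y, y_hat, beta, lam=0.1))
```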
For larger and larger values of \(\lambda\) the algorithm is penalised more heavily for large parameter values, forcing them closer to 0 and thus reducing the overall
flexibility of the model.
There are key differences between the two regularisations in their effectiveness at reducing noise and in their interpretability[2]: the L1 penalty can shrink some coefficients exactly to zero, effectively performing feature selection and yielding sparser, more interpretable models, whereas the L2 penalty only shrinks coefficients towards zero without eliminating them. Other techniques that help with high variance include dropout, batch normalisation and early stopping.
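For instance, a quick comparison with scikit-learn's Ridge and Lasso estimators (a sketch with synthetic data; the exact numbers are not from the notes) shows the lasso driving coefficients exactly to zero while ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)

# Synthetic data: only the first 3 of 10 features actually matter.
X = rng.normal(size=(200, 10))
true_beta = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_beta + rng.normal(0, 0.5, 200)

for lam in (0.1, 1.0, 10.0):
    ridge = Ridge(alpha=lam).fit(X, y)
    lasso = Lasso(alpha=lam).fit(X, y)
    # Lasso sets many coefficients exactly to zero; ridge only shrinks them.
    print(f"lambda={lam}: ridge non-zero={np.sum(ridge.coef_ != 0)}, "
          f"lasso non-zero={np.sum(lasso.coef_ != 0)}")
```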
[1] "Discover Feature Engineering", machinelearningmastery.com.
[2] Derumigny, Alexis; Schmidt-Hieber, Johannes. "On lower bounds for the bias-variance trade-off". arXiv.
[3] "Regularization in Machine Learning", towardsdatascience.com.