Lecture 14- Feature Engineering (26/10/20)
How do we decide which model to use?
Why do some features make better predictions than others?
In this lecture and the next, we will address these questions. The answers will lead us to the subject of "feature engineering."
Choosing the Model
The algorithm or the model we choose depends on various factors. Some of them are listed as follows:
Accuracy- Complex models with high accuracy often require more system time and memory. In many instances, a simpler model that sacrifices some accuracy may be more practical.
Dataset- It is generally expected that the model will depend on the type and form of data at hand. For example, a model that works well with categorical data may not work very well with numerical ones.
Some models work very well with unstructured data while others don't.
In general, the choice of model also depends on the number of features and examples, i.e., the data itself. The distribution of the data also matters. For example, a model may work very well with data whose values lie in the domain [0,1] while underperforming when values fall far outside this domain.
Explainability- It is often a good idea to choose the model based on the audience. For example, a speaker demonstrating how machine learning works may prefer a simpler model for better explainability.
Speed- Some models work faster than others with specific types of data. Considering both training and testing speed, the user may prefer some models over others.
Memory- Some models load data into memory in fragments, while others load the entire dataset into memory. When handling large datasets, the user may prefer to avoid a model that loads all the data into memory at once.
Algorithm/model development is done by researchers and experts. A user is generally not equipped with the knowledge and expertise to customize the models to their needs.
Instead, the user can customize the data and feed it to the existing algorithms. This process of generating new sets of data and labels (i.e., generating new features) from existing raw data is called "feature engineering".
Feature Engineering
Feature engineering is the craft of creating new, improved input features from the existing raw data.
Broadly, there are two primary goals to consider:
Preparing the raw input data into a dataset that is most compatible with the algorithm and offers better predictability.
Improving the overall performance of the machine learning model
There are four major steps involved in feature engineering which will be discussed below. There can be many different types and techniques for each category. In this lecture, we will briefly introduce some of them, without delving into technical details.
To list the steps:
Baseline Model
Categorical Encoding
Feature Generation
Feature Selection
Baseline Model
We start with a baseline model and see how well it fares. Feature engineering is all about improving the data features to extract the best predictability with sufficiently good speed.
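As a minimal sketch (assuming scikit-learn is available and using a synthetic dataset as a stand-in for the actual data), a baseline could be a plain logistic regression fit on the raw, untouched features:

```python
# A minimal baseline: logistic regression on raw, untouched features.
# The synthetic dataset here is only a stand-in for real data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Baseline accuracy:", accuracy_score(y_valid, baseline.predict(X_valid)))
```

Every later feature-engineering step can then be judged against this baseline score.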
Categorical Encoding
There are many ways to encode categorical data for modelling.
One such method is one-hot encoding, where we convert categorical values to numerical values.
One-hot encoding
This method spreads the values in a column into multiple flag columns and assigns 0 or 1 to them. These binary values express the relationship between the grouped and encoded columns.
For example, a monochromatic photo can be considered as a list of pixel data, where each pixel has an accompanying value of 0 for no colour and 1 for colour.
For a coloured image, we can have multi-dimensional vectors representing multi-class colours Red, Green and Blue.
| Pixel_No | Colour | R | G | B |
|---|---|---|---|---|
| 000001 | Red | 1 | 0 | 0 |
| 000102 | Blue | 0 | 0 | 1 |
| 000803 | Green | 0 | 1 | 0 |
| 005004 | Red | 1 | 0 | 0 |
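As an illustrative sketch (the data below simply mirrors the table above), pandas can produce these flag columns with `get_dummies`:

```python
import pandas as pd

# Pixel data mirroring the table above.
pixels = pd.DataFrame({
    "Pixel_No": ["000001", "000102", "000803", "005004"],
    "Colour": ["Red", "Blue", "Green", "Red"],
})

# Spread the Colour column into one 0/1 flag column per colour value.
encoded = pd.get_dummies(pixels, columns=["Colour"], dtype=int)
print(encoded)
```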
Feature Generation
In this stage, we generate new features by combining and improving existing features.
Interaction
One of the easiest ways to create new features is by "interaction"- combining categorical variables.
The result is a new categorical feature that can capture information about correlations between the original categorical variables.
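As a small sketch (with made-up column names), two categorical columns can be concatenated into a single interaction feature:

```python
import pandas as pd

# Hypothetical categorical data.
df = pd.DataFrame({
    "country": ["IN", "IN", "US", "US"],
    "device": ["mobile", "desktop", "mobile", "desktop"],
})

# The interaction feature captures combinations such as "IN_mobile".
df["country_device"] = df["country"] + "_" + df["device"]
print(df)
```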
Binning
Binning can be applied to both categorical and numerical data. Binning makes the model more robust and prevents overfitting, but at the cost of performance. It is useful when we do not need the exact value of the data.
For example, if the end goal is to classify the weather as hot, warm, or cold, we may not be interested in the exact temperature with great precision. Binned data with the temperature in ranges of 5 degrees would suffice.
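A minimal sketch of this temperature example (the values are made up), using pandas to bin into 5-degree ranges:

```python
import pandas as pd

# Hypothetical temperatures in degrees Celsius.
temps = pd.Series([2.3, 7.8, 14.1, 19.6, 23.4, 31.0])

# Bin the temperatures into 5-degree-wide intervals from 0 to 35 degrees.
binned = pd.cut(temps, bins=range(0, 40, 5))
print(binned)
```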
The figure below is a representative image of the binning process with different bin widths.
Transforming Numerical features
Some models work better when the features are normally distributed, so it might help to transform the values. Common choices for this are the square root and natural logarithm. These transformations can also help constrain outliers.
We will briefly discuss this with an example. In particular, the logarithm transformation helps handle skewed data; after transformation, the distribution becomes closer to normal.
Example: Presented below is a binned data distribution, followed by the square root transformation and then the logarithm transformation. The data can be found in this link.
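As a sketch on made-up right-skewed data (not the linked dataset), numpy can apply both transformations:

```python
import numpy as np

# Hypothetical right-skewed values standing in for the linked data.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=3.0, sigma=1.0, size=1000)

sqrt_x = np.sqrt(x)    # square root transformation
log_x = np.log1p(x)    # log(1 + x); safe even when a value is zero

# The spread shrinks sharply after each transformation,
# showing how the long right tail has been pulled in.
print(x.std(), sqrt_x.std(), log_x.std())
```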
Two popular methods of data scaling are normalization and standardization. We will briefly discuss them.
Normalization
Normalization (or min-max normalization) scales all values to a fixed range between 0 and 1. This transformation does not change the shape of the feature's distribution, but because the standard deviation decreases, the effect of outliers increases. Therefore, it is recommended to handle outliers before normalization.
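A minimal sketch of min-max normalization, $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$, on toy values with one deliberate outlier (assuming scikit-learn is available):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature with one outlier (100.0).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# x' = (x - min) / (max - min); the outlier squeezes the other values near 0.
scaled = MinMaxScaler().fit_transform(x)
print(scaled.ravel())
```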
Standardization
Standardization (or z-score normalization) scales the values while taking the standard deviation into account. If features have different standard deviations, their ranges will also differ from each other. Standardization reduces the effect of outliers in the features; a representative image is shown below.
The transformation is $z = \frac{x - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the mean and standard deviation, respectively.
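A corresponding sketch with scikit-learn's `StandardScaler` on the same toy values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The same toy feature with one outlier (100.0).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# z = (x - mean) / std, computed per feature.
z = StandardScaler().fit_transform(x)
print(z.ravel())
```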
Feature Selection
At this step, we choose the features that give the best predictability at a good speed; a short selection sketch follows the points below.
There are two points that we have to consider here:
More features might lead to overfitting on the training and validation sets. This may cause the model to generalize poorly to new data.
More features mean a longer time to train and to optimize the hyperparameters. Fewer features can speed up training and testing, but possibly at the cost of predictive performance.
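As a sketch (using univariate selection from scikit-learn on a synthetic dataset), one simple approach is to keep only the k most informative features:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, of which only 4 carry real signal.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

# Keep the 4 features with the highest univariate F-scores.
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)
print("Selected feature indices:", selector.get_support(indices=True))
```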
What's next?
In the next lecture, we will discuss data imputation and regularization, which will conclude the lesson on feature engineering.
A partial implementation is available.