Feature Engineering: Introduction
We know that unnecessary complexity in models has adverse effects: it requires more data to avoid overfitting, and it increases computational cost. One source of complexity, measured by the number of parameters in a model, is the number of features. Additionally, features that carry no information about the target variable can only add noise to the model. Consequently, it makes sense to work with the smallest number of the most informative features.
There are four common methods that act on this conclusion, seeking both to reduce the number of features used and to restrict them to those that are informative. Ordered roughly by degree of sophistication and power, these are:
- Expert/Domain knowledge
- Individual/Pairwise statistical tests
- Feature transformations
- Subset selection
The use of expert/domain knowledge is self-explanatory. We can ask experts in the system being modeled which of our features they believe are useful and important for the task at hand. This is something that should always be done, as having an understanding of the system being modeled is an unambiguous good. It should be remembered, though, that experts can be wrong, and that we typically seek a data-driven method to model a system precisely because experts cannot adequately model it from domain knowledge.
Individual and pairwise statistical tests seek to evaluate the information contained in individual prospective features. We can then exclude those features that appear (relatively) uninformative. In supervised learning problems, this is typically done by means of pairwise statistical testing. In such tests, some statistical measure of each feature’s individual value in estimating the target variable is computed. Such measures include, for example, correlation and mutual information. In unsupervised learning, statistics of individual potential features, such as variance, could be used for the same task, and so we include this possibility here for completeness. In practice, this is not common, except in the most degenerate of cases (there is no value in including a constant, for example).
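As a rough illustration of a pairwise test, the sketch below scores each feature by the absolute value of its Pearson correlation with the target and keeps those above a threshold. The function names, the toy data and the threshold of 0.5 are all assumptions made for the example, not values taken from the text.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant carries no information about the target
    return sxy / math.sqrt(sxx * syy)

def filter_by_correlation(features, target, threshold):
    """Return the names of features whose |correlation| with the target meets the threshold."""
    return [name for name, values in features.items()
            if abs(pearson(values, target)) >= threshold]

# Hypothetical toy data: 'signal' tracks the target, 'noise' does not,
# and 'constant' is the degenerate case mentioned above.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = {
    "signal": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],
    "noise": [3.0, 1.0, 2.0, 2.0, 1.0, 3.0],
    "constant": [7.0] * 6,
}
kept = filter_by_correlation(features, target, threshold=0.5)
print(kept)
```

The threshold is a judgment call; in practice it would be chosen with the problem and the measure in mind.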
As well as the limitations of the statistical measurements chosen, a problem with pairwise statistical testing is that some features may provide little information about the target variable when considered individually, but contain large amounts of information when combined with other features.
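A standard toy case of this problem is a target that is the exclusive-or of two binary features. The sketch below, which uses a hand-written mutual information estimate for discrete data (the function name and the data are assumptions for illustration), shows each feature carrying no information on its own while the pair determines the target completely.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information, in bits, between two discrete sequences of equal length."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # Sum of p(x, y) * log2(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum((c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
               for (x, y), c in count_xy.items())

# Hypothetical data: the target is x1 XOR x2.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(x1, x2)]

mi_x1 = mutual_information(x1, y)                    # feature 1 alone
mi_x2 = mutual_information(x2, y)                    # feature 2 alone
mi_joint = mutual_information(list(zip(x1, x2)), y)  # both features together
print(mi_x1, mi_x2, mi_joint)
```

A filter that tested x1 and x2 separately would discard both, even though together they predict the target perfectly.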
Jumping to the last alternative, for simple models we can employ subset selection techniques, whereby models are generated from subsets of the features and evaluated as per regular model evaluation. Exhaustive evaluation of subsets becomes infeasible with more than approximately 30 features (30 features already give over a billion subsets), but greedy algorithms, such as forward-backward subset selection, can be used with far larger feature sets. Since such an approach still requires training a number of models that grows with the number of features (one candidate model per remaining feature at every step of the search), it is impractical for sophisticated model types that take considerable time to train.
The remaining alternative, feature transformation, is the most interesting. Feature transformation techniques encode the information contained in the entire set of prospective features in some new set of variables. They can be divided into transformations that reduce the dimensionality of the features and those that do not. The former group, known as dimensionality reduction techniques, seek to encode the information contained in the entire set of prospective features in some smaller set of variables (of lower dimensionality, hence the name) and are very common when working with large numbers of features and complex models. Typically such an encoding will lose some information, but the hope is that this loss will be outweighed by the benefits stemming from the reduction in complexity in the resulting models and the likely significant reduction in noise in the resulting features. In subsequent steps we will look at a number of common dimensionality reduction techniques and discuss their applicability to different problems and data sets.
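The greedy idea can be sketched as follows. This is forward selection only (the backward, feature-removing step is omitted for brevity), and `score` stands in for whatever model evaluation is used: it is assumed to be a caller-supplied function where higher is better. The toy scoring function and feature names are hypothetical.

```python
def forward_selection(feature_names, score):
    """Greedily add the feature that most improves score(subset); stop when none helps.

    Returns the selected features and the number of subsets evaluated.
    """
    selected, remaining = [], list(feature_names)
    best_score = float("-inf")
    evaluations = 0
    while remaining:
        # Evaluate one candidate subset per remaining feature.
        candidates = [(score(selected + [f]), f) for f in remaining]
        evaluations += len(candidates)
        candidate_score, candidate = max(candidates)
        if candidate_score <= best_score:
            break  # no single addition improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, evaluations

# Hypothetical stand-in for "train a model and evaluate it": features 'a' and 'b'
# help, the others do not, and every feature incurs a small complexity penalty.
GAIN = {"a": 0.5, "b": 0.3}

def toy_score(subset):
    return sum(GAIN.get(f, 0.0) for f in subset) - 0.01 * len(subset)

selected, evaluations = forward_selection(["a", "b", "c", "d"], toy_score)
print(selected, evaluations)

# Number of subsets to evaluate with 30 features.
exhaustive_subsets = 2 ** 30 - 1    # every non-empty subset
greedy_worst_case = 30 * 31 // 2    # 30 + 29 + ... + 1 candidates
print(exhaustive_subsets, greedy_worst_case)
```

With a real model, each call to `score` is a full training and evaluation run, which is why even the greedy count becomes expensive for slow model types.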
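To make the idea of a lossy encoding concrete, here is a minimal sketch that compresses two features into one by projecting onto the direction of greatest variance. This is principal component analysis restricted to two dimensions, chosen here purely as an illustration (the text above does not name a specific technique), and the data points are made up.

```python
import math

def first_principal_component(points):
    """Project 2-D points onto their direction of greatest variance.

    Returns the unit direction, the 1-D encoded values, and the fraction
    of total variance that the encoding retains.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    # Closed-form angle of the leading eigenvector of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    encoded = [x * direction[0] + y * direction[1] for x, y in centered]
    retained = (sum(v * v for v in encoded) / n) / (sxx + syy)
    return direction, encoded, retained

# Hypothetical data lying close to the line y = 2x.
points = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.2), (3.0, 5.8)]
direction, encoded, retained = first_principal_component(points)
print(direction, retained)
```

The `retained` fraction is slightly below 1: a little information is lost, in exchange for halving the number of features.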
Scaling and centering
Examples of feature transformation methods that do not reduce the dimensionality of the feature set are scaling and centering. Centering simply subtracts the mean of each real-valued feature from each of its values, and scaling divides such a feature’s values by its standard deviation. Scaling is typically a good idea, and can be considered simply good housekeeping. It places each feature on an equal footing, removing any dissimilar treatment of the features by algorithms that may result from what is, after all, an arbitrary choice of unit. Centering too is often good practice, though sometimes undesirable.
Below are graphs of two variables, income and education, where the first is on a scale of 0-100 and the second on a scale of approximately 8-15. You can inspect how scaling and centering change these variables. Note that ‘aspect adjustment’ refers to the graph axis (not the variable) being scaled to provide equal size.
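Centering and scaling as described can be sketched in a few lines. The income and education values below are invented to match the stated ranges, and the use of the population standard deviation (rather than the sample one) is an assumption; either is common.

```python
import statistics

def center(values):
    """Subtract the feature's mean from each value."""
    mean = statistics.fmean(values)
    return [v - mean for v in values]

def scale(values):
    """Divide each value by the feature's standard deviation (population SD assumed)."""
    sd = statistics.pstdev(values)
    return [v / sd for v in values]  # a constant feature (sd == 0) must be handled separately

# Hypothetical values on the scales described above.
income = [10.0, 35.0, 50.0, 80.0, 95.0]
education = [8.0, 10.0, 12.0, 14.0, 15.0]

income_std = scale(center(income))
education_std = scale(center(education))

# After both steps each feature has mean 0 and standard deviation 1.
print(statistics.fmean(income_std), statistics.pstdev(income_std))
print(statistics.fmean(education_std), statistics.pstdev(education_std))
```

Once both features are on the same footing, an algorithm that depends on distances or magnitudes no longer favors income simply because its raw numbers are larger.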
A particular danger of dimensionality reduction techniques (and pairwise statistical tests) is unintended information leakage. This occurs when information about the validation and/or test data is given to the models during the training stage, thereby biasing the resulting estimates of the model’s performance on new data.
In supervised learning problems where validation techniques are used, leakage will occur if the encoding is performed on the entire labelled dataset rather than just the training data. Essentially, this causes the model to work with an encoding that it knows to be suitable for what was supposed to be new validation and test data. Accordingly, all feature transformations must be calculated solely using the training data, and then applied to the validation and test data.
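The rule can be sketched with the simplest transformation, standardization: the mean and standard deviation are estimated from the training data only, and the same fitted parameters are then applied to every split. The function names and data are hypothetical.

```python
import statistics

def fit_standardizer(train_values):
    """Estimate the transformation's parameters from training data only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)

def apply_standardizer(values, params):
    """Apply previously fitted parameters to any split."""
    mean, sd = params
    return [(v - mean) / sd for v in values]

# Hypothetical split of one feature.
train = [10.0, 20.0, 30.0, 40.0]
validation = [50.0, 60.0]

params = fit_standardizer(train)                    # correct: training data only
train_z = apply_standardizer(train, params)
validation_z = apply_standardizer(validation, params)

leaky_params = fit_standardizer(train + validation)  # leakage: validation data seen
print(params, leaky_params)
```

The validation values are not expected to have mean 0 after the transformation; that is a sign the encoding was not tuned to them.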
The same is true of pairwise statistical tests. They should be performed on the training data only, and the excluded features should then be removed from the training, validation and test data.
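A minimal sketch of this, assuming a correlation-based test and an arbitrary threshold of 0.5 (both choices, like the names and data, are illustrative): the features to keep are chosen from the training data, and that same choice is applied to the other splits.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def select_features(train_features, train_target, threshold):
    """Choose features using the training data only."""
    return [name for name, values in train_features.items()
            if abs(pearson(values, train_target)) >= threshold]

def keep_columns(features, names):
    """Restrict any split to a previously chosen list of features."""
    return {name: features[name] for name in names}

# Hypothetical training and test splits.
train = {"signal": [1.0, 2.0, 3.0, 4.0], "noise": [2.0, 1.0, 1.0, 2.0]}
train_target = [1.1, 2.0, 2.9, 4.2]
test = {"signal": [5.0, 6.0], "noise": [1.0, 2.0]}

selected = select_features(train, train_target, threshold=0.5)
train_kept = keep_columns(train, selected)
test_kept = keep_columns(test, selected)  # no test statistic was ever computed
print(selected)
```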
© Dr Michael Ashcroft