Want to keep learning?

This content is taken from the The Open University & Persontyle's online course, Advanced Machine Learning. Join the course to learn more.
3.6

Support Vector Machines

Support vector machines (SVMs) are the most well known of the various kernel methods. They most commonly used for classification tasks, and we will examine only classification SVMs here. We begin by looking at the idea of an optimally separating hyperplane, followed by the linear support vector classifier (SVC) before finaly looking at the extension of SVCs to the non-linear SVM via the utilization of kernels.

Optimally Separating Hyperplane: The Separable Case

Consider the training dataset for a classification problem given in the diagram below. There are two classes, represented by green circles and red triangles. These two classes are linearly separable, in that a line can separate the two classes without any error on the training data. In fact, there are infinitely many lines that would succeed in doing this. Three such lines appear in the diagram. A simple linear classifier like the Perceptron would treat any of these lines as equally good (since they all performly equally well, in fact perfectly, on the training data).

Consider the new data points in black. From a visual inspection of the class clusters, it is plausible to think that the new data points of type A should be classified as green, while those of type B should be classified as red. Of the three lines given, lines 1 and 2 would result in classifications of these new data points that conflict with this intuitive assessment, whereas line 3 would classify in accordance with it. Clearly, line 3 is better than lines 1 and 2, in that it accords better with the shape of the data in the two classes.

We would like to be able to find some way of discovering the best linear decision boundary in whatever sense it is that line 3 is better than the others. The approach that SVMs take is to choose the line that maximizes the margins between itself and the closest data points of any class. Such a decision boundary is known as the optimally separating hyperplane (remember a hyperplane is a line in two dimensions, a plane in three, etc). Note that such an optimally separating hyperplane will by necessity be equidistant to the closest members of each class. We see below that line 3 has clearly bigger margins.

Optimally Separating Hyperplane: The Inseparable Case

Defining an optimally separating hyperplane only in the separable case is of limited value. It is rare for classes to be linearly separable. It is even rarer for us to know apriori (and therefore be able to safely assume) that the classes are linearly separable. Accordingly, we look now at how to extend the definition to cover cases when the classes are linearly inseparable.

In the separable case, the optimal separating hyperplane is given by the equation:

$arg max_{\mathbf{\beta}} \big( M(\mathbf{\beta},\mathbf{X},\mathbf{Y}) \big)$

Where the function $$M$$ returns the margin of the hyperplane specified by the parameters $$\mathbf{\beta}$$, given the training data $$\mathbf{X}$$ and $$\mathbf{Y}$$.

We will still seek to maximize the margin associated with the linear decision boundary. However, we will now allow some training points to be on the wrong side of their class margin. To do this, we introduce a cost hyper-parameter, $$C$$, and a set of slack variables, $$\langle \xi_1, \xi_2, ... , \xi_N \rangle$$ (one for each datum in the training data). The slack variable associated with a datum is 0 if the datum is on the correct side of the margin associated with its class, otherwise it is the distance it is from the margin. We will term a point, $$x_i$$, normal if its associated slack variable, $$\xi_i$$, equal 0, and abnormal otherwise.

We now define the margin of a hyperplane to be the distance between the hyperplane and the closest normal point of the training data. We also introduce a second term to the equation the optimal separating hyperplane maximizes:

$arg max_{\mathbf{\beta}} \big( M(\mathbf{\beta},\mathbf{X},\mathbf{Y}) - C\sum_{i=1}^N\xi_i \big)$

This equation includes a penalty term for abnormal points, based on the cost hyper-parameter, $$C$$, and the slack variables. We wish to maximize the margin while minimizing the cost. The result is a trade off, as we can increase the margin by treating more and more points as abnormal, but in doing so we increase the penalty associated with the set of abnormal points. Clearly the cost hyper-parameter governs the relative importance of the two terms. A low cost will lead to a large margin with many abnormal points (that is to say, many training points falling on the wrong sides of their class margins), while a high cost will lead to a smaller margin with fewer abnormal points.

Although the math is complex (see the aside below), this turns out to be a convex optimization problem with a closed-form solution. Solving the derived optimization problem leads to a binary linear classifer whose decision boundary is the optimally separating hyperplane. This is known as the support vector classifier, as it turns out that the only input features that enter in to the equation are those that are on or over the class margins (these are the ‘support’ vectors). All other input features have associated coefficients in the optimization problem that are equal to 0 and hence drop out of the calculation. Support vector classifiers have a single hyper-parameter, $$C$$, governing the cost associated with training points being on the wrong side of their margin.

Most importantly, the input feature vectors occur in the optimization equation only in the form of inner-products. This means that we can apply the kernel trick to the support vector classifier so as to obtain a non-linear version without having to explicitly calculate the non-linear feature transformations!

Advanced Aside: Interested students can see how to calculate the optimal separating hyperplane (in both the separable case and inseparable cases) in the Calculating the optimal separating hyperplane document available in the downloads section at the end of this article.

Support Vector Machines

Applying the kernel trick to support vector classifiers leads to the high-performance binary classifier known as support vector Machines (SVMs). We look only at the most common type of SVM, known as C-SVMs, which follow directly from the preceeding analysis. Like the support vector classifier, C-SVMs have the cost hyper-parameter, C, that governs the cost associated with training points falling on the wrong side of their margin. Additionally, the choice of kernel amounts to a second hyper-parameter, and most kernels are themselves parameterized that need to be specified as hyper-parameters for the C-SVM algorithm. This leads to a reasonably large number of hyper-parameters and care is required in tuning these: SVMs can perform very well if their hyper-parameters are tuned carefully, but will generally perform poorly if this is not done.

The SVM implementation we look at in the coding videos allows the choice of four kernels: The linear, radial basis, polynomial and sigmoid kernels. Two of these we have seen in the step on kernel regression - these were the polynomial kernel and the radial basis kernel, though the second was there called the Gaussian kernel.

The other two kernels are both somewhat special. The linear kernel is infact no kernel at all: It simply uses the original inner products. So there is no implicit projection to a higher dimensional feature space. The result, of course, is a linear classifier - albeit a very sophisticated one, as it delivers the optimally separating hyperplane rather than just any old linear decision boundary.

The sigmoid kernel is parameterized by $$\alpha$$ and $$c$$ (not to be confused with the cost hyper-parameter $$C$$) with the equation form:

$k(X_1,X_2)=tanh(\alpha X_1^T X_2 + c) = \frac{e^{\alpha X_1^T X_2 + c} - e^{\alpha X_1^T X_2 + c}}{e^{\alpha X_1^T X_2 + c} + e^{\alpha X_1^T X_2 + c}}$

The sigmoid kernel is only conditionally positive definite, in that certain values of $$\alpha$$ and $$c$$ will lead to it not being positive definite, which means the mathematical analysis about positive definite kernels does not apply to it. Its (once existing) popularity stems from the fact that using a sigmoid kernel with a SVM has been proven to be equivalent to a two layered perceptron neural network.

There is really no way of deciding before hand what kernel is likely to be suitable to your data, and you should examine the performance of different alternatives using validation methods. However a popular rule of thumb is to first try the linear kernel (as this is fast to train and evaluate, and if it performs acceptably well can be used) and then the radial basis kernel (since this seems to perform well on many tasks) before examining further alternatives if neither succeed in producing acceptable models.

Platt Scaling

C-SVMs do not directly calculate probability values. They do, though, calculate the distance of a feature vector from the linear decision boundary (either in the input space if we are working with a linear kernel, or the implicit feature space if we are working with a real kernel), which we will denote as $$f(X)$$. The sign of this value gives us the hard classification - i.e. which side of the decision boundary the feature vector or its implicit transformation is on: $$\hat{Y}=sign(f(X))$$.

We can seek to estimate probability values using this by means of Platt scaling. Basic Platt scaling fits a logistic regression model to the data $$\langle f(X), Y \rangle$$ generated from the SVM and the training data $$\langle X, Y\rangle$$.

One vs All Multi-Class Classification

Although SVMs are binary classifiers, they are able to perform multi-class classification by use of one vs all classification. This is a general method of transforming a binary classifier into a multi-class classifier. In the simplest case, a set of classifiers is built, each tasked with estimating whether a feature vector is a particular class or not (a binary classification problem). These results are combined to generate the final estimate.

If only one class is predicted, this can be outputted. If multiple classes, or no classes, are predicted we can examine how ‘close’ they were to being predicted. If our classifiers output probabilities, the most probable class can be outputted (as can a probability distribution, obtained by normalizing the single class probabilities). In the C-SVM case, we can obtain probabilities (see the Platt scaling section above), or we can look at which class was closest to the decision boundary (the $$f(X)$$ distance from the Platt scaling section).

Scaling for Large Data

As with other kernel methods, SVMs struggle to scale to large data. Although they have the advantage that only support vectors enter into the calculations requires for training, still it is unusual for them to viably scale to problems with large datasets. We typically do not try to use SVMs on data with more than 10000 rows.

If you are finding that your SVMs take a long time to train, or wish to attempt to use SVMs on larger data sets, note that large cost values lead to smaller margins and fewer data points on the wrong side of their margin - and hence fewer support vectors and accordingly better scalability. There are also a number of approaches that seek to allow SVMs to work with larger data sets, but we have limited experience with these methods and do not examine them here.

Other types of SVMs

SVMs can also be used for regression and density estimation - and, from the latter, anomaly detection. Note that we would not advise using SVMs for regression. It is also possible to replace or supplement the cost hyper-parameter with a parameter, $$\nu$$, that controls the proportion input features to be used as support vectors. These are known are $$\nu$$-SVMs. The purported advantage of working with the $$\nu$$ rather than the $$C$$ hyper-parameter is that $$0 \leq \nu \leq 1$$, whereas $$0 \leq C \leq \infty$$, allowing for a bounded search region for possible values. However, $$\nu$$-SVMs encounter additional complexity issues in their optimization leading to their run-time scaling worse than C-SVMs.