Support Vector Machines: First Steps

Kernel-based learning algorithms such as support vector machine (SVM, [CortesVapnik1995]) classifiers mark the state-of-the art in pattern recognition. They employ (Mercer) kernel functions to implicitly define a metric feature space for processing the input data, that is, the kernel defines the similarity between observations. Kernel methods are well understood theoretically and give excellent results in practice. This tutorial explains how to train a standard C-SVM.

Theoretical background

The general supervised learning problem can be stated as follows. Given sample data \(S=\{(x_i,y_i)\,|\,1 \leq i \leq \ell\}\) drawn from an unknown distribution \(p\) over \(X \times Y\), the goal of binary classification is to infer a hypothesis \(h:X \to Y\) that minimizes the expected risk

\[R_p(h)= \int\limits_{X \times Y} L_{0-1}(y,h(x)) \, \text{d} p(x,y) ,\]

where \(L_{0-1}(y,z)\) is 0 if \(y=z\) and 1 otherwise.

Support vector machines (SVMs, [CortesVapnik1995]) transfer the input data to a feature space and perform linear classification in that space. For a positive semi-definite kernel function \(k:X \times X \to\mathbb{R}\), consider the feature space \(\mathcal H_k = {\text{span} \{k(x, \cdot) \,|\, x \in X\}}\) and function class \(\mathcal H_k^b = \{f = g + b\,|\, g \in \mathcal H_k, b\in \mathbb{R}\}\). The decision boundary induced by the sign of a function \(f \in \mathcal H_k^b\) is an affine hyperplane in \(\mathcal H_k\). 1-Norm Soft Margin C-SVMs find a solution to

\[\underset{f \in\mathcal H_k^b}{\text{minimize}} \frac{1}{2} \|f\|_2^2 + C \sum_{i=1}^\ell L_{\text{hinge}}(y_i, f(x_i))\]

with loss function \(L_{\text{hinge}}(y,f(x))=\max\{0, 1-(2y-1)f(x)\}\) for \(Y=\{0,1\}\). The parameter \(C >= 0\) controls the trade-off between reducing the empirical loss \(L_{\text{hinge}}\) and the complexity of the hypothesis \(\|.\|_2\) measured by its norm (neglecting the bias parameter \(b\)).

Here we have adopted the Shark library convention of choosing \(Y=\{0,1\}\) instead of \(Y=\{-1,1\}\). The latter is the common choice in the SVM literature. This explains the \(2y-1\) instead of a simple \(y\) in the hinge loss definition.

Support Vector Machines in Shark

For this tutorial the following include files are needed:

#include <shark/Algorithms/Trainers/CSvmTrainer.h> // the C-SVM trainer
#include <shark/Models/Kernels/GaussianRbfKernel.h> //the used kernel for the SVM
#include <shark/ObjectiveFunctions/Loss/ZeroOneLoss.h> //used for evaluation of the classifier
#include <shark/Data/DataDistribution.h> //includes small toy distributions

Toy problem

In this tutorial, we consider an artificial binary benchmark classification problem shipped with the Shark library:

unsigned int ell = 500;     // number of training data point
unsigned int tests = 10000; // number of test data points

Chessboard problem; // artificial benchmark data
ClassificationDataset training = problem.generateDataset(ell);
ClassificationDataset test = problem.generateDataset(tests);

Model and learning algorithm

To build an SVM, we need a KernelClassifier and an CSvmTrainer.

To define our model, we have to choose a kernel function. Here we consider the standard Gaussian/RBF kernel

\[k(x,z) = \exp(\gamma\|x-z\|^2)\]

by writing:

double gamma = 0.5;         // kernel bandwidth parameter

GaussianRbfKernel<> kernel(gamma); // Gaussian kernel

All kernels such as the GaussianRbfKernel are derived from the base class AbstractKernelFunction.

Our model is thus a kernel classifier, which is a linear classifier in feature space:

KernelClassifier<RealVector> kc; // (affine) linear function in kernel-induced feature space

A KernelClassifier can be understood as an Classifier which converts the output of a KernelExpansion to a class label by choosing the index of the maximum output. The KernelExpansion specifies a model from \(\mathcal H_k = {\text{span} \{k(x, \cdot) \,|\, x \in X\}}\) or \(\mathcal H_k^b = \{f = g + b\,|\, g \in \mathcal H_k, b\in \mathbb{R}\}\) depending on whether a bias is used or not.

Training the machine is done by:

double C = 1000.0;          // regularization parameter
bool bias = true;           // use bias/offset parameter

CSvmTrainer<RealVector> trainer(&kernel, C, bias);

Here C denotes the regularization parameter (the SVM uses the 1-norm penalty for target margin violations by default) and bias the inclusion of a bias term in the solver.. The Shark SVM training is inspired by [ChangLin2011] but uses unique features [GlasmachersIgel2006].

Configuring the trainer further:

The above lines construct a default SVM trainer with default settings for the underlying quadratic programming optimization task. In certain cases, the user may want more fine-grained control over the behaviour of the optimizer. For example, the memory cache size of the kernel matrix cache and the stopping criterion for the solver might be varied. Consider the following lines of code:

{
        //to use "double" as kernel matrix cache type internally instead of float:
        CSvmTrainer<RealVector, double> trainer(&kernel, C, bias);
        //to keep non-support vectors after training:
        trainer.sparsify() = false;
        //to relax or tighten the stopping criterion from 1e-3 (here, tightened to 1e-6)
        trainer.stoppingCondition().minAccuracy = 1e-6;
        //to set the cache size to 128MB for double (16**6 times sizeof(double), when double was selected as cache type above)
        //or to 64MB for float (16**6 times sizeof(float), when the CSvmTrainer is declared without second template argument)
        trainer.setCacheSize( 0x1000000 );
        trainer.train(kc, training);
        std::cout << "Needed " << trainer.solutionProperties().seconds << " seconds to reach a dual of " << trainer.solutionProperties().value << std::endl;
}

The first line uses one more template parameter in this alternative trainer declaration, requesting it to use double for the matrix cache internally (instead of the default float). Note that this is only needed in very rare, mathematically sensitive cases. The second line sets the trainer to not discard non-support vectors from the solution kernel expansion after training (they are discarded by default). The third line sets the desired accuracy to a lower value (i.e., more strict value, implying longer optimization times) than the default of 1e-3. The fourth line reduces the cache size (counted in numbers of stored variables of the matrix cache type) from 512MB to 128MB (had we not passed the second template argument in the first line of this snippet, it would be a reduction from 256MB to 64MB). The fifth line is again identical to the above example. The last line illustrates the use of the solutionProperties() method to access information about the optimization run after training. For more information on available options, see the documentation of AbstractSvmTrainer, QpStoppingCondition, and QpSolutionProperties (as well as potentially of the particular SVM solver you are using, i.e., binary, multi-class, one-class, etc.).

Evaluating the model

After training the model, we can evaluate it. As a performance measure, we consider the standard 0-1 loss \(L_{0-1}(y,z)\) which we apply to training and test data:

ZeroOneLoss<unsigned int> loss; // 0-1 loss
Data<unsigned int> output = kc(training.inputs()); // evaluate on training set
double train_error = loss.eval(training.labels(), output);
cout << "training error:\t" <<  train_error << endl;
output = kc(test.inputs()); // evaluate on test set
double test_error = loss.eval(test.labels(), output);
cout << "test error:\t" << test_error << endl;

Full example program

The full example program considered in this tutorial is CSvmTutorial.cpp. Another relevant example in the examples subdirectory is the SVM model selection (see the next tutorial on Support Vector Machines: Model Selection Using Cross-Validation and Grid-Search);

References

[ChangLin2011]

C.C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.

[CortesVapnik1995] (1,2)

C. Cortes and V. Vapnik. Support-Vector Networks. Machine Learning, 20, 1995.

[GlasmachersIgel2006]
  1. Glasmachers and C. Igel. Maximum-Gain Working Set Selection for SVMs. Journal of Machine Learning Research 7, 1437-1466, 2006.