Kernels

Shark offers a large library of positive semi-definite kernels, that is, functions inducing reproducing kernel Hilbert spaces [Aronszajn1950]. They underlie the “kernel trick” [Schölkopf2002], which is used, for instance, in non-linear support vector machines (SVMs).

Given some set \(\mathcal X\), a positive semi-definite kernel \(k:\mathcal X\times\mathcal X\to\mathbb R\) is a symmetric function for which

\[\sum_{i=1}^N\sum_{j=1}^N a_i a_j k(x_i, x_j) \ge 0\]

for all \(N\), all \(x_1,...,x_N\in\mathcal X\), and all \(a_1,...,a_N\in\mathbb R\).

A kernel \(k\) on \(\mathcal X\) corresponds to a scalar product in a dot product space \(\mathcal H\), the so-called feature space:

\[k(x,y) = \langle \phi(x),\phi(y) \rangle_{\mathcal H}\]

where \(x\) and \(y\) are elements of \(\mathcal X\), \(\phi\) is a map from \(\mathcal X\) to \(\mathcal H\), and \(\langle \cdot, \cdot \rangle_{\mathcal H}\) is the scalar product in \(\mathcal H\). For details we refer to [Aronszajn1950] and [Mercer1909].
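
As a concrete example, the Gaussian radial basis function (RBF) kernel \(k(x,y)=\exp(-\gamma\|x-y\|^2)\) is a positive semi-definite kernel on real vectors. The following minimal sketch evaluates it with Shark's GaussianRbfKernel class; the header paths are assumptions and may differ between Shark versions:

#include <shark/Models/Kernels/GaussianRbfKernel.h> // assumed header location
#include <shark/LinAlg/Base.h>                      // assumed header location

using namespace shark;

int main(){
    // k(x,y) = exp(-gamma * ||x - y||^2) with gamma = 0.5
    GaussianRbfKernel<RealVector> kernel(0.5);

    RealVector x(2); x(0) = 1.0; x(1) = 0.0;
    RealVector y(2); y(0) = 0.0; y(1) = 1.0;

    // operator() forwards to eval(x, y); here k(x,y) = exp(-0.5 * 2) = exp(-1)
    double kxy = kernel(x, y);
    (void) kxy;
}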

List of Classes

The list of kernels is available in the class documentation. Functions for optimizing kernel functions are given here.

Background

Many machine learning algorithms can be written such that the only operations involving input elements are scalar products between those elements. A common strategy in machine learning is to map the input data into a feature space \(\mathcal H\) and to do the learning in this feature space. If the only operations in \(\mathcal H\) are scalar products, these can be replaced by kernel function evaluations, rendering explicit computation of the mapping \(\phi\) to feature space unnecessary. This has several advantages:

  • Typically, the kernel can be computed more efficiently than the scalar product in the feature space itself. This allows for working in very high-dimensional feature spaces.

  • The kernel provides a clean interface between general and problem specific aspects of the learning machine.

Thus, the “kernel trick” allows efficient formulation of nonlinear variants of any algorithm that can be expressed in terms of dot products. The choice of the kernel function is crucial for the performance of the machine learning algorithm.

The generic distance between two points mapped to a kernel-induced feature space is given by

\[d(x,y) = \sqrt{\langle \phi(x)-\phi(y), \phi(x)-\phi(y) \rangle_{\mathcal H}} =\sqrt{k(x,x) - 2k(x,y) + k(y,y)}\]

where \(d\) is the distance between the points \(x\) and \(y\). We call a kernel normalized if \(k(x,x)=1\) for all \(x\). In this case the distance reduces to \(d(x,y) =\sqrt{2 - 2k(x,y)}\).
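
For example, the Gaussian kernel introduced above satisfies \(k(x,x)=\exp(0)=1\) for every \(x\) and is therefore normalized, so its induced distance is simply

\[d(x,y) = \sqrt{2 - 2\exp(-\gamma\|x-y\|^2)}.\]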

The base class ‘AbstractKernelFunction<InputTypeT>’

Shark provides strong support for kernel-based algorithms. The base class of all kernel functions is AbstractKernelFunction. A linear combination of kernel functions anchored at fixed points, a so-called kernel expansion, is represented in Shark by the class KernelExpansion:

\[\sum_{i=1}^N \alpha_i k(x_i, . ) + b\]

with \(x_1,...,x_N\in\mathcal X\), \(\alpha_1,...,\alpha_N\in\mathbb R\), and optional bias/offset parameter \(b\in\mathbb R\).

Many kernel-based algorithms need to repeatedly evaluate the kernel on some training data points \(x_1,\dots,x_N\), or they operate directly on the kernel (Gram) matrix \(K\) with entries \(K_{ij}=k(x_i,x_j)\). To save computation time, the matrix \(K\) would ideally be stored in memory. Depending on the hardware, however, even training sets with a few hundred thousand examples can make this prohibitive. Therefore, often only parts of \(K\) are calculated at a time, most often matrix rows or blocks. In Shark, the classes KernelMatrix and CachedMatrix, as well as some derived and sibling classes, encapsulate kernel Gram matrices.
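
As a simple illustration of computing such a block at a time, the sketch below fills one block \(K_{ij}=k(x_i,z_j)\) using the batch evaluation interface described further down, assuming dense RealVector inputs (for which a batch is a RealMatrix; header paths are assumptions). Classes like KernelMatrix and CachedMatrix manage such blocks internally, so this only shows the idea:

#include <shark/Models/Kernels/AbstractKernelFunction.h> // assumed header location
#include <shark/LinAlg/Base.h>                           // assumed header location

using namespace shark;

// Computes the Gram block K with K(i,j) = k(x_i, z_j) for two batches of
// dense inputs, where each row of X and Z is one input point.
RealMatrix gramBlock(AbstractKernelFunction<RealVector> const& kernel,
                     RealMatrix const& X, RealMatrix const& Z){
    RealMatrix K(X.size1(), Z.size1());
    kernel.eval(X, Z, K); // batch evaluation fills the whole block at once
    return K;
}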

Types

First, we introduce the templated types of a kernel, which are all inferred from the single template argument InputType using several metafunctions. As for Models, we have the InputType and the BatchInputType, which is a batch of inputs. In contrast to Models, we also introduce special reference types:

  • InputType: Argument type of the kernel

  • BatchInputType: Batch of arguments; the same as Batch<InputType>::type

  • ConstInputReference: Constant reference to InputType as returned by ConstProxyReference<InputType>::type; by default this is InputType const&

  • ConstBatchInputReference: Constant reference to BatchInputType as returned by ConstProxyReference<BatchInputType>::type

The reason for the ConstBatchInputReference and ConstInputReference types is that we want to exploit the structure of the arguments to prevent unnecessary copying. Consider the common case that the kernel is evaluated only on single elements of a batch of data. If the batch type is a matrix, the argument is a row of this matrix and not a vector. Without proxy references, the argument would automatically be copied into a temporary vector, which is then fed into the kernel. This is of course unnecessary, and for fast kernels the copying can exceed the running time of the kernel evaluation itself. Thus we use proxy references for vectors, which simply treat matrix rows and vectors in the same way. Currently, this optimization only works for dense vectors and not, for example, for sparse vectors or more complex types.
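
A minimal sketch of this use case, assuming dense inputs and the row() proxy of Shark's linear algebra for accessing matrix rows (header paths are assumptions):

#include <shark/Models/Kernels/GaussianRbfKernel.h> // assumed header location
#include <shark/LinAlg/Base.h>                      // assumed header location

using namespace shark;

int main(){
    GaussianRbfKernel<RealVector> kernel(1.0);

    // A batch of dense inputs is a RealMatrix; each row is one input point.
    RealMatrix X(10, 3);
    // ... fill X with data ...

    // row(X, i) yields a matrix-row proxy rather than a copied RealVector.
    // Since ConstInputReference is a proxy reference for dense types, the
    // rows can be passed to the kernel without creating temporary vectors.
    double k01 = kernel(row(X, 0), row(X, 1));
    (void) k01;
}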


Flags

Like a Model, every kernel has a set of flags and convenience access functions which indicate the traits and capabilities of the kernel:

  • HAS_FIRST_PARAMETER_DERIVATIVE, hasFirstParameterDerivative: If set, the kernel can evaluate the first derivative w.r.t. its parameters

  • HAS_FIRST_INPUT_DERIVATIVE, hasFirstInputDerivative: If set, the kernel can evaluate the first derivative w.r.t. its left input argument; this is no restriction, since kernel functions are symmetric

  • IS_NORMALIZED, isNormalized: \(k(x,x)=1\) holds for all \(x\)

  • SUPPORTS_VARIABLE_INPUT_SIZE, supportsVariableInputSize: The number of input dimensions may vary between different calls to \(k(x,y)\); this is needed for kernel evaluation on inputs with missing features
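
For example, code that tunes kernel parameters by gradient descent can first query the corresponding flag; a small sketch (the helper function is hypothetical, header paths are assumptions):

#include <shark/Models/Kernels/AbstractKernelFunction.h> // assumed header location
#include <shark/LinAlg/Base.h>                           // assumed header location
#include <stdexcept>

using namespace shark;

// Guard to be called before gradient-based tuning of kernel parameters:
// only request a parameter derivative if the kernel reports support for it.
void requireParameterDerivative(AbstractKernelFunction<RealVector> const& kernel){
    if(!kernel.hasFirstParameterDerivative()){
        throw std::runtime_error("kernel does not provide a parameter derivative");
    }
}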

Evaluation

Next, we introduce the functions for evaluating kernels. There are three types of evaluation functions:

  • The first version simply calculates the kernel value given two single inputs.

  • The second version computes the kernel evaluation of two batches of inputs. Here, the inner product in Hilbert space is calculated between all points of the first and the second batch, so the result is a matrix of inner products, that is, a block of the kernel Gram matrix.

  • The third version also takes two batches, but additionally a state object. The state is a data structure in which the kernel can store intermediate results of the evaluation of the kernel values; these can later be reused in the computation of the derivatives. Thus, when derivatives are to be computed, this version must be called beforehand to fill the state object with the correct values.

There is no version of the derivative with two single inputs, because this is a rare use case; if needed, batches of size one should be used. The reason for the state object being external to the kernel class is that this design allows concurrent evaluation of the kernel from different threads, with each thread holding its own state object.

With this in mind, we now present the list of functions for eval, including the convenience operator(). In the following, let I be a ConstInputReference and B a ConstBatchInputReference.

  • double eval(I x, I z): Calculates \(k(x,z)\)

  • void eval(B X, B Z, RealMatrix& K): Calculates \(K_{ij}=k(x_i,z_j)\) for all elements \(x_i\) of X and \(z_j\) of Z

  • void eval(B X, B Z, RealMatrix& K, State& state): Calls eval(X,Z,K) while storing intermediate results needed by the derivative functions

  • double operator()(I x, I z): Calls eval(x,z)

  • RealMatrix operator()(B X, B Z): Calls eval(X,Z,K) and returns K

For a kernel, it is sufficient to implement the batch version of eval that stores the state, since all other functions can rely on it. However, if speed is relevant, all three eval functions should be implemented in order to avoid unnecessary copy operations.
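
The following sketch exercises all three evaluation interfaces for a dense kernel; header paths are assumptions:

#include <shark/Models/Kernels/GaussianRbfKernel.h> // assumed header location
#include <shark/LinAlg/Base.h>                      // assumed header location
#include <boost/shared_ptr.hpp>

using namespace shark;

int main(){
    GaussianRbfKernel<RealVector> kernel(1.0);

    RealVector x(3), z(3);
    RealMatrix X(4, 3), Z(5, 3); // two batches, one input per row
    // ... fill x, z, X and Z with data ...

    // single-input evaluation
    double kxz = kernel.eval(x, z); // or equivalently kernel(x, z)

    // batch evaluation: K is the 4x5 block with K(i,j) = k(x_i, z_j)
    RealMatrix K = kernel(X, Z);

    // batch evaluation that also fills a state object for later derivatives
    boost::shared_ptr<State> state = kernel.createState();
    kernel.eval(X, Z, K, *state);

    (void) kxz;
}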

Distances

As outlined before, kernels can also be used to compute distances between points in \(\mathcal H\):

  • double featureDistanceSqr(I x, I z): Returns the squared distance between x and z in the feature space

  • double featureDistance(I x, I z): Returns the distance between x and z in the feature space

  • RealMatrix featureDistanceSqr(B X, B Z): Returns the squared distances between all points in X and all points in Z
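
A small usage sketch, assuming a dense kernel and the relation given in the Background section (header paths are assumptions):

#include <shark/Models/Kernels/GaussianRbfKernel.h> // assumed header location
#include <shark/LinAlg/Base.h>                      // assumed header location

using namespace shark;

int main(){
    GaussianRbfKernel<RealVector> kernel(1.0);

    RealVector x(3), z(3);
    // ... fill x and z with data ...

    // distance in the feature space: d(x,z) = sqrt(k(x,x) - 2 k(x,z) + k(z,z))
    double d2 = kernel.featureDistanceSqr(x, z);
    double d  = kernel.featureDistance(x, z); // equals sqrt(d2)
    (void) d2; (void) d;
}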

Derivatives

Some kernels are differentiable with respect to their parameters. This can, for example, be exploited for gradient-based optimization of these parameters, which in turn amounts to a computationally efficient way of finding a suitable space \(\mathcal H\) in which to solve a given learning problem. Further, if the inputs come from a differentiable space as well (for example, real vectors), even the derivative with respect to the inputs can be computed.

The derivatives are weighted as outlined in Shark Conventions for Derivatives. The parameter derivative is a weighted sum of the derivatives over all elements of the computed block of the kernel matrix. The input derivative is taken with respect to the left argument and is weighted over the inputs of the right argument.
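
In formulas, writing \(k_\theta\) for a kernel with parameter vector \(\theta\), \(x_1,\dots,x_m\) and \(z_1,\dots,z_n\) for the two batches, and \(w_{ij}\) for the supplied weights (notation introduced here for illustration), the parameter derivative computed by weightedParameterDerivative is

\[\frac{\partial}{\partial\theta} \sum_{i=1}^m \sum_{j=1}^n w_{ij}\, k_\theta(x_i, z_j) = \sum_{i=1}^m \sum_{j=1}^n w_{ij}\, \frac{\partial k_\theta(x_i, z_j)}{\partial\theta},\]

while weightedInputDerivative returns for every left input \(x_i\) the weighted sum \(\sum_{j=1}^n w_{ij}\, \partial k_\theta(x_i, z_j)/\partial x_i\) over the right inputs.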


The methods for evaluating the derivatives are:

  • weightedParameterDerivative: Computes the derivative with respect to the parameters, weighted over all elements of a block of the kernel Gram matrix

  • weightedInputDerivative: Computes the derivative with respect to the left argument, weighted over all right arguments

Putting everything together, we can calculate the derivative of a kernel like this:

BatchInputType X; //first batch of inputs
BatchInputType Y; //second batch of inputs
RealMatrix K;     //resulting block of the kernel Gram matrix
MyKernel kernel;  //the differentiable kernel

// evaluate K for X and Y and store the intermediate results in the state object
boost::shared_ptr<State> state = kernel.createState();
kernel.eval(X, Y, K, *state);

// somehow compute some weights and calculate the parameter derivative
RealMatrix weights = someFunction(K, X, Y);
RealVector derivative;
kernel.weightedParameterDerivative(X, Y, weights, *state, derivative);

Note that the weight matrix must have the same dimensions as the computed block \(K\) of the kernel Gram matrix, that is, one weight per evaluated pair of inputs. weightedParameterDerivative does not check this, so the caller has to make sure that the sizes match.

Other

Kernels support several other concepts. They have parameters, can be serialized and have an external state object.

  • numberOfParameters: The number of parameters that can be optimized

  • parameterVector: Returns the current parameter vector of the kernel object

  • setParameterVector: Sets a new parameter vector

  • createState: Returns a newly created State object holding the state to be stored in eval
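
A short sketch of the parameter interface; header paths are assumptions, and for GaussianRbfKernel the parameter vector has a single entry controlling the kernel bandwidth:

#include <shark/Models/Kernels/GaussianRbfKernel.h> // assumed header location
#include <shark/LinAlg/Base.h>                      // assumed header location
#include <cstddef>

using namespace shark;

int main(){
    GaussianRbfKernel<RealVector> kernel(1.0);

    std::size_t n = kernel.numberOfParameters(); // 1 for this kernel

    // read, modify and write back the parameter vector; this is the
    // interface that gradient-based kernel optimization works with
    RealVector theta = kernel.parameterVector();
    theta *= 0.5;
    kernel.setParameterVector(theta);

    (void) n;
}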

References

[Aronszajn1950]

Aronszajn, N. Theory of Reproducing Kernels. Transactions of the American Mathematical Society 68 (3): 337–404, 1950.

[Mercer1909]

Mercer, J. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London, 1909.

[Schölkopf2002]

Schölkopf, B. and Smola, A. Learning with Kernels. MIT Press, 2002.