Linear Kernel Combinations (and a bit of MKL)

This tutorial first lists some background information on Multiple Kernel Learning (MKL) algorithms and Linear Kernel Combinations (LKCs). In the second part, we start with the actual, hands-on Shark tutorial. This includes a tour of the different kernel functions which might be handy, as well as the MKL-typical kernel normalization techniques.

Shark does not currently include a “canonical” MKL algorithm that optimizes the kernel weights and the parameters of an SVM kernel expansion jointly. Rather, it offers three kernel classes generally used in MKL algorithms. The most general of these is the MklKernel, which adds up sub-kernels that can operate on completely different input types. The SubrangeKernel lets the sub-kernels operate on different index ranges of the same input vector. Finally, the WeightedSumKernel simply passes its inputs to all sub-kernels unchanged.

What you then do with these kernels, i.e., how you choose their weights and train the resulting SVM is so far up to you.

MKL and LKCs: Background

In recent years, so-called Multiple Kernel Learning (MKL) algorithms for SVMs have become fashionable; see for example [Gonen2011] for a review. That line of research has at its core the idea of using, instead of one single kernel, a (convex) linear combination of base kernels as a compound kernel within an SVM.

In more detail: if \(k_1\) and \(k_2\) are positive definite kernel functions, then

\[k = a \cdot k_1 + b \cdot k_2\]

is again a positive definite kernel for \(a \geq 0\) and \(b \geq 0\) (and \(k \neq 0\)). Most of the MKL literature is based on linear kernel combinations (LKCs) of the general form

\[\begin{split}k = \sum_i \theta_i k_i \; , \\ \theta_i \geq 0\end{split}\]

Multiple Kernel Learning then refers to learning the positive, real-valued weight vector \(\theta\). Often, an additional constraint is enforced on the kernel weights, for example \(\|\theta\|^2 \leq 1\). It should be noted that the process of learning the kernel weights (and possibly including training the SVM at the same time) is referred to as MKL, but merely employing an LKC is not. The most prominent school of MKL algorithms share the following characteristics:

  • The kernel weights \(\theta\) are optimized together with the SVM weight vector \(\alpha\) (or \(w\) ) in one single, joint optimization problem. See [Kloft2011] or [Gonen2011].

  • The sub-kernels \(k_i\) are usually regarded as (pseudo-)parameterless. In other words, if the sub-kernels do have parameters, these are fixed to one particular value and not optimized.

Since learning the kernel weights is integrated into the (modified) main SVM problem such that it remains convex, proponents of MKL argue that MKL is a convincing way of “learning a kernel”: the weights are guaranteed to reach “the” global optimum to the optimization problem. On the other hand, just because a problem is convex does not mean that it is helpful and/or that the base kernels were selected in a helpful way.

A second school of MKL algorithms employ a two-stage process. First, the kernel weights are optimized using, for example, the kernel-target alignment as a criterion. In the second step, the full SVM is then trained as usual with fixed kernel weights.

In practice and for many applications, the experimental results of a wide range of MKL algorithms have proven not very convincing [Gonen2011].

Regardless of the kernel weight optimization strategy used and its respective success, it is instructive to recall the two (different) main motivations for using LKCs in a learning task:

  • In the first scenario, each kernel operates on the exact same set of features. That is, the input to each sub-kernel \(k_i\) is the same as the one to the “mother” kernel \(k\). The sub-kernels may then either stem from different kernel function families, or basically be the same mathematical function but with different parameters. The most popular scenario or argument made for such a setting is that this way, each sub-kernel can be viewed as a candidate kernel for solving the problem at hand. Instead of selecting the kernel family type and/or the sub-kernels’ parameters in a traditional grid-search (or other hyperparameter optimization) setting, MKL algorithms can “choose” their favorite kernel themselves, and thus the best sub-kernel parameter, by increasing the weights \(\theta_i\) for all sub-kernels \(k_i\) which have a meaningful sub-kernel parameter. This is sometimes seen by MKL proponents as eliminating or circumventing the SVM model selection problem. We will refer to this first scenario as the MKL kernel selection scenario.

  • In the second scenario, each kernel operates on a different sub-range of the input feature vectors. This is for example desired when the feature vector is a concatenation of data obtained through different methods, or reflecting different properties of the samples. For example, in image processing and computer vision, it is common practice to concatenate a color histogram and a histogram of gradients, etc. Another typical application domain is biological data, where many different ways to characterize or measure the properties of a molecule are conceivable. We will refer to this second scenario as the MKL information integration scenario.

Of course, hybrid scenarios, combining both of the above approaches, are conceiveable.

MKL and LKCs in Shark

Shark offers three classes which allow for positive linear combinations of sub-kernels: The WeightedSumKernel, the SubrangeKernel, and the MklKernel. All three will be described below. But here in short: in the WeightedSumKernel, the same input gets passed to all sub-kernels. With the SubrangeKernel, each sub-kernel operates on a certain index range of an input vector. The MklKernel allows the sub-kernels to be completely heterogenous (e.g., one operating on a custom data structure and one on a RealVector).

Throughout the tutorial, we will use the following includes and namespaces:

#include <shark/Data/Dataset.h>
#include <shark/Core/Random.h>
#include <shark/Algorithms/Trainers/NormalizeKernelUnitVariance.h>
#include <shark/Models/Kernels/GaussianRbfKernel.h>
#include <shark/Models/Kernels/WeightedSumKernel.h>
#include <shark/Models/Kernels/SubrangeKernel.h>
#include <shark/Models/Kernels/MklKernel.h>
#include <shark/Models/Kernels/LinearKernel.h>
#include <shark/Models/Kernels/DiscreteKernel.h>
#include <shark/Models/Kernels/PolynomialKernel.h>
#include <boost/fusion/algorithm/iteration/fold.hpp>
#include <boost/fusion/include/as_vector.hpp>

using namespace shark;
using namespace std;

and (almost everywhere) two two-dimensional test points like so:

// test points
RealVector x1(2);
x1(0)=2;
x1(1)=1;
RealVector x2(2);
x2(0)=-2;
x2(1)=1;

The WeightedSumKernel

The WeightedSumKernel class implements a kernel

\[k(x,z) = \frac{\sum_i \theta_i k_i(x,z)}{\sum_i \theta_i} \;\]

with the first kernel weight always fixed to one (eliminating one redundant degree of freedom). The denominator serves to normalize the kernel by the sum of the sub-kernel weights. Note that internally, the kernel weights are computed as exponentials of the externally visible parameters. To be clear: in other words, when you set a parameter vector, it will only affect the \(N-1\) last kernel weights (the first one being fixed to one [or zero, in parameter space]), and the weights will be the exponentials of what you passed as parameter vector. The latter is done to support unconstrained optimization (no matter what parameters you set, you always get positive weights). We next set up two base kernels like so:

// initialize kernels
DenseRbfKernel baseKernel1( 0.1 );
DenseRbfKernel baseKernel2( 0.01 );
std::vector< AbstractKernelFunction<RealVector> * > kernels1;
kernels1.push_back( &baseKernel1 );
kernels1.push_back( &baseKernel2 );
DenseWeightedSumKernel kernel1( kernels1 );

where DenseRbfKernel is a shorthand typedef for an template input type of RealVector. This is all needed to know to get started – with maybe one addition: the WeightedSumKernel (and in fact, all three LKC kernels presented in this tutorial) offer three further methods, setAdaptive(), setAdaptiveAll(), and isAdaptive(). These set or show whether a sub-kernel’s sub-parameters are part of the overall parameter vector:

void setAdaptive( std::size_t index, bool b = true ){...}
void setAdaptiveAll( bool b = true ) {...}
bool isAdaptive( std::size_t index ) const {...}

By default, the sub-kernels contribution to the overall parameter vector is turned off. That is, the only parameters initially visible are the \(N-1\) last kernel weights (the first one being fixed to one [or zero, in parameter space]). Lines of code say it best:

// examine initial state
std::cout << endl << " ======================= WeightedSumKernel: ======================= " << std::endl;
cout << endl << "kernel1.isAdaptive(0): " << kernel1.isAdaptive(0) << endl;
cout << "kernel1.isAdaptive(1): " << kernel1.isAdaptive(1) << endl;
cout << "kernel1.numberOfParameters(): " << kernel1.numberOfParameters() << endl;
cout << "kernel1.parameterVector(): " << kernel1.parameterVector() << endl;
cout << "kernel1.eval(x1,x2): " << kernel1.eval(x1,x2) << endl << endl;


// change something
RealVector new_params_1( kernel1.numberOfParameters() );
new_params_1(0) = 1.0;
kernel1.setParameterVector( new_params_1 );


// examine again
cout << "kernel1.parameterVector() with 1st parameter set to 1: " << kernel1.parameterVector() << endl;
cout << "kernel1.eval(x1,x2): " << kernel1.eval(x1,x2) << endl << endl;


// change something else
kernel1.setAdaptive(0,true);


// examine once more
cout << "kernel1.isAdaptive(0): " << kernel1.isAdaptive(0) << endl;
cout << "kernel1.isAdaptive(1): " << kernel1.isAdaptive(1) << endl;
cout << "kernel1.numberOfParameters(): " << kernel1.numberOfParameters() << endl;
cout << "kernel1.parameterVector(): " << kernel1.parameterVector() << endl<< endl;


// another change
kernel1.setAdaptive(0,false);
kernel1.setAdaptive(1,true);


// examining again
cout << "kernel1.isAdaptive(0): " << kernel1.isAdaptive(0) << endl;
cout << "kernel1.isAdaptive(1): " << kernel1.isAdaptive(1) << endl;
cout << "kernel1.numberOfParameters(): " << kernel1.numberOfParameters() << endl;
cout << "kernel1.parameterVector(): " << kernel1.parameterVector() << endl<< endl;


// last change
kernel1.setAdaptiveAll(true);


// last examination
cout << "kernel1.isAdaptive(0): " << kernel1.isAdaptive(0) << endl;
cout << "kernel1.isAdaptive(1): " << kernel1.isAdaptive(1) << endl;
cout << "kernel1.numberOfParameters(): " << kernel1.numberOfParameters() << endl;
cout << "kernel1.parameterVector(): " << kernel1.parameterVector() << endl;
cout << "kernel1.eval(x1,x2): " << kernel1.eval(x1,x2) << endl << endl;

The output of this should be:

kernel1.isAdaptive(0): 0
kernel1.isAdaptive(1): 0
kernel1.numberOfParameters(): 1
kernel1.parameterVector(): [1](0)
kernel1.eval(x1,x2): 0.52702

kernel1.parameterVector() with 1st parameter set to 1: [1](1)
kernel1.eval(x1,x2): 0.677265

kernel1.isAdaptive(0): 1
kernel1.isAdaptive(1): 0
kernel1.numberOfParameters(): 2
kernel1.parameterVector(): [2](1,0.1)

kernel1.isAdaptive(0): 0
kernel1.isAdaptive(1): 1
kernel1.numberOfParameters(): 2
kernel1.parameterVector(): [2](1,0.01)

kernel1.isAdaptive(0): 1
kernel1.isAdaptive(1): 1
kernel1.numberOfParameters(): 3
kernel1.parameterVector(): [3](1,0.1,0.01)
kernel1.eval(x1,x2): 0.677265

The kernel evaluations yield exactly what we would expect:

\[\begin{split}\frac{( 1.0*\exp(-0.1*16) + 1.0*\exp(-0.01*16) ) }{ ( 1.0 + 1.0 )} = 0.527020 \\ \frac{( 1.0*\exp(-0.1*16) + e*\exp(-0.01*16) ) }{ ( 1.0 + e )} = 0.677265 \; .\end{split}\]

The above should also make clear how the sub-kernels’ sub-parameters are “seen” by other Shark algorithms, for example during external parameter optimization.

The SubrangeKernel

The second LKC class is the SubrangeKernel. This is similar to the WeightedSumKernel, but tailored to the above mentioned “information integration scenario”. Before, in the “kernel selection scenario”, each sub-kernel operated on the entire, full feature vector. In the “information integration scenario”, each sub-kernel only operates on a continuous sub-set of the feature vector:

\[k(x,z) = \frac{ \sum_i \theta_i k_i(x_{b_{i}-e_{i}},z_{b_{i}-e_{i}}) } { \sum_i \theta_i } \, .\]

The index range \(b_{i}-e_{i}\) denotes the \(i\)-th continuous sub-range (inclusive beginning to exclusive end) of the overall feature vector. Naturally, we need to pass these index pairs to the SubrangeKernel for each sub-kernel. This is done during construction. First, we set up the sub-kernels as before:

DenseRbfKernel baseKernel3(0.1);
DenseRbfKernel baseKernel4(0.01);
std::vector<AbstractKernelFunction<RealVector>* > kernels2;
kernels2.push_back(&baseKernel3);
kernels2.push_back(&baseKernel4);

Next, we set up a vector of index pairs for the begin- and end-indices for each sub-kernel. The SubrangeKernel itself is constructed by passing one vector of kernels and one of indices:

std::vector< std::pair< std::size_t, std::size_t > > indcs_1;
indcs_1.push_back( std::make_pair( 0,2 ) );
indcs_1.push_back( std::make_pair( 0,2 ) );
DenseSubrangeKernel kernel2( kernels2, indcs_1 );

In fact, the SubrangeKernel inherits from the WeightedSumKernel. Thus, besides the constructor, the interfaces are identical. For starters, we let both kernels treat all features. This is equivalent to the WeightedSumKernel example above, as shown by the corresponding commands:

// examine initial state
std::cout << endl << " ======================= SubrangeKernel, full index range: ======================= " << std::endl;
cout << endl << "kernel2.isAdaptive(0): " << kernel2.isAdaptive(0) << endl;
cout << "kernel2.isAdaptive(1): " << kernel2.isAdaptive(1) << endl;
cout << "kernel2.numberOfParameters(): " << kernel2.numberOfParameters() << endl;
cout << "kernel2.parameterVector(): " << kernel2.parameterVector() << endl;
cout << "kernel2.eval(x1,x2): " << kernel2.eval(x1,x2) << endl << endl;


// change something
RealVector new_params_2( kernel2.numberOfParameters() );
new_params_2(0) = 1.0;
kernel2.setParameterVector( new_params_2 );


// examine again
cout << "kernel2.parameterVector() with 1st parameter set to 1: " << kernel2.parameterVector() << endl;
cout << "kernel2.eval(x1,x2): " << kernel2.eval(x1,x2) << endl << endl;


// change something else
kernel2.setAdaptive(0,true);


// examine once more
cout << "kernel2.isAdaptive(0): " << kernel2.isAdaptive(0) << endl;
cout << "kernel2.isAdaptive(1): " << kernel2.isAdaptive(1) << endl;
cout << "kernel2.numberOfParameters(): " << kernel2.numberOfParameters() << endl;
cout << "kernel2.parameterVector(): " << kernel2.parameterVector() << endl<< endl;


// another change
kernel2.setAdaptive(0,false);
kernel2.setAdaptive(1,true);


// examining again
cout << "kernel2.isAdaptive(0): " << kernel2.isAdaptive(0) << endl;
cout << "kernel2.isAdaptive(1): " << kernel2.isAdaptive(1) << endl;
cout << "kernel2.numberOfParameters(): " << kernel2.numberOfParameters() << endl;
cout << "kernel2.parameterVector(): " << kernel2.parameterVector() << endl<< endl;


// last change
kernel2.setAdaptiveAll(true);


// last examination
cout << "kernel2.isAdaptive(0): " << kernel2.isAdaptive(0) << endl;
cout << "kernel2.isAdaptive(1): " << kernel2.isAdaptive(1) << endl;
cout << "kernel2.numberOfParameters(): " << kernel2.numberOfParameters() << endl;
cout << "kernel2.parameterVector(): " << kernel2.parameterVector() << endl;
cout << "kernel2.eval(x1,x2): " << kernel2.eval(x1,x2) << endl << endl;

and its resulting output:

kernel2.isAdaptive(0): 0
kernel2.isAdaptive(1): 0
kernel2.numberOfParameters(): 1
kernel2.parameterVector(): [1](0)
kernel2.eval(x1,x2): 0.52702

kernel2.parameterVector() with 1st parameter set to 1: [1](1)
kernel2.eval(x1,x2): 0.677265

kernel2.isAdaptive(0): 1
kernel2.isAdaptive(1): 0
kernel2.numberOfParameters(): 2
kernel2.parameterVector(): [2](1,0.1)

kernel2.isAdaptive(0): 0
kernel2.isAdaptive(1): 1
kernel2.numberOfParameters(): 2
kernel2.parameterVector(): [2](1,0.01)

kernel2.isAdaptive(0): 1
kernel2.isAdaptive(1): 1
kernel2.numberOfParameters(): 3
kernel2.parameterVector(): [3](1,0.1,0.01)
kernel2.eval(x1,x2): 0.677265

Now we repeat the above scenario again, however with each sub-kernel operating on different feature ranges. Setting up the kernels and indices…:

DenseRbfKernel baseKernel5(0.1);
DenseRbfKernel baseKernel6(0.01);
std::vector<AbstractKernelFunction<RealVector>* > kernels3;
kernels3.push_back(&baseKernel5);
kernels3.push_back(&baseKernel6);


std::vector< std::pair< std::size_t, std::size_t > > indcs_2;
indcs_2.push_back( std::make_pair( 0,1 ) );
indcs_2.push_back( std::make_pair( 1,2 ) );
DenseSubrangeKernel kernel3( kernels3, indcs_2 );

… and again issuing the familiar commands:

// examine initial state
std::cout << endl << " ======================= SubrangeKernel partial index range: ======================= " << std::endl;
cout << endl << "kernel3.isAdaptive(0): " << kernel3.isAdaptive(0) << endl;
cout << "kernel3.isAdaptive(1): " << kernel3.isAdaptive(1) << endl;
cout << "kernel3.numberOfParameters(): " << kernel3.numberOfParameters() << endl;
cout << "kernel3.parameterVector(): " << kernel3.parameterVector() << endl;
cout << "kernel3.eval(x1,x2): " << kernel3.eval(x1,x2) << endl << endl;


// change something
RealVector new_params_3( kernel3.numberOfParameters() );
new_params_3(0) = 1.0;
kernel3.setParameterVector( new_params_3 );


// examine again
cout << "kernel3.parameterVector() with 1st parameter set to 1: " << kernel3.parameterVector() << endl;
cout << "kernel3.eval(x1,x2): " << kernel3.eval(x1,x2) << endl << endl;


// change something else
kernel3.setAdaptive(0,true);


// examine once more
cout << "kernel3.isAdaptive(0): " << kernel3.isAdaptive(0) << endl;
cout << "kernel3.isAdaptive(1): " << kernel3.isAdaptive(1) << endl;
cout << "kernel3.numberOfParameters(): " << kernel3.numberOfParameters() << endl;
cout << "kernel3.parameterVector(): " << kernel3.parameterVector() << endl<< endl;


// another change
kernel3.setAdaptive(0,false);
kernel3.setAdaptive(1,true);


// examining again
cout << "kernel3.isAdaptive(0): " << kernel3.isAdaptive(0) << endl;
cout << "kernel3.isAdaptive(1): " << kernel3.isAdaptive(1) << endl;
cout << "kernel3.numberOfParameters(): " << kernel3.numberOfParameters() << endl;
cout << "kernel3.parameterVector(): " << kernel3.parameterVector() << endl<< endl;


// last change
kernel3.setAdaptiveAll(true);


// last examination
cout << "kernel3.isAdaptive(0): " << kernel3.isAdaptive(0) << endl;
cout << "kernel3.isAdaptive(1): " << kernel3.isAdaptive(1) << endl;
cout << "kernel3.numberOfParameters(): " << kernel3.numberOfParameters() << endl;
cout << "kernel3.parameterVector(): " << kernel3.parameterVector() << endl;
cout << "kernel3.eval(x1,x2): " << kernel3.eval(x1,x2) << endl << endl;

We would now expect as outcome of the kernel computations:

\[\begin{split}\frac{( 1.0*\exp(-0.1*16) + 1.0*\exp(-0.01*0) )}{( 1.0 + 1.0 )} = 0.600948 \\ \frac{( 1.0*\exp(-0.1*16) + e*\exp(-0.01*0) )} {( 1.0 + e )} = 0.785357\end{split}\]

Both values are exactly what we get from the code output:

kernel3.isAdaptive(0): 0
kernel3.isAdaptive(1): 0
kernel3.numberOfParameters(): 1
kernel3.parameterVector(): [1](0)
kernel3.eval(x1,x2): 0.600948

kernel3.parameterVector() with 1st parameter set to 1: [1](1)
kernel3.eval(x1,x2): 0.785357

kernel3.isAdaptive(0): 1
kernel3.isAdaptive(1): 0
kernel3.numberOfParameters(): 2
kernel3.parameterVector(): [2](1,0.1)

kernel3.isAdaptive(0): 0
kernel3.isAdaptive(1): 1
kernel3.numberOfParameters(): 2
kernel3.parameterVector(): [2](1,0.01)

kernel3.isAdaptive(0): 1
kernel3.isAdaptive(1): 1
kernel3.numberOfParameters(): 3
kernel3.parameterVector(): [3](1,0.1,0.01)
kernel3.eval(x1,x2): 0.785357

The MklKernel

The third class is the MklKernel. It is similar to the WeightedSumKernel and the SubrangeKernel, except that it adds up kernels operating on possibly completely different inputs:

\[k(x,z) = \frac{ \sum_i \theta_i k_i(x_i,z_i) } { \sum_i \theta_i } \, .\]

That is, \(x_i\) and \(x_j\) (and hence \(k_i\) and \(k_j\)) are allowed to have very different structure (rather than merely being different subranges of the same input vector). The MklKernel thus allows for the most diverse information integration settings. This flexibility comes at a small price of added usage code complexity.

First, there is the question what data type the aggregated tuple of sub-inputs \(x=(x_0,x_1,...)\) should have. Shark currently supports binding macros for arbitrary structures to boost::fusion. For most purposes, it is easiest to declare a struct as a composite data type and then adapt it for boost::fusion, like so:

struct HeterogeneousInputStruct{
    shark::RealVector rv1;
    std::size_t st2;
    shark::RealVector crv3;
};

#ifndef DOXYGEN_SHOULD_SKIP_THIS
    BOOST_FUSION_ADAPT_STRUCT(
        HeterogeneousInputStruct,
        (shark::RealVector, rv1)(std::size_t, st2)(shark::RealVector, crv3)
    )
#endif /* DOXYGEN_SHOULD_SKIP_THIS */

namespace shark{
    template<>
    struct Batch< HeterogeneousInputStruct >{
        SHARK_CREATE_BATCH_INTERFACE_NO_TPL(
            HeterogeneousInputStruct,
            (shark::RealVector, rv1)(std::size_t, st2)(shark::RealVector, crv3)
        )
    };
}

Here, the first block declares the structure itself. The second block tells boost::fusion that the struct can be seen or treated as a tuple. The third block tells Shark to create a suitable batch structure (cf. the Batch tutorial) for it. (Side note: the code is in the beginning of the overall tutorial .cpp file because the two macros require to be called at global scope.)

Now that we created and announced the data structure, we fill it with data:

// set dimensions for data
std::size_t const num_samples = 2;
std::size_t const dim_nonzeros = 2;
std::size_t const max_elem_discr_kernel = 3;
std::size_t const dim_sparse = 5;
// create temporary helper container
std::vector<HeterogeneousInputStruct> data( num_samples );
// and fill it
data[0].rv1.resize( dim_nonzeros ); data[0].crv3.resize( dim_sparse); //size 5
data[1].rv1.resize( dim_nonzeros ); data[1].crv3.resize( dim_sparse); //size 5
data[0].rv1(0) = 1.0; data[0].rv1(1) = -1.0; data[0].crv3(1) = -0.5; data[0].crv3(4) = 8.0;
data[1].rv1(0) = 1.0; data[1].rv1(1) = -2.0; data[1].crv3(1) =  1.0; data[1].crv3(3) = 0.1;
data[0].st2 = 1; data[1].st2 = 2;
// and use it to create the 'real' dataset
Data<HeterogeneousInputStruct> dataset = createDataFromRange( data, 10 );

Next, we create all sub-kernels and the overall MklKernel:

//create state matrix for the discrete kernel. necessary but not so relevant
RealMatrix matK( max_elem_discr_kernel, max_elem_discr_kernel );
matK(0,0) = 0.05; matK(1,1) = 1.0;  matK(2,2) = 0.5;
matK(0,1) = matK(1,0) = 0.2; matK(0,2) = matK(2,0) = 0.4;  matK(1,2) = matK(2,1) = 0.6;
// set up base kernels
DenseRbfKernel baseKernelRV1(0.1);
DiscreteKernel baseKernelST2(matK);
DenseLinearKernel baseKernelCRV3;
MklKernel<HeterogeneousInputStruct> mkl_kernel( boost::fusion::make_vector( &baseKernelRV1, &baseKernelST2, &baseKernelCRV3) );

The first three lines provide a state matrix for the discrete kernel (basically a look-up matrix). The second three set up the three base kernels as usual. The last line finally creates the MklKernel via yet another boost::fusion command.

We now again examine the MklKernel’s state after creation:

// examine initial state
std::cout << endl << " ======================= MklKernel: ======================= " << std::endl;
cout << endl << "mkl_kernel.isAdaptive(0): " << mkl_kernel.isAdaptive(0) << endl;
cout << "mkl_kernel.isAdaptive(1): " << mkl_kernel.isAdaptive(1) << endl;
cout << "mkl_kernel.isAdaptive(2): " << mkl_kernel.isAdaptive(2) << endl;
cout << "mkl_kernel.numberOfParameters(): " << mkl_kernel.numberOfParameters() << endl;
cout << "mkl_kernel.parameterVector(): " << mkl_kernel.parameterVector() << endl;
cout << "mkl_kernel.eval( dataset.element(0), dataset.element(1) ): " << mkl_kernel.eval( dataset.element(0), dataset.element(1) ) << endl << endl;

It behaves similar to what we saw from the previous kernels. Next we make all sub-parameters (i.e., the RbfKernel’s bandwidth) adaptive and change two parameters:

// change something
mkl_kernel.setAdaptiveAll(true);
RealVector new_params_4( mkl_kernel.numberOfParameters() );
new_params_4(0) = 1.0;
new_params_4(2) = 0.2;
mkl_kernel.setParameterVector( new_params_4 );

Code to examine the outcome:

// examine effects
cout << "mkl_kernel.isAdaptive(0): " << mkl_kernel.isAdaptive(0) << endl;
cout << "mkl_kernel.isAdaptive(1): " << mkl_kernel.isAdaptive(1) << endl;
cout << "mkl_kernel.isAdaptive(2): " << mkl_kernel.isAdaptive(2) << endl;
cout << "mkl_kernel.numberOfParameters(): " << mkl_kernel.numberOfParameters() << endl;
cout << "mkl_kernel.parameterVector(): " << mkl_kernel.parameterVector() << endl;
cout << "mkl_kernel.eval( dataset.element(0), dataset.element(1) ): " << mkl_kernel.eval( dataset.element(0), dataset.element(1) ) << endl << endl;

We would expect the kernel evaluations to yield:

\[\begin{split}\frac{( 1.0*\exp(-0.1*1.0) + 1.0*0.6 + 1.0*(-0.5*1.0) )} {( 1.0 + 1.0 + 1.0 )} = 0.334946 \\ \frac{( 1.0*\exp(-0.2*1.0) + e*0.6 + 1.0*(-0.5*1.0) )} {( 1.0 + e + 1.0 )} = 0.413222\end{split}\]

Both values are exactly what we get from the code’s output:

mkl_kernel.isAdaptive(0): 0
mkl_kernel.isAdaptive(1): 0
mkl_kernel.isAdaptive(2): 0
mkl_kernel.numberOfParameters(): 2
mkl_kernel.parameterVector(): [2](0,0)
mkl_kernel.eval( dataset.element(0), dataset.element(1) ): 0.334946

mkl_kernel.isAdaptive(0): 1
mkl_kernel.isAdaptive(1): 1
mkl_kernel.isAdaptive(2): 1
mkl_kernel.numberOfParameters(): 3
mkl_kernel.parameterVector(): [3](1,0,0.2)
mkl_kernel.eval( dataset.element(0), dataset.element(1) ): 0.413222

MKL Kernel Normalization

Since many MKL formulations penalize the (\(l_p\)-) norm of the kernel weights, the optimization objective could always be improved by substituting the base kernels for a common multiple of themselves. For this reason, the (\(l_p\)-) norm is usually constrained to a certain value or value range. Similarly, rescaling of individual kernels (as opposed to changing their associated kernel weight) can influence the solution found by MKL algorithms. Canonical MKL formulations hence rely on normalization of the kernel (or data) to unit interval in feature space. Although Shark does not currently offer a canonical MKL SVM algorithm, we provide a trainer for “multiplicative normalization” of a MklKernel function (see [Kloft2011]).

In detail, the ScaledKernel wraps an existing kernel to multiply it by a fixed constant. The NormalizeKernelUnitVariance class is a Trainer which initializes this scaling factor of the ScaledKernel. To normalize a kernel to unit variance in feature space, we first create and fill an example dataset of 200 9-dimensional samples with random content:

std::size_t num_dims = 9;
std::size_t num_points = 200;
std::vector<RealVector> input(num_points);
RealVector v(num_dims);
for ( std::size_t i=0; i<num_points; i++ ) {
    for ( std::size_t j=0; j<num_dims; j++ )
        v(j) = random::uni(random::globalRng, -1,1);
    input[i] = v;
}
UnlabeledData<RealVector> rand_data = createDataFromRange( input );

Now let’s say we have the following three member kernels and want to build an LKC from them:

// declare kernels
DenseRbfKernel         unnormalized_kernel1(0.1);
DenseLinearKernel      unnormalized_kernel2;
DensePolynomialKernel  unnormalized_kernel3(2, 1.0);
// declare indices
std::vector< std::pair< std::size_t, std::size_t > > indices;
indices.push_back( std::make_pair( 0,3 ) );
indices.push_back( std::make_pair( 3,6 ) );
indices.push_back( std::make_pair( 6,9 ) );

From the first kernel, we declare a ScaledKernel, which we then normalize on the given dataset using a NormalizeKernelUnitVariance trainer:

DenseScaledKernel scale( &unnormalized_kernel3 );
NormalizeKernelUnitVariance<> normalizer;
normalizer.train( scale, rand_data );

Note that the kernel does not know about the dataset, but is influenced by it indirectly through the trainer. Now we’re done. We finally examine the results from the scaled kernel and trainer, and also re-calculate the kernel’s variance after normalization by hand to verify that it indeed is equal to 1.0:

std::cout << endl << " ======================= Kernel normalization: ======================= " << std::endl;

std::cout << endl << "Done training. Factor is " << scale.factor() << std::endl;
std::cout << "Mean                   = " << normalizer.mean() << std::endl;
std::cout << "Trace                  = " << normalizer.trace() << std::endl << std::endl;
//check in feature space
double control = 0.0;
for ( std::size_t i=0; i<num_points; i++ ) {
    control += scale.eval(input[i], input[i]);
    for ( std::size_t j=0; j<num_points; j++ ) {
        control -= scale.eval(input[i],input[j]) / num_points;
    }
}
control /= num_points;
std::cout << "Resulting variance of scaled Kernel: " << control << std::endl << std::endl;

This will result in output similar to the following (the first three lines may vary due to the randomized dataset):

Done training. Factor is 0.0677846
Mean                   = 83774.4
Trace                  = 3369.39

Resulting variance of scaled Kernel: 1

In the same way, we could also normalize the other two sub-kernels to unit variance in feature space. Then, we could correct the following code snippet to build a SubrangeKernel from three properly normalized kernels:

std::vector<AbstractKernelFunction<RealVector>* > kernels4;
kernels4.push_back( &unnormalized_kernel1 );
kernels4.push_back( &unnormalized_kernel2 );
kernels4.push_back( &scale );
DenseSubrangeKernel kernel4( kernels4, indices );

Tutorial source code

You can find the aggregated version of this tutorial’s code in examples/Supervised/MklKernelTutorial.cpp (as generated from its according .tpp file).

References

[Gonen2011] (1,2,3)
  1. Gönen, E. Alpaydin: Multiple Kernel Learning Algorithms. Journal of Machine Learning Research 12, 2011.

[Kloft2011] (1,2)
  1. Kloft, U. Brefeld, S. Sonnenburg, A. Zien: \(l_p\)-Norm Multiple Kernel Learning. Journal of Machine Learning Research 12, 2011.