Linear Discriminant Analysis

Background

Linear Discriminant Analysis (LDA) is a basic classification method from parametric statistics. It is based on a maximum a posteriori estimate of the class membership under the assumption that the class conditional densities are multi-variate Gaussians having different means but a common covariance matrix. Despite its simplicity, LDA gives surprisingly good results in practice, of course crucially depending on the representation of the input patterns and the underlying distribution. Details can be found, for instance, in [HastieTibshiraniFriedman2008].

Linear Discriminant Analysis in Shark

We assume that a data file in csv format is provided as a command line argument. As a first step, we inspect and split the data as described in the Nearest Neighbor Classification tutorial:

int main(int argc, char **argv) {
        if(argc < 2) {
                cerr << "usage: " << argv[0] << " (filename)" << endl;
                exit(EXIT_FAILURE);
        }
        // read data
        ClassificationDataset data;
        try {
                importCSV(data, argv[1], LAST_COLUMN, ' ');
        }
        catch (...) {
                cerr << "unable to read data from file " <<  argv[1] << endl;
                exit(EXIT_FAILURE);
        }

        cout << "overall number of data points: " << data.numberOfElements() << " "
             << "number of classes: " << numberOfClasses(data) << " "
             << "input dimension: " << inputDimension(data) << endl;

        // split data into training and test set
        ClassificationDataset dataTest = splitAtElement(data, .5 * data.numberOfElements() );
        cout << "training data points: " << data.numberOfElements() << endl;
        cout << "test data points: " << dataTest.numberOfElements() << endl;

Model and learning algorithm

We include the LDA class

#include <shark/Algorithms/Trainers/LDA.h>

and define the LDA trainer

LDA ldaTrainer;

and the linear classifier to be trained by the LDA trainer:

LinearClassifier<> lda;

The model is trained by calling:

ldaTrainer.train(lda, data);

Evaluating the model

After training the model, we can evaluate it. As a performance measure, we consider the standard 0-1 loss. The corresponding risk is the probability of error, the empirical risk is the average classification error. When measured on set of sample patterns, it simply computes the fraction of wrong predictions. We define the loss for unsigned integer labels and create a new data container for the predictions of our model:

Data<unsigned int> prediction;
ZeroOneLoss<unsigned int> loss;

Let’s apply the classifier to the training and the test data:

prediction = lda(data.inputs());
cout << "LDA on training set accuracy: " << 1. - loss(data.labels(), prediction) << endl;
prediction = lda(dataTest.inputs());
cout << "LDA on test set accuracy:     " << 1. - loss(dataTest.labels(), prediction) << endl;

Of course, the accuracy is given by one minus the error. In this example, the (parametric) LDA performs worse than the (non-parametric) nearest neighbor classifier, but still much better than random guessing.

Full example program

The full example program is LDATutorial.cpp.

Sample classification problem: Protein fold prediction

Let us consider the same bioinformatics problem as in the Nearest Neighbor Classification tutorial, namely the prediction of the secondary structure of proteins. The goal is to assign a protein to one out of 27 SCOP fold types [DingDubchak2001a]. We again consider the descriptions of amino-acid sequences provided by [DamoulasGirolami2008a]. The data C.csv provide a description of the amino-acid compositions of 695 proteins together with the corresponding fold type. Each row corresponds to a protein. After downloading the data C.csv we can envoke the example program:

./LDATutorial C.csv

References

[DamoulasGirolami2008a]

T. Damoulas and M. Girolami. Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection. Bioinformatics, 24(10):1264-1270, 2008.

[DingDubchak2001a]

C.H.Q. Ding and I. Dubchak. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17(4):349-358, 2001.

[HastieTibshiraniFriedman2008]

T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning, section 4.3. Springer-Verlag, 2008.