Linear Discriminant Analysis¶
Background¶
Linear Discriminant Analysis (LDA) is a basic classification method from parametric statistics. It is based on a maximum a posteriori estimate of the class membership under the assumption that the class conditional densities are multi-variate Gaussians having different means but a common covariance matrix. Despite its simplicity, LDA gives surprisingly good results in practice, of course crucially depending on the representation of the input patterns and the underlying distribution. Details can be found, for instance, in [HastieTibshiraniFriedman2008].
Linear Discriminant Analysis in Shark¶
We assume that a data file in csv format is provided as a command line argument. As a first step, we inspect and split the data as described in the Nearest Neighbor Classification tutorial:
int main(int argc, char **argv) {
if(argc < 2) {
cerr << "usage: " << argv[0] << " (filename)" << endl;
exit(EXIT_FAILURE);
}
// read data
ClassificationDataset data;
try {
importCSV(data, argv[1], LAST_COLUMN, ' ');
}
catch (...) {
cerr << "unable to read data from file " << argv[1] << endl;
exit(EXIT_FAILURE);
}
cout << "overall number of data points: " << data.numberOfElements() << " "
<< "number of classes: " << numberOfClasses(data) << " "
<< "input dimension: " << inputDimension(data) << endl;
// split data into training and test set
ClassificationDataset dataTest = splitAtElement(data, .5 * data.numberOfElements() );
cout << "training data points: " << data.numberOfElements() << endl;
cout << "test data points: " << dataTest.numberOfElements() << endl;
Model and learning algorithm¶
We include the LDA class
#include <shark/Algorithms/Trainers/LDA.h>
and define the LDA trainer
LDA ldaTrainer;
and the linear classifier to be trained by the LDA trainer:
LinearClassifier<> lda;
The model is trained by calling:
ldaTrainer.train(lda, data);
Evaluating the model¶
After training the model, we can evaluate it. As a performance
measure, we consider the standard 0-1 loss. The corresponding risk is
the probability of error, the empirical risk is the average
classification error. When measured on set of sample patterns, it
simply computes the fraction of wrong predictions.
We define the loss for unsigned integer
labels and
create a new data container for the predictions of our model:
Data<unsigned int> prediction;
ZeroOneLoss<unsigned int> loss;
Let’s apply the classifier to the training and the test data:
prediction = lda(data.inputs());
cout << "LDA on training set accuracy: " << 1. - loss(data.labels(), prediction) << endl;
prediction = lda(dataTest.inputs());
cout << "LDA on test set accuracy: " << 1. - loss(dataTest.labels(), prediction) << endl;
Of course, the accuracy is given by one minus the error. In this example, the (parametric) LDA performs worse than the (non-parametric) nearest neighbor classifier, but still much better than random guessing.
Full example program¶
The full example program is LDATutorial.cpp.
Sample classification problem: Protein fold prediction¶
Let us consider the same bioinformatics problem as in the
Nearest Neighbor Classification tutorial, namely the prediction of the
secondary structure of proteins. The goal is to assign a protein to
one out of 27 SCOP fold types [DingDubchak2001a]. We again consider
the descriptions of amino-acid sequences provided by
[DamoulasGirolami2008a]. The data C.csv
provide a description of the amino-acid compositions of 695 proteins
together with the corresponding fold type. Each row corresponds to a
protein. After downloading the data C.csv
we
can envoke the example program:
./LDATutorial C.csv
References¶
T. Damoulas and M. Girolami. Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection. Bioinformatics, 24(10):1264-1270, 2008.
C.H.Q. Ding and I. Dubchak. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17(4):349-358, 2001.
T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning, section 4.3. Springer-Verlag, 2008.