PATTERN RECOGNITION | 241
Table 4.26 Mahalanobis distances from class centroids.

Object | Distance to class A | Distance to class B | True classification | Predicted classification
1      | 1.25 | 5.58 | A | A
2      | 0.13 | 3.49 | A | A
3      | 0.84 | 2.77 | A | A
4      | 1.60 | 3.29 | A | A
5      | 1.68 | 1.02 | A | B
6      | 1.54 | 0.67 | A | B
7      | 0.98 | 5.11 | A | A
8      | 1.36 | 5.70 | A | A
9      | 2.27 | 3.33 | A | A
10     | 2.53 | 0.71 | B | B
11     | 2.91 | 2.41 | B | B
12     | 1.20 | 1.37 | B | A
13     | 5.52 | 2.55 | B | B
14     | 1.57 | 0.63 | B | B
15     | 2.45 | 0.78 | B | B
16     | 3.32 | 1.50 | B | B
17     | 2.04 | 0.62 | B | B
18     | 1.21 | 1.25 | B | A
19     | 1.98 | 0.36 | B | B
Figure 4.31
Class distance plot: horizontal axis = distance to class A, vertical axis = distance to class B (objects labelled by number)
242 | CHEMOMETRICS
4.5.2.4 Extending the Methods
There are numerous extensions to the basic method in the section above, some of which are listed below.
• Probabilities of class membership and class boundaries or confidence intervals can be constructed, assuming a multivariate normal distribution.
• Discriminatory power of variables. This is described in Section 4.5.3 in the context of SIMCA, but similar parameters can be obtained for any method of discrimination. This can have practical consequences; for example, if variables are expensive to measure it is possible to reduce the expense by making only the most useful measurements.
• The method can be extended to any number of classes. It is easy to calculate the Mahalanobis distance to several classes and determine the most appropriate classification of an unknown sample simply by finding the smallest class distance. More discriminant functions are required, however, and the computation can become rather complicated.
• Instead of using raw data, it is possible to use the PCs of the data. This acts as a form of variable reduction, but also simplifies the distance measures, because the variance–covariance matrix will contain nonzero elements only on the diagonal. The expressions for Mahalanobis distance and linear discriminant functions simplify dramatically.
• One interesting modification involves the use of Bayesian classification functions. Such concepts have been introduced previously (see Chapter 3) in the context of signal analysis. The principle is that membership of each class has a predefined (prior) probability, and the measurements are primarily used to refine this. If we are performing a routine screen of blood samples for drugs, the prior probability that any one sample contains a given drug might be very low (less than 1 %). In other situations an expert might already have done preliminary tests and suspects that a sample belongs to a certain class, so the prior probability of class membership is much higher than in the population as a whole; this can be taken into account. Normal statistical methods assume that there is an equal chance of membership of each class, but the distance measure can be modified as follows:

d_iA² = (x_i − x̄_A) · C_A⁻¹ · (x_i − x̄_A)′ − 2 ln(q_A)

where q_A is the prior probability of membership of class A; a higher prior probability therefore reduces the effective distance to that class. The prior probabilities of membership of all relevant classes must add up to 1. It is even possible to use the Bayesian approach to refine class membership as more data are acquired. Screening will provide a probability of class membership, which can then be used as the prior probability for a more rigorous test, and so on.
• Several other functions, such as the quadratic and regularised discriminant functions, have been proposed and are suitable under certain circumstances.
It is always important to examine each specific paper and software package for details on the parameters that have been calculated. In many cases, however, fairly straightforward approaches suffice.
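The prior-adjusted Mahalanobis classifier outlined above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the book's own code: the two-class synthetic data, the class names and the priors are assumptions made for the example.

```python
import numpy as np

def mahalanobis_sq(x, centroid, cov):
    # Squared Mahalanobis distance (x - xbar) C^-1 (x - xbar)'
    d = x - centroid
    return float(d @ np.linalg.inv(cov) @ d)

def bayes_classify(x, classes, priors):
    # Smallest prior-adjusted squared distance d^2 - 2 ln(q) wins;
    # with equal priors this reduces to the ordinary Mahalanobis rule
    scores = {name: mahalanobis_sq(x, c, S) - 2.0 * np.log(priors[name])
              for name, (c, S) in classes.items()}
    return min(scores, key=scores.get), scores

# Synthetic two-class training data (for illustration only)
rng = np.random.default_rng(0)
A = rng.normal([0.0, 0.0], 1.0, (50, 2))
B = rng.normal([3.0, 3.0], 1.0, (50, 2))
classes = {"A": (A.mean(axis=0), np.cov(A, rowvar=False)),
           "B": (B.mean(axis=0), np.cov(B, rowvar=False))}

label, scores = bayes_classify(np.array([0.5, 0.2]), classes,
                               {"A": 0.5, "B": 0.5})
print(label)  # the point lies well inside class A
```

With equal priors the −2 ln(q) terms cancel when comparing classes; lowering the prior for one class penalises its distance and shifts borderline objects to the other class.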
4.5.3 SIMCA
The SIMCA method, first advocated by S. Wold in the early 1970s, is regarded by many as a form of soft modelling used in chemical pattern recognition. Although there are some differences from linear discriminant analysis as employed in traditional statistics, the distinction is not as radical as many would believe. However, SIMCA has an important role in the history of chemometrics, so it is important to understand the main steps of the method.
4.5.3.1 Principles
The acronym SIMCA stands for soft independent modelling of class analogy. The idea of soft modelling is illustrated in Figure 4.32. Two classes can overlap (and hence are ‘soft’), and there is no problem with an object belonging to both (or neither) class simultaneously. In most other areas of statistics, we insist that an object belongs to a discrete class, hence the concept of hard modelling. For example, a biologist trying to determine the sex of an animal from circumstantial evidence (e.g. urine samples) knows that the animal cannot belong to two sexes at the same time, and a forensic scientist trying to determine whether a banknote is forged or not knows
Figure 4.32
Two overlapping classes
that there can be only one true answer: if this appears not to be so, the problem lies with the quality of the evidence. The philosophy of soft modelling is that, in many situations in chemistry, it is entirely legitimate for an object to fit into more than one class simultaneously, for example a compound may have an ester and an alkene group, and so will exhibit spectroscopic characteristics of both functionalities, so a method that assumes that the answer must be either an ester or an alkene is unrealistic. In practice, though, it is possible to calculate class distances from discriminant analysis (Section 4.5.2.3) that are close to two or more groups.
Independent modelling of classes, however, is a more useful feature. After making a number of measurements on ketones and alkenes, we may decide to include amides in the model. Figure 4.33 represents the effect of this additional class, which can be added independently to the existing model without any changes. In contrast, using classical discriminant analysis the entire modelling procedure must be repeated if extra groups are added, since the pooled variance–covariance matrix must be recalculated.
4.5.3.2 Methodology
The main steps of SIMCA are as follows.
Principal Components Analysis
Each group is independently modelled using PCA. Note that each group could be described by a different number of PCs. Figure 4.34 represents two groups, each
Figure 4.33
Three overlapping classes
Figure 4.34
Two groups obtained from three measurements
characterised by three raw measurements, e.g. chromatographic peak heights or physical properties. However, one group falls mainly on a straight line, defined as the first principal component of the group, whereas the other falls roughly on a plane whose axes are defined by the first two principal components of this group. When we perform discriminant analysis we can also use PCs prior to classification, but the difference is that the PCs are of the entire dataset (which may consist of several groups) rather than of each group separately.
A number of methods have been proposed for determining how many PCs are best suited to describe a class; these have been described in Section 4.3.3. The original advocates of SIMCA used cross-validation, but there is no reason why one cannot ‘pick and mix’ various steps in different methods.
Class Distance
The class distance can be calculated as the geometric distance from the PC models; see the illustration in Figure 4.35. The unknown is much closer to the plane than to the line, and so is tentatively assigned to this class. A more elaborate approach is often employed in which each group is bounded by a region of space which represents 95 % confidence that a particular object belongs to the class. Hence geometric class distances can be converted to statistical probabilities.
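The geometric class distance can be sketched as the norm of the part of an object that a class PC model fails to reconstruct. The following NumPy version is illustrative only: the synthetic data mimicking Figure 4.34 (one class near a line, the other near a plane) and the unknown point are assumptions, not the book's values.

```python
import numpy as np

def pc_model(X, n_components):
    # PC model of one class: mean vector plus loadings from SVD
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def class_distance(x, mean, P):
    # Geometric (orthogonal) distance of x from the class PC model:
    # the norm of whatever the model cannot reconstruct
    xc = x - mean
    residual = xc - (xc @ P.T) @ P
    return float(np.linalg.norm(residual))

# Synthetic data mimicking Figure 4.34: class A lies near a line
# (one PC), class B near a plane (two PCs), in three measurements
rng = np.random.default_rng(1)
A = rng.normal(size=(40, 1)) @ np.array([[1.0, 2.0, 0.5]]) \
    + 0.05 * rng.normal(size=(40, 3))
B = rng.normal(size=(40, 2)) @ np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]) \
    + np.array([5.0, 5.0, 5.0]) + 0.05 * rng.normal(size=(40, 3))

mA, PA = pc_model(A, 1)
mB, PB = pc_model(B, 2)
unknown = np.array([5.5, 4.5, 5.0])       # sits close to class B's plane
dA = class_distance(unknown, mA, PA)
dB = class_distance(unknown, mB, PB)
print(round(dA, 2), round(dB, 2))         # much smaller distance to class B
```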
Modelling Power
The modelling power of each variable for each separate class is defined by
M_j = 1 − s_j^resid / s_j^raw
Figure 4.35
Class distance of unknown object (represented by an asterisk) to two classes in Figure 4.34
where s_j^raw is the standard deviation of variable j in the raw data and s_j^resid the standard deviation of variable j in the residuals given by

E = X − T · P
which is the difference between the observed data and the PC model for the class. This is illustrated in Table 4.27 for a single class, with the following steps.
1. The raw data consist of seven objects and six measurements, forming a 7 × 6 matrix. The standard deviation of each of the six variables is calculated.
2. PCA is performed on these data and two PCs are retained. The scores (T) and loadings (P) matrices are computed.
3. The original data are estimated by PCA by multiplying T and P together.
4. The residuals, which are the difference between the datasets obtained in steps 1 and 3, are computed (note that there are several alternative ways of doing this). The standard deviation of the residuals for each of the six variables is computed.
5. Finally, using the standard deviations obtained in steps 1 and 4, the modelling power of the variables for this class is obtained.
The modelling power varies between 1 (excellent) and 0 (the variable contributes nothing to the class model). Variables with M below 0.5 are of little use. In this example, it can be seen that variables 4 and 6 are not very helpful, whereas variables 2 and 5 are extremely useful. This information can be used to reduce the number of measurements.
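The five steps can be reproduced numerically from the raw data in Table 4.27. The sketch below assumes an uncentred two-component PCA via SVD (which appears to match the tabulated scores and loadings) and population standard deviations; other software may make slightly different choices.

```python
import numpy as np

# Raw data from Table 4.27 (7 objects x 6 variables)
X = np.array([
    [24.990, 122.326,  68.195, 28.452, 105.156, 48.585],
    [14.823,  51.048,  57.559, 21.632,  62.764, 26.556],
    [28.612, 171.254, 114.479, 43.287, 159.077, 65.354],
    [41.462, 189.286, 113.007, 25.289, 173.303, 50.215],
    [ 5.168,  50.860,  37.123, 16.890,  57.282, 35.782],
    [20.291,  88.592,  63.509, 14.383,  85.250, 35.784],
    [31.760, 151.074,  78.699, 34.493, 141.554, 56.030]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # uncentred PCA
T = U[:, :2] * s[:2]                              # scores, 2 PCs retained
P = Vt[:2]                                        # loadings
E = X - T @ P                                     # residuals
M = 1 - E.std(axis=0) / X.std(axis=0)             # modelling power per variable
print(np.round(M, 3))  # close to [0.711 0.941 0.740 0.558 0.934 0.567]
```

Variables 2 and 5 come out with the highest modelling power, and variables 4 and 6 the lowest, in agreement with step 5 of the table.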
Table 4.27 Calculation of modelling power in SIMCA.

1. Raw data (7 objects × 6 variables)

   24.990  122.326   68.195   28.452  105.156   48.585
   14.823   51.048   57.559   21.632   62.764   26.556
   28.612  171.254  114.479   43.287  159.077   65.354
   41.462  189.286  113.007   25.289  173.303   50.215
    5.168   50.860   37.123   16.890   57.282   35.782
   20.291   88.592   63.509   14.383   85.250   35.784
   31.760  151.074   78.699   34.493  141.554   56.030

   s_j^raw: 10.954  51.941  26.527  9.363  43.142  12.447

2. PCA

   Scores (T, 7 × 2)        Loadings (P, shown as 6 × 2)
   185.319    0.168         0.128  −0.200
   103.375   19.730         0.636  −0.509
   273.044    9.967         0.396   0.453
   288.140  −19.083         0.134   0.453
    92.149   15.451         0.594  −0.023
   144.788    3.371         0.229   0.539
   232.738   −5.179

3. Data estimated by PCA (T · P)

   23.746  117.700   73.458   24.861  110.006   42.551
    9.325   55.652   49.869   22.754   60.906   34.313
   33.046  168.464  112.633   41.028  161.853   67.929
   40.785  192.858  105.454   29.902  171.492   55.739
    8.739   50.697   43.486   19.316   54.342   29.436
   17.906   90.307   58.859   20.890   85.871   34.990
   30.899  150.562   89.814   28.784  138.280   50.536

4. Residuals

    1.244    4.626   −5.263    3.591   −4.850    6.034
    5.498   −4.604    7.689   −1.122    1.858   −7.757
   −4.434    2.790    1.846    2.259   −2.776   −2.575
    0.677   −3.572    7.553   −4.612    1.811   −5.524
   −3.570    0.163   −6.363   −2.426    2.940    6.346
    2.385   −1.715    4.649   −6.508   −0.621    0.794
    0.861    0.512  −11.115    5.709    3.274    5.494

   s_j^resid: 3.164  3.068  6.895  4.140  2.862  5.394

5. Modelling power

   0.711  0.941  0.740  0.558  0.934  0.567
Discriminatory Power
Another measure is how well a variable discriminates between two classes. This is distinct from modelling power – being able to model one class well does not necessarily imply being able to discriminate between two groups effectively. In order to determine this, it is necessary to fit each sample to both class models; for example, sample 1 is fitted to the PC models of both class A and class B. The residual matrices are then calculated, just as for modelling power, but there are now four such matrices:
1.samples in class A fitted to the model of class A;
2.samples in class A fitted to the model of class B;
3.samples in class B fitted to the model of class B;
4.samples in class B fitted to the model of class A.
We would expect matrices 2 and 4 to be a worse fit than matrices 1 and 3. The standard deviations are then calculated for these matrices to give
D_j = √{ [ (s_j^resid class A, model B)² + (s_j^resid class B, model A)² ] / [ (s_j^resid class A, model A)² + (s_j^resid class B, model B)² ] }
The bigger the value, the higher is the discriminatory power. This could be useful information, for example if clinical or forensic measurements are expensive, so allowing the experimenter to choose only the most effective measurements. Discriminatory power can be calculated between any two classes.
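The calculation above can be sketched by fitting each class to both class models and comparing the residual standard deviations. The per-class mean-centring and the five-variable synthetic data below are assumptions made for illustration; implementations differ in such details.

```python
import numpy as np

def class_model(X, a):
    # Per-class PC model: the class mean plus 'a' loadings from SVD
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:a]

def resid_sd(X, model):
    # Per-variable standard deviation of the residuals after fitting
    # the samples in X to a (possibly foreign) class model
    mean, P = model
    Xc = X - mean
    E = Xc - (Xc @ P.T) @ P
    return E.std(axis=0)

def discriminatory_power(XA, XB, mA, mB):
    # D_j: cross-fitted residuals over same-class residuals
    num = resid_sd(XA, mB) ** 2 + resid_sd(XB, mA) ** 2
    den = resid_sd(XA, mA) ** 2 + resid_sd(XB, mB) ** 2
    return np.sqrt(num / den)

# Synthetic example: each class varies along different variables
# (variables 1-2 for A, 3-4 for B); variable 5 is noise in both
rng = np.random.default_rng(3)
XA = rng.normal(size=(30, 1)) @ np.array([[2.0, 2.0, 0.0, 0.0, 0.0]]) \
     + 0.3 * rng.normal(size=(30, 5))
XB = rng.normal(size=(30, 1)) @ np.array([[0.0, 0.0, 2.0, 2.0, 0.0]]) \
     + 0.3 * rng.normal(size=(30, 5))
D = discriminatory_power(XA, XB, class_model(XA, 1), class_model(XB, 1))
print(np.round(D, 2))  # variables 1-4 score well above variable 5
```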
4.5.3.3 Validation
Like all methods for supervised pattern recognition, testing and cross-validation are possible in SIMCA. There is a mystique in the chemometrics literature whereby general procedures are often mistakenly associated with a particular package or algorithm; this is largely because the advocates promote specific strategies that form part of widely used papers or software. It should also be recognised that there are two different needs for validation: the first is to determine the optimum number of PCs for a given class model, and the second is to determine whether unknowns are classified adequately.
4.5.4 Discriminant PLS
An important variant that is gaining popularity in chemometrics circles is called discriminant PLS (DPLS). We describe the PLS method in more detail in Chapter 5, Section 5.4 and also Appendix A.2, but it is essentially another approach to regression, complementary to MLR with certain advantages in particular situations. The principle is fairly simple. For each class, set up a model
ĉ = T · q
where T contains the PLS scores obtained from the original data, q is a vector whose length equals the number of significant PLS components, and ĉ is a class membership function; this is obtained by PLS regression from an original c vector, whose elements have values of 1 if an object is a member of the class and 0 otherwise, and an X matrix consisting of the original preprocessed data. If there are three classes, then a matrix Ĉ, of dimensions I × 3, can be obtained from a training set consisting of I objects; the closer each element is to 1, the more likely an object is to be a member of the particular class. All the normal procedures of training and test sets, and cross-validation, can be used with DPLS, and various extensions have been proposed in the literature.
MLR on to a class membership function would not work well, because the aim of conventional regression is to model the data exactly and it is unlikely that a good relationship could be obtained; PLS does not require an exact fit to the data, so DPLS is more effective. Because we describe the PLS method in detail in
Chapter 5, we will not give a numerical example in this chapter, but if the number of variables is fairly large, approaches such as Mahalanobis distance are not effective unless there is variable reduction either by first using PCA or simply selecting some of the measurements, so the approach discussed in this section is worth trying.
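The DPLS idea can be sketched with a minimal PLS1 routine (NIPALS-style deflation) and a synthetic two-class training set. The data and the choice of two components are assumptions made for illustration only; the full PLS algorithm is described in Chapter 5.

```python
import numpy as np

def pls1(X, c, a):
    # Minimal PLS1 with NIPALS-style deflation: returns scores T and
    # the vector q so that c is approximated by T @ q
    X, c = X.copy(), c.astype(float).copy()
    T, q = [], []
    for _ in range(a):
        w = X.T @ c
        w /= np.linalg.norm(w)      # weights
        t = X @ w                   # scores
        tt = t @ t
        p = X.T @ t / tt            # loadings
        q.append(c @ t / tt)
        T.append(t)
        X -= np.outer(t, p)         # deflate X
        c -= q[-1] * t              # deflate c
    return np.column_stack(T), np.array(q)

# Toy training set: ten objects in class A, ten in class B;
# c holds 1 for members of class A and 0 otherwise
rng = np.random.default_rng(4)
X = np.vstack([rng.normal([2, 2, 0], 0.5, (10, 3)),
               rng.normal([-2, -2, 0], 0.5, (10, 3))])
c = np.array([1] * 10 + [0] * 10)

T, q = pls1(X - X.mean(axis=0), c - c.mean(), 2)
c_hat = T @ q + c.mean()           # predicted class membership
print(np.round(c_hat, 2))          # near 1 for class A rows, near 0 for class B
```

In a multi-class problem the same regression is simply repeated with one membership vector per class, giving the Ĉ matrix described above.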
4.5.5 K Nearest Neighbours
The methods of SIMCA, discriminant analysis and DPLS involve producing statistical models, such as principal components and canonical variates. Nearest neighbour methods are conceptually much simpler, and do not require elaborate statistical computations.
The KNN (or K nearest neighbour) method has been with chemists for over 30 years. The algorithm starts with a number of objects assigned to each class. Figure 4.36 represents five objects belonging to each of two classes, A and B, recorded using two measurements, which may, for example, be chromatographic peak areas or absorption intensities at two wavelengths; the raw data are presented in Table 4.28.
4.5.5.1 Methodology
The method is implemented as follows.
1. Assign a training set to known classes.
2. Calculate the distance of an unknown to all members of the training set (see Table 4.28). Usually a simple ‘Euclidean’ distance is computed (see Section 4.4.1).
3. Rank these in order (1 = smallest distance, and so on).
Figure 4.36
Two groups, class A and class B, plotted against the two measurements x1 and x2
250 | CHEMOMETRICS
Table 4.28 Example for KNN classification: the three closest distances are marked with an asterisk.

Class   | x1    | x2   | Distance to unknown | Rank
A       | 5.77  | 8.86 | 3.86  | 6
A       | 10.54 | 5.21 | 5.76  | 10
A       | 7.16  | 4.89 | 2.39  | 4
A       | 10.53 | 5.05 | 5.75  | 9
A       | 8.96  | 3.23 | 4.60  | 8
B       | 3.11  | 6.04 | 1.91* | 3
B       | 4.22  | 6.89 | 1.84* | 2
B       | 6.33  | 8.99 | 4.16  | 7
B       | 4.36  | 3.88 | 1.32* | 1
B       | 3.54  | 8.28 | 3.39  | 5
Unknown | 4.78  | 5.13 |       |
Figure 4.37
Three nearest neighbours to unknown, plotted against x1 and x2
4. Pick the K smallest distances and see which classes the unknown is closest to; K is usually a small odd number. The case where K = 3 is illustrated in Figure 4.37 for our example: all three nearest objects belong to class B.
5. Take the ‘majority vote’ and use this for classification. Note that if K = 5, one of the five closest objects belongs to class A in this case.
6. Sometimes it is useful to perform KNN analysis for a number of values of K, e.g. 3, 5 and 7, and see whether the classification changes. This can be used to spot anomalies or artefacts.
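The steps above can be verified directly against the data in Table 4.28. This is a minimal sketch of the voting rule (ties, which can arise for even K, are not handled):

```python
import numpy as np

# Training set and unknown from Table 4.28
train = np.array([
    [5.77, 8.86], [10.54, 5.21], [7.16, 4.89], [10.53, 5.05], [8.96, 3.23],  # class A
    [3.11, 6.04], [4.22, 6.89], [6.33, 8.99], [4.36, 3.88], [3.54, 8.28]])   # class B
labels = np.array(list("AAAAABBBBB"))
unknown = np.array([4.78, 5.13])

def knn(x, X, y, k):
    # Majority vote among the k nearest (Euclidean) training objects
    d = np.linalg.norm(X - x, axis=1)
    nearest = y[np.argsort(d)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

print(knn(unknown, train, labels, 3))  # B: the three nearest objects are all class B
```

Running the same call with k = 5 or 7, as suggested in step 6, checks whether the classification is stable.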