
Table 4.26 Mahalanobis distances from class centroids.

Object   Distance to Class A   Distance to Class B   True classification   Predicted classification
  1             1.25                  5.58                    A                       A
  2             0.13                  3.49                    A                       A
  3             0.84                  2.77                    A                       A
  4             1.60                  3.29                    A                       A
  5             1.68                  1.02                    A                       B
  6             1.54                  0.67                    A                       B
  7             0.98                  5.11                    A                       A
  8             1.36                  5.70                    A                       A
  9             2.27                  3.33                    A                       A
 10             2.53                  0.71                    B                       B
 11             2.91                  2.41                    B                       B
 12             1.20                  1.37                    B                       A
 13             5.52                  2.55                    B                       B
 14             1.57                  0.63                    B                       B
 15             2.45                  0.78                    B                       B
 16             3.32                  1.50                    B                       B
 17             2.04                  0.62                    B                       B
 18             1.21                  1.25                    B                       A
 19             1.98                  0.36                    B                       B

Figure 4.31

Class distance plot: horizontal axis = distance to class A, vertical axis = distance to class B


4.5.2.4 Extending the Methods

There are numerous extensions to the basic method in the section above, some of which are listed below.

Probabilities of class membership and class boundaries or confidence intervals can be constructed, assuming a multivariate normal distribution.

Discriminatory power of variables. This is described in Section 4.5.3 in the context of SIMCA, but similar parameters can be obtained for any method of discrimination. This can have practical consequences; for example, if variables are expensive to measure it is possible to reduce the expense by making only the most useful measurements.

The method can be extended to any number of classes. It is easy to calculate the Mahalanobis distance to several classes, and determine which is the most appropriate classification of an unknown sample, simply by finding the smallest class distance. More discriminant functions are required, however, and the computation can be rather complicated.

Instead of using raw data, it is possible to use the PCs of the data. This acts as a form of variable reduction, but also simplifies the distance measures, because the variance–covariance matrix of the scores will contain nonzero elements only on the diagonal. The expressions for the Mahalanobis distance and the linear discriminant functions then simplify dramatically.

One interesting modification involves the use of Bayesian classification functions. Such concepts have been introduced previously (see Chapter 3) in the context of signal analysis. The principle is that membership of each class has a predefined (prior) probability, and the measurements are primarily used to refine this. If we are performing a routine screen of blood samples for drugs, the prior chance that any given sample contains a drug might be very low (less than 1 %). In other situations an expert might already have done preliminary tests and suspects that a sample belongs to a certain class, so the prior probability of class membership is much higher than in the population as a whole; this can be taken into account. Normal statistical methods assume that there is an equal chance of membership of each class, but the distance measure can be modified as follows (a short illustrative sketch of this weighted classification is given at the end of this list):

$$d_{iA} = (x_i - \bar{x}_A) \cdot C_A^{-1} \cdot (x_i - \bar{x}_A)' - 2\ln(q_A)$$

where qA is the prior probability of membership of class A. The prior probabilities of membership of all relevant classes must add up to 1. It is even possible to use the Bayesian approach to refine class membership as more data are acquired. Screening will provide a probability of class membership which can then be used as prior probability for a more rigorous test, and so on.

Several other functions such as the quadratic and regularised discriminant functions have been proposed and are suitable under certain circumstances.

It is always important to examine each specific paper and software package for details on the parameters that have been calculated. In many cases, however, fairly straightforward approaches suffice.
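As a brief illustration of the basic and prior-weighted distance calculations described above, the following minimal sketch (Python with NumPy; the function and variable names are illustrative, not taken from any chemometrics package) assigns a sample to the class with the smallest Mahalanobis distance, optionally adjusted by a prior probability as in the equation above. With equal priors the adjustment is the same constant for every class, so the ordinary Mahalanobis rule is recovered.

```python
import numpy as np

def mahalanobis_sq(x, centroid, cov):
    """Squared Mahalanobis distance of sample x from a class centroid."""
    diff = x - centroid
    return float(diff @ np.linalg.inv(cov) @ diff)

def classify(x, class_stats, priors=None):
    """Assign x to the class with the smallest (optionally prior-adjusted) distance.

    class_stats : dict mapping class name -> (centroid vector, covariance matrix)
    priors      : optional dict mapping class name -> prior probability q;
                  when supplied, 2 ln(q) is subtracted from the squared distance.
    """
    scores = {}
    for name, (centroid, cov) in class_stats.items():
        d2 = mahalanobis_sq(x, centroid, cov)
        if priors is not None:
            d2 -= 2.0 * np.log(priors[name])
        scores[name] = d2
    return min(scores, key=scores.get), scores
```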


4.5.3 SIMCA

The SIMCA method, first advocated by S. Wold in the early 1970s, is regarded by many as a form of soft modelling used in chemical pattern recognition. Although there are some differences from linear discriminant analysis as employed in traditional statistics, the distinction is not as radical as many would believe. However, SIMCA has an important role in the history of chemometrics, so it is important to understand the main steps of the method.

4.5.3.1 Principles

The acronym SIMCA stands for soft independent modelling of class analogy. The idea of soft modelling is illustrated in Figure 4.32. Two classes can overlap (and hence are ‘soft’), and there is no problem with an object belonging to both (or neither) class simultaneously. In most other areas of statistics, we insist that an object belongs to a discrete class, hence the concept of hard modelling. For example, a biologist trying to determine the sex of an animal from circumstantial evidence (e.g. urine samples) knows that the animal cannot belong to two sexes at the same time, and a forensic scientist trying to determine whether a banknote is forged knows that there can be only one true answer: if this appears not to be so, the problem lies with the quality of the evidence. The philosophy of soft modelling is that, in many situations in chemistry, it is entirely legitimate for an object to fit into more than one class simultaneously; for example, a compound may have an ester and an alkene group, and so will exhibit spectroscopic characteristics of both functionalities, so a method that assumes the answer must be either an ester or an alkene is unrealistic. In practice, though, discriminant analysis (Section 4.5.2.3) can also yield class distances that place an object close to two or more groups.

Figure 4.32

Two overlapping classes

Independent modelling of classes, however, is a more useful feature. After making a number of measurements on ketones and alkenes, we may decide to include amides in the model. Figure 4.33 represents the effect of this additional class, which can be added to the existing model independently, without any changes to the original class models. In contrast, using classical discriminant analysis the entire modelling procedure must be repeated if extra groups are added, since the pooled variance–covariance matrix must be recalculated.

Figure 4.33

Three overlapping classes

4.5.3.2 Methodology

The main steps of SIMCA are as follows.

Principal Components Analysis

Each group is independently modelled using PCA. Note that each group could be described by a different number of PCs. Figure 4.34 represents two groups, each characterised by three raw measurements, e.g. chromatographic peak heights or physical properties. However, one group falls mainly on a straight line, defined by the first principal component of the group, whereas the other falls roughly on a plane whose axes are defined by the first two principal components of that group. When we perform discriminant analysis we can also use PCs prior to classification, but the difference is that those PCs are computed for the entire dataset (which may consist of several groups) rather than for each group separately.

Figure 4.34

Two groups obtained from three measurements

It is important to note that a number of methods have been proposed for determining how many PCs are most suited to describe a class, which have been described in Section 4.3.3. The original advocates of SIMCA used cross-validation, but there is no reason why one cannot ‘pick and mix’ various steps in different methods.

Class Distance

The class distance can be calculated as the geometric distance from the PC models; see the illustration in Figure 4.35. The unknown is much closer to the plane formed by one class than to the line formed by the other, and so is tentatively assigned to the class modelled by the plane. A more elaborate approach is often employed in which each group is bounded by a region of space, representing 95 % confidence that a particular object belongs to the class. Hence geometric class distances can be converted to statistical probabilities.
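A minimal sketch of this geometric class distance, assuming Python with NumPy (the helper names are illustrative, not from any particular SIMCA implementation), fits a PC model to each class and measures how far an unknown lies from that model. It follows the uncentred convention of the worked example in Table 4.27; in practice each class is often mean-centred (and possibly scaled) before PCA.

```python
import numpy as np

def fit_class_model(X_class, n_pcs):
    """Return the loadings of an uncentred PC model fitted to one class."""
    # SVD of the class data; the leading right singular vectors are the loadings
    _, _, Vt = np.linalg.svd(X_class, full_matrices=False)
    return Vt[:n_pcs]                      # shape: n_pcs x n_variables

def class_distance(x, loadings):
    """Geometric (residual) distance of a sample from a class PC model."""
    scores = x @ loadings.T                # project the sample onto the class PCs
    residual = x - scores @ loadings       # part of x the class model cannot explain
    return float(np.linalg.norm(residual))
```

The unknown is then tentatively assigned to the class with the smallest residual distance, or the distance is compared with the spread of residuals of the training objects to attach a confidence level.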

Modelling Power

The modelling power of each variable for each separate class is defined by

$$M_j = 1 - \frac{s_j^{\mathrm{resid}}}{s_j^{\mathrm{raw}}}$$


Figure 4.35

Class distance of unknown object (represented by an asterisk) to two classes in Figure 4.34

where $s_j^{\mathrm{raw}}$ is the standard deviation of variable $j$ in the raw data and $s_j^{\mathrm{resid}}$ is the standard deviation of variable $j$ in the residuals, given by

$$E = X - T \cdot P$$

which is the difference between the observed data and the PC model for the class. This is illustrated in Table 4.27 for a single class, with the following steps.

1. The raw data consist of seven objects and six measurements, forming a 7 × 6 matrix. The standard deviation of each of the six variables is calculated.

2. PCA is performed on these data and two PCs are retained. The scores (T) and loadings (P) matrices are computed.

3. The original data are estimated by the PC model by multiplying T and P together.

4. The residuals, which are the difference between the datasets obtained in steps 1 and 3, are computed (note that there are several alternative ways of doing this). The standard deviation of the residuals for each of the six variables is computed.

5. Finally, using the standard deviations obtained in steps 1 and 4, the modelling power of each variable for this class is obtained.

The modelling power varies between 1 (excellent) and 0 (no discrimination). Variables with M below 0.5 are of little use. In this example, it can be seen that variables 4 and 6 are not very helpful, whereas variables 2 and 5 are extremely useful. This information can be used to reduce the number of measurements.
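The five steps above can be reproduced with a few lines of NumPy; this is a sketch under the conventions that appear to be used in Table 4.27 (uncentred PCA, two PCs and population standard deviations), not a definitive recipe.

```python
import numpy as np

def modelling_power(X_class, n_pcs=2):
    """Modelling power M_j = 1 - s_j(resid)/s_j(raw) for each variable of one class."""
    s_raw = X_class.std(axis=0)             # step 1: standard deviation of the raw data
    U, S, Vt = np.linalg.svd(X_class, full_matrices=False)
    T = U[:, :n_pcs] * S[:n_pcs]            # step 2: scores
    P = Vt[:n_pcs]                          #         loadings
    E = X_class - T @ P                     # steps 3 and 4: residuals X - T.P
    s_resid = E.std(axis=0)
    return 1.0 - s_resid / s_raw            # step 5

# Applied to the 7 x 6 raw data matrix of Table 4.27, this should give values close to
# 0.711, 0.941, 0.740, 0.558, 0.934 and 0.567 for the six variables.
```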


Table 4.27 Calculation of modelling power in SIMCA.

1. Raw data

    24.990   122.326    68.195    28.452   105.156    48.585
    14.823    51.048    57.559    21.632    62.764    26.556
    28.612   171.254   114.479    43.287   159.077    65.354
    41.462   189.286   113.007    25.289   173.303    50.215
     5.168    50.860    37.123    16.890    57.282    35.782
    20.291    88.592    63.509    14.383    85.250    35.784
    31.760   151.074    78.699    34.493   141.554    56.030

   s_j^raw:
    10.954    51.941    26.527     9.363    43.142    12.447

2. PCA (two PCs retained)

   Scores (7 objects x 2 PCs)       Loadings (6 variables x 2 PCs)
    185.319     0.168                 0.128     0.200
    103.375   -19.730                 0.636     0.509
    273.044    -9.967                 0.396    -0.453
    288.140    19.083                 0.134    -0.453
     92.149   -15.451                 0.594     0.023
    144.788    -3.371                 0.229    -0.539
    232.738     5.179

3. Data estimated by PCA (T.P)

    23.746   117.700    73.458    24.861   110.006    42.551
     9.325    55.652    49.869    22.754    60.906    34.313
    33.046   168.464   112.633    41.028   161.853    67.929
    40.785   192.858   105.454    29.902   171.492    55.739
     8.739    50.697    43.486    19.316    54.342    29.436
    17.906    90.307    58.859    20.890    85.871    34.990
    30.899   150.562    89.814    28.784   138.280    50.536

4. Residuals (raw data minus estimated data)

     1.244     4.626    -5.263     3.591    -4.850     6.034
     5.498    -4.604     7.689    -1.122     1.858    -7.757
    -4.434     2.790     1.846     2.259    -2.776    -2.575
     0.677    -3.572     7.553    -4.612     1.811    -5.524
    -3.570     0.163    -6.363    -2.426     2.940     6.346
     2.385    -1.715     4.649    -6.508    -0.621     0.794
     0.861     0.512   -11.115     5.709     3.274     5.494

   s_j^resid:
     3.164     3.068     6.895     4.140     2.862     5.394

5. Modelling power

     0.711     0.941     0.740     0.558     0.934     0.567

Discriminatory Power

Another measure is how well a variable discriminates between two classes. This is distinct from modelling power – being able to model one class well does not necessarily imply being able to discriminate between two groups effectively. In order to determine this, it is necessary to fit each sample to both class models; for example, fit sample 1 to the PC models of both class A and class B. The residual matrices are then calculated, just as for modelling power, but there are now four such matrices:

1. samples in class A fitted to the model of class A;

2. samples in class A fitted to the model of class B;

3. samples in class B fitted to the model of class B;

4. samples in class B fitted to the model of class A.

We would expect matrices 2 and 4 to be a worse fit than matrices 1 and 3. The standard deviations are then calculated for these matrices to give

$$D_j = \sqrt{\frac{\left(s_{j,\ \mathrm{class\ A\ model\ B}}^{\mathrm{resid}}\right)^2 + \left(s_{j,\ \mathrm{class\ B\ model\ A}}^{\mathrm{resid}}\right)^2}{\left(s_{j,\ \mathrm{class\ A\ model\ A}}^{\mathrm{resid}}\right)^2 + \left(s_{j,\ \mathrm{class\ B\ model\ B}}^{\mathrm{resid}}\right)^2}}$$

The bigger the value, the higher the discriminatory power. This can be useful information, for example if clinical or forensic measurements are expensive, allowing the experimenter to choose only the most effective measurements. Discriminatory power can be calculated between any two classes.
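As a rough sketch (again NumPy, with illustrative function names), the four residual matrices and the resulting $D_j$ can be computed from the two class data matrices and their class loadings, obtained for instance as in the earlier class-distance sketch:

```python
import numpy as np

def residual_sd(X, loadings):
    """Per-variable standard deviation of the residuals after fitting X to a PC model."""
    E = X - (X @ loadings.T) @ loadings
    return E.std(axis=0)

def discriminatory_power(XA, XB, PA, PB):
    """D_j for each variable from the four residual matrices (classes A/B fitted to models A/B)."""
    cross = residual_sd(XA, PB) ** 2 + residual_sd(XB, PA) ** 2      # matrices 2 and 4
    self_fit = residual_sd(XA, PA) ** 2 + residual_sd(XB, PB) ** 2   # matrices 1 and 3
    return np.sqrt(cross / self_fit)
```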

4.5.3.3 Validation

As with all methods for supervised pattern recognition, testing and cross-validation are possible in SIMCA. There is a mystique in the chemometrics literature whereby some general procedures are often mistakenly associated with a particular package or algorithm; this is largely because their advocates promote specific strategies as part of widely used papers or software. It should also be recognised that there are two different needs for validation: the first is to determine the optimum number of PCs for a given class model, and the second is to determine whether unknowns are classified adequately.

4.5.4 Discriminant PLS

An important variant that is gaining popularity in chemometrics circles is called discriminant PLS (DPLS). We describe the PLS method in more detail in Chapter 5, Section 5.4 and also Appendix A.2, but it is essentially another approach to regression, complementary to MLR with certain advantages in particular situations. The principle is fairly simple. For each class, set up a model

$$\hat{c} = T \cdot q$$

where T contains the PLS scores obtained from the original data, q is a vector whose length equals the number of significant PLS components, and $\hat{c}$ is a class membership function; this is obtained by PLS regression from an original c vector, whose elements have values of 1 if an object is a member of the class and 0 otherwise, and an X matrix consisting of the original preprocessed data. If there are three classes, then a matrix $\hat{C}$, of dimensions I × 3, can be obtained from a training set consisting of I objects; the closer each element is to 1, the more likely an object is to be a member of the particular class. All the normal procedures of training and test sets, and cross-validation, can be used with DPLS, and various extensions have been proposed in the literature.

MLR regression onto a class membership function would not work well, because the aim of conventional regression is to model the data exactly and it is unlikely that a good relationship could be obtained; PLS does not require an exact fit to the data, so DPLS is more effective. Because we describe the PLS method in detail in Chapter 5, we will not give a numerical example in this chapter. However, if the number of variables is fairly large, approaches such as the Mahalanobis distance are not effective unless the variables are first reduced, either by PCA or by simply selecting some of the measurements, so the approach discussed in this section is worth trying.
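Although the numerical treatment of PLS is deferred to Chapter 5, a minimal DPLS sketch can be written in a few lines; here scikit-learn's PLSRegression is assumed as the regression engine (an implementation choice, not something prescribed by the text), and class membership is coded as 0/1 columns of C.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def dpls_fit_predict(X_train, train_labels, X_test, n_components=2):
    """Regress 0/1 class membership columns on X, then assign each test object
    to the class whose predicted membership value is largest."""
    classes = sorted(set(train_labels))
    # C matrix: one column per class, 1 if the object belongs to that class, else 0
    C = np.array([[1.0 if lab == c else 0.0 for c in classes] for lab in train_labels])
    pls = PLSRegression(n_components=n_components).fit(X_train, C)
    C_hat = pls.predict(X_test)                      # the closer to 1, the more likely
    return [classes[i] for i in np.argmax(C_hat, axis=1)], C_hat
```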

4.5.5 K Nearest Neighbours

The methods of SIMCA, discriminant analysis and DPLS involve producing statistical models, such as principal components and canonical variates. Nearest neighbour methods are conceptually much simpler, and do not require elaborate statistical computations.

The KNN (or K nearest neighbour) method has been with chemists for over 30 years. The algorithm starts with a number of objects assigned to each class. Figure 4.36 represents five objects belonging to each of two classes, A and B, recorded using two measurements, which may, for example, be chromatographic peak areas or absorption intensities at two wavelengths; the raw data are presented in Table 4.28.

4.5.5.1 Methodology

The method is implemented as follows.

1. Assign a training set to known classes.

2. Calculate the distance of an unknown to all members of the training set (see Table 4.28). Usually a simple ‘Euclidean’ distance is computed (see Section 4.4.1).

3. Rank these distances in order (1 = smallest distance, and so on).

Figure 4.36

Two groups


Table 4.28 Example for KNN classification: the three closest distances are marked with an asterisk.

Class      x1      x2     Distance to unknown   Rank
A         5.77    8.86           3.86             6
A        10.54    5.21           5.76            10
A         7.16    4.89           2.39             4
A        10.53    5.05           5.75             9
A         8.96    3.23           4.60             8
B         3.11    6.04           1.91*            3
B         4.22    6.89           1.84*            2
B         6.33    8.99           4.16             7
B         4.36    3.88           1.32*            1
B         3.54    8.28           3.39             5
Unknown   4.78    5.13

Figure 4.37

Three nearest neighbours to unknown

4. Pick the K smallest distances and see which class the unknown is closest to; K is usually a small odd number. The case where K = 3 is illustrated in Figure 4.37 for our example: all three nearest objects belong to class B.

5. Take the ‘majority vote’ and use this for classification. Note that if K = 5, one of the five closest objects belongs to class A in this case.

6. Sometimes it is useful to perform KNN analysis for a number of values of K, e.g. 3, 5 and 7, and see whether the classification changes. This can be used to spot anomalies or artefacts. A short sketch of the whole procedure is given after this list.
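A minimal sketch of steps 1–5, using the data of Table 4.28 (Python with NumPy; the helper name is illustrative):

```python
import numpy as np
from collections import Counter

# Training set of Table 4.28: (x1, x2) measurements and their known classes
X = np.array([[5.77, 8.86], [10.54, 5.21], [7.16, 4.89], [10.53, 5.05], [8.96, 3.23],
              [3.11, 6.04], [4.22, 6.89], [6.33, 8.99], [4.36, 3.88], [3.54, 8.28]])
labels = ["A"] * 5 + ["B"] * 5
unknown = np.array([4.78, 5.13])

def knn_classify(x, X, labels, k=3):
    """Majority vote among the k nearest (Euclidean) neighbours of x."""
    distances = np.linalg.norm(X - x, axis=1)      # step 2: distance to every training object
    nearest = np.argsort(distances)[:k]            # steps 3-4: rank and keep the k smallest
    votes = Counter(labels[i] for i in nearest)    # step 5: majority vote
    return votes.most_common(1)[0][0]

print(knn_classify(unknown, X, labels, k=3))       # 'B' - all three neighbours are class B
print(knn_classify(unknown, X, labels, k=5))       # step 6: repeat with other values of K
```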
