Brereton, Chemometrics

PATTERN RECOGNITION

231

 

 

4.5.1 General Principles

Although there are numerous algorithms in the literature, chemists and statisticians often use a common strategy for classification no matter what algorithm is employed.

4.5.1.1 Modelling the Training Set

The first step is normally to produce a mathematical model between some measurements (e.g. spectra) on a series of objects and their known groups. These objects are called a training set. For example, a training set might consist of the near-infrared spectra of 30 orange juices, 10 known to be from Spain, 10 known to be from Brazil and 10 known to be adulterated. Can we produce a mathematical equation that predicts the class to which an orange juice belongs from its spectrum?

Once this has been done, it is usual to determine how well the model predicts the groups. Table 4.21 illustrates a possible scenario. Of the 30 spectra, 24 are correctly classified, as indicated along the diagonal. Some classes are modelled better than others: nine out of 10 of the Spanish orange juices are correctly classified, but only seven of the Brazilian ones. A parameter representing the percentage correctly classified (%CC) can be calculated. After application of the algorithm, the origin (or class) of each spectrum is predicted; in this case, the overall value of %CC is 80 %. Note that some groups appear to be better modelled than others, but also that the training set is fairly small, so it may not be particularly significant that seven out of 10 are correctly classified in one group compared with nine in another. A difficulty in many real situations is that it can be expensive to perform experiments that result in large training sets. There appears to be some risk of making a mistake, but many spectroscopic techniques are used for screening, and there is a high chance that suspect orange juices (e.g. those adulterated) would be detected, which could then be subjected to further detailed analysis. Chemometrics combined with spectroscopy acts like a 'sniffer dog' at a customs checkpoint trying to detect drugs. The dog may miss some cases, and may even get excited when there are no drugs, but there is a good chance the dog is correct; proof, however, only comes when the suitcase is opened. Spectroscopy is a common method for screening. Further investigations might involve subjecting a small number of samples to intensive, expensive and in some cases commercially or politically sensitive tests, and a laboratory can only afford to look in detail at a portion of samples, just as customs officers do not open every suitcase.
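The %CC values just described follow directly from the confusion matrix of Table 4.21. A minimal sketch of the calculation (rows are known classes, columns predicted classes, in the order Spain, Brazil, Adulterated):

```python
import numpy as np

# Confusion matrix from Table 4.21: rows are known classes,
# columns are predicted classes.
confusion = np.array([
    [9, 0, 1],   # known Spain
    [1, 7, 2],   # known Brazil
    [0, 2, 8],   # known Adulterated
])

correct = np.diag(confusion)                       # correctly classified per class
per_class_cc = 100 * correct / confusion.sum(axis=1)
overall_cc = 100 * correct.sum() / confusion.sum()

print(per_class_cc.tolist())   # [90.0, 70.0, 80.0]
print(overall_cc)              # 80.0
```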

4.5.1.2 Test Sets and Cross-Validation

Training sets normally give fairly good predictions, because the model itself has been formed using these data, but this does not mean that the method is yet

Table 4.21 Predictions from a training set.

                         Predicted
Known          Spain   Brazil   Adulterated   Correct   %CC
Spain            9       0          1            9       90
Brazil           1       7          2            7       70
Adulterated      0       2          8            8       80
Overall                                         24       80


Table 4.22 Predictions from a test set.

                         Predicted
Known          Spain   Brazil   Adulterated   Correct   %CC
Spain            5       3          2            5       50
Brazil           1       6          3            6       60
Adulterated      4       2          4            4       40
Overall                                         15       50

safe to use in practical situations. A recommended next step is to test the quality of predictions using an independent test set. This is a series of samples that has been left out of the original calculations, and is like a ‘blind test’. These samples are assumed to be of unknown class membership at first, then the model from the training set is applied to these extra samples. Table 4.22 presents the predictions from a test set (which does not necessarily need to be the same size as the training set), and we see that now only 50 % are correctly classified so the model is not particularly good. The %CC will almost always be lower for the test set.

Using a test set to determine the quality of predictions is a form of validation. The test set could be obtained, experimentally, in a variety of ways, for example 60 orange juices might be analysed in the first place, and then randomly divided into 30 for the training set and 30 for the test set. Alternatively, the test set could have been produced in an independent laboratory.
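The random division described above can be sketched in a few lines (the sample identifiers here are hypothetical):

```python
import random

# 60 analysed orange juices, divided at random into a training set
# and a test set of 30 each.
samples = [f"juice_{i:02d}" for i in range(60)]
rng = random.Random(0)            # fixed seed, so the split is reproducible
shuffled = samples[:]
rng.shuffle(shuffled)
training_set, test_set = shuffled[:30], shuffled[30:]

assert len(training_set) == len(test_set) == 30
assert not set(training_set) & set(test_set)   # no sample appears in both
```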

A second approach is cross-validation. This technique was introduced in the context of PCA in Section 4.3.3.2, and other applications will be described in Chapter 5, Section 5.6.2, so it is introduced only briefly below. Only a single training set is required: one object (or group of objects) is removed at a time, a model is determined from the remaining samples, and the predicted class membership of the object (or set of objects) left out is then tested. This procedure is repeated until every object has been left out once. For example, it would be possible to produce a class model using 29 orange juices: is the 30th orange juice correctly classified? If so, this counts towards the percentage correctly classified. Then, instead of removing the 30th orange juice, we remove the 29th and see what happens. This is repeated 30 times, which leads to a value of %CC for cross-validation. Normally the cross-validated %CC is lower (worse) than that for the training set. In this context cross-validation is not used to obtain a numerical error, unlike in PCA, but the proportion assigned to correct groups. If the %CC values for the training set, the test set and cross-validation are all very similar, the model is considered a good one; alarm bells ring if the %CC is high for the training set but significantly lower when using one of the two methods of validation. It is recommended that all classification methods be validated in some way, although sometimes there are limitations on the number of samples available. Note that many very different types of cross-validation are available, so the use in this section differs strongly from other applications discussed in this text.
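Leave-one-out cross-validation can be sketched as below; the classifier here is a simple nearest-centroid rule on synthetic data, standing in only for whichever classification method is actually being validated:

```python
import numpy as np

def loo_percent_cc(X, y):
    """Leave-one-out cross-validation: remove each object in turn, rebuild
    the model on the remaining samples, and predict the object left out."""
    hits = 0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        Xtrain, ytrain = X[keep], y[keep]
        # the 'model': the centroid of each class in the reduced training set
        centroids = {c: Xtrain[ytrain == c].mean(axis=0) for c in np.unique(ytrain)}
        pred = min(centroids, key=lambda c: np.linalg.norm(X[i] - centroids[c]))
        hits += pred == y[i]
    return 100 * hits / len(X)

# two well-separated synthetic classes, three objects each
X = np.array([[0.0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]])
y = np.array(["A", "A", "A", "B", "B", "B"])
print(loo_percent_cc(X, y))   # 100.0
```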

4.5.1.3 Improving the Data

If the model is not very satisfactory, there are a number of ways to improve it. The first is to use a different computational algorithm. The second is to modify the existing


method – a common approach might involve wavelength selection in spectroscopy: instead of using an entire spectrum, many wavelengths of which are not very meaningful, can we select the most diagnostic parts of the spectrum? Finally, if all else fails, the analytical technique might not be up to scratch. Sometimes a low %CC may be acceptable in the case of screening; however, if the results are to be used to make a judgement (for example, in classifying pharmaceutical products into 'acceptable' and 'unacceptable' groups), a higher %CC for the validation sets is mandatory. The limits of acceptability are not primarily determined statistically, but according to physical needs.

4.5.1.4 Applying the Model

Once a satisfactory model is available, it can then be applied to unknown samples, using analytical data such as spectra or chromatograms, to make predictions. Usually by this stage, special software is required that is tailor-made for a specific application and measurement technique. The software will also have to determine whether a new sample really fits into the training set or not. One major difficulty is the detection of outliers that belong to none of the previously studied groups, for example if a Cypriot orange juice sample was measured when the training set consists just of Spanish and Brazilian orange juices. In areas such as clinical and forensic science, outlier detection can be important, indeed an incorrect conviction or inaccurate medical diagnosis could be obtained otherwise.

Another important consideration is the stability of the method over time; for example, instruments tend to perform slightly differently every day. Sometimes this can have a serious influence on the classification ability of chemometrics algorithms. One way round this is to perform a small test on the instrument on a regular basis.

However, there have been some significant successes, a major area being in industrial process control using near-infrared spectroscopy. A manufacturing plant may produce samples on a continuous basis, but there are a large number of factors that could result in an unacceptable product. The implications of producing substandard batches may be economic, legal and environmental, so continuous testing using a quick and easy method such as on-line spectroscopy is valuable for the rapid detection of whether a process is going wrong. Chemometrics can be used to classify the spectra into acceptable or otherwise, and so allow the operator to close down a manufacturing plant in real time if it looks as if a batch can no longer be assigned to the group of acceptable samples.

4.5.2 Discriminant Analysis

Most traditional approaches to classification in science are called discriminant analysis and are often also called forms of ‘hard modelling’. The majority of statistically based software packages such as SAS, BMDP and SPSS contain substantial numbers of procedures, referred to by various names such as linear (or Fisher) discriminant analysis and canonical variates analysis. There is a substantial statistical literature in this area.

4.5.2.1 Univariate Classification

The simplest form of classification is univariate, where one measurement or variable is used to divide objects into groups. An example may be a blood alcohol reading.


If a reading on a meter in a police station is above a certain level, then the suspect will be prosecuted for drink driving, otherwise not. Even in such a simple situation, there can be ambiguities, for example measurement errors and metabolic differences between people.

4.5.2.2 Multivariate Models

More often, several measurements are required to determine the group to which a sample belongs. Consider performing two measurements and producing a graph of the values of these measurements for two groups, as in Figure 4.25. The objects represented

[Figure 4.25 Discrimination between two classes, and projections (labels: class A, class B, line 1, line 2, class centres)]


by circles are clearly distinct from those represented by squares, but neither of the two measurements alone can discriminate between these groups, and therefore both are essential for classification. It is possible, however, to draw a line between the two groups. If above the line, an object belongs to class A, otherwise to class B.

Graphically this can be represented by projecting the objects on to a line at right angles to the discriminating line, as demonstrated in the figure. The projection can now be converted to a position along line 2 of the figure. The distance can be converted to a number, analogous to a ‘score’. Objects with lower values belong to class A, whereas those with higher values belong to class B. It is possible to determine class membership simply according to whether the value is above or below a divisor. Alternatively, it is possible to determine the centre of each class along the projection and if the distance to the centre of class A is greater than that to class B, the object is placed in class A, and vice versa, but this depends on each class being roughly equally diffuse.
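The projection just described can be sketched as follows. The direction used here is simply the line joining the two class centres (the discriminant function of Section 4.5.2.3 refines this by weighting with the inverse pooled variance–covariance matrix), and the data are synthetic:

```python
import numpy as np

def projection_scores(X, y):
    """Project each object on to the line joining the two class centres and
    assign it to the class with the nearer projected centre."""
    cA = X[y == "A"].mean(axis=0)
    cB = X[y == "B"].mean(axis=0)
    u = (cB - cA) / np.linalg.norm(cB - cA)   # unit vector along the projection line
    scores = X @ u                            # the 'score' of each object on the line
    pred = np.where(np.abs(scores - cA @ u) < np.abs(scores - cB @ u), "A", "B")
    return scores, pred

# small synthetic example: neither measurement alone separates the groups,
# but the projection on to the line joining the centres does
X = np.array([[1.0, 3], [2, 4], [5, 7], [3, 1], [4, 2], [7, 5]])
y = np.array(["A", "A", "A", "B", "B", "B"])
scores, pred = projection_scores(X, y)
print(pred.tolist())   # ['A', 'A', 'A', 'B', 'B', 'B']
```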

It is not always possible to divide the classes exactly into two groups by this method (see Figure 4.26), but the misclassified samples are far from the centre of both classes, with two class distances that are approximately equal. It would be possible to define a boundary towards the centre of the overall dataset, where classification is deemed to be ambiguous.

The data can also be presented in the form of a distance plot, where the two axes are the distances to the centres of the projections of each class, as presented in Figure 4.27. This figure probably does not tell us much that cannot be derived from Figure 4.26. However, the raw data actually consist of more than one measurement, and it is possible to calculate the Euclidean class distance using the raw two-dimensional information, by computing the centroids of each class in the raw data rather than in the one-dimensional projection. Now the points can fall anywhere on a plane, as illustrated in Figure 4.28. This graph is often called a class distance plot and can be divided into four regions:

1. top left: almost certainly class A;
2. bottom left: possibly a member of both classes, but it might be that we do not have enough information;
3. bottom right: almost certainly class B;
4. top right: unlikely to be a member of either class, sometimes called an outlier.
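These four regions can be expressed as a simple rule on the two class distances; the cut-off values, which in practice come from statistical considerations, are supplied here as plain illustrative numbers:

```python
def class_distance_region(d_to_A, d_to_B, limit_A=2.0, limit_B=2.0):
    """Place an object in one of the four regions of a class distance plot
    (Figure 4.28), given its distances to the two class centres; the limits
    are illustrative stand-ins for statistically derived class boundaries."""
    in_A = d_to_A <= limit_A   # within the boundary of class A
    in_B = d_to_B <= limit_B   # within the boundary of class B
    if in_A and in_B:
        return "ambiguous: possibly a member of both classes"
    if in_A:
        return "almost certainly class A"
    if in_B:
        return "almost certainly class B"
    return "outlier: unlikely to be a member of either class"

print(class_distance_region(0.5, 4.1))   # almost certainly class A
print(class_distance_region(3.8, 3.9))   # outlier: unlikely to be a member of either class
```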

[Figure 4.26 Discrimination where exact cut-off is not possible (labels as in Figure 4.25: class A, class B, line 1, line 2, class centres)]


[Figure 4.27 Distance plot to class centroids of the projection in Figure 4.26 (regions labelled Class A, Class B and Ambiguous)]

[Figure 4.28 Class distance plot (axes: class distances to the centre of class A and to the centre of class B)]

In chemistry, these four divisions are perfectly reasonable. For example, if we try to use spectra to classify compounds into ketones and esters, there may be some compounds that are both or neither. If, on the other hand, there are only two possible classifications, for example whether a manufacturing sample is acceptable or not, the conclusion about objects in the bottom left or top right is that the analytical data are not good enough to allow us to assign a sample conclusively to a group. This is a valuable conclusion: for example, it is helpful to tell a laboratory that its clinical diagnosis or forensic test is inconclusive and that, if better evidence is wanted, more experiments or analyses should be performed.

4.5.2.3 Mahalanobis Distance and Linear Discriminant Functions

Previously we discussed the use of different similarity measures in cluster analysis (Section 4.4.1), including various approaches for determining the distance between


objects. Many chemometricians use the Mahalanobis distance, sometimes called the ‘statistical’ distance, between objects, and we will expand on the concept below.

In areas such as spectroscopy it is normal that some wavelengths or regions of the spectra are more useful than others for discriminant analysis. This is especially true in near-infrared (NIR) spectroscopy. Also, different parts of a spectrum might be of very different intensities. Finally, some classes are more diffuse than others. A good example is in forensic science, where forgeries often have a wider dispersion than legitimate objects. A forger might work in his or her back room or garage, and there can be a considerable spread in quality, whereas the genuine article is probably manufactured under much stricter specifications. Hence a large deviation from the mean may not be significant in the case of a class of forgeries. The Mahalanobis distance takes this information into account. With a Euclidean distance, each measurement assumes equal significance, so correlated variables, which may represent an irrelevant feature, can have a disproportionate influence on the analysis.

In supervised pattern recognition, a major aim is to define the distance of an object from the centre of a class. There are two principal uses of statistical distances. The first is to obtain a measurement analogous to a score, often called the linear discriminant function, first proposed by the statistician R. A. Fisher. This differs from the distance above in that it is a single number if there are only two classes. It is analogous to the distance along line 2 in Figure 4.26, but is defined by

 

fi = (x̄A − x̄B) · CAB⁻¹ · xi′

where

 

CAB = [(NA − 1)CA + (NB − 1)CB] / (NA + NB − 2)

which is often called the pooled variance–covariance matrix, and can be extended to any number of groups; NA represents the number of objects in group A, and CA the variance–covariance matrix for this group (whose diagonal elements correspond to the variances of the variables and the off-diagonal elements to the covariances – use the population rather than the sample formula), with x̄A the corresponding centroid. Note that the mathematics becomes more complex if there are more than two groups. This function can take on negative values.

The second is to determine the Mahalanobis distance to the centroid of any given group, a form of class distance. There will be a separate distance to the centre of each group defined, for class A, by

diA = √[(xi − x̄A) · CA⁻¹ · (xi − x̄A)′]

where xi is a row vector for sample i and x̄A is the mean measurement (or centroid) for class A. This measures the scaled distance to the centroid of a class, analogous to Figure 4.28, but scaling the variables using the Mahalanobis rather than the Euclidean criterion.

An important difficulty with using this distance is that the number of objects must be significantly larger than the number of measurements. Consider the case of the Mahalanobis distance being used to determine within-group distances. If there are J measurements, then there must be at least J + 2 objects for there to be any discrimination. If there


are fewer than J + 1 objects, the variance–covariance matrix will not have an inverse. If there are exactly J + 1 objects, the estimated squared distance to the centre of the cluster will equal J for each object, no matter what its position in the group, so discrimination will only be possible if the class consists of at least J + 2 objects, unless some measurements are discarded or combined. This is illustrated for a simple dataset in Table 4.23 (as can be verified computationally).

Note how the average squared distance from the mean of the dataset over all objects always equals 5 (= J), no matter how large the group. It is important always to understand the fundamental properties of this distance measure, especially in spectroscopy or chromatography, where there are usually a large number of potential variables, which must first be reduced, sometimes by PCA.
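This property is easy to verify computationally: with exactly J + 1 objects and the population variance–covariance matrix, every squared Mahalanobis distance to the centre equals J, whatever the data. A sketch with random numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
J = 5                                  # number of measurements, as in Table 4.23
X = rng.normal(size=(J + 1, J))        # exactly J + 1 objects

centred = X - X.mean(axis=0)
C = centred.T @ centred / len(X)       # population variance-covariance matrix
d2 = np.einsum("ij,jk,ik->i", centred, np.linalg.inv(C), centred)
print(d2)   # every squared distance is (numerically) equal to 5, i.e. J
```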

We will illustrate the methods using a simple numerical example (Table 4.24), consisting of 19 samples, the first nine of which are members of group A and the remaining 10 of group B. The data are presented in Figure 4.29. Although the top left-hand

Table 4.23 Squared Mahalanobis distance from the centre of a dataset as increasing numbers of objects are included.

Object    A     B     C     D     E     6 objects   7 objects   8 objects   9 objects
1        0.9   0.5   0.2   1.6   1.5        5         5.832       6.489       7.388
2        0.3   0.3   0.6   0.7   0.1        5         1.163       1.368       1.659
3        0.7   0.7   0.1   0.9   0.5        5         5.597       6.508       6.368
4        0.1   0.4   1.1   1.3   0.2        5         5.091       4.457       3.938
5        1     0.7   2.6   2.1   0.4        5         5.989       6.821       7.531
6        0.3   0.1   0.5   0.5   0.1        5         5.512       2.346       2.759
7        0.9   0.1   0.5   0.6   0.7                  5.817       5.015       4.611
8        0.3   1.2   0.7   0.1   1.4                              6.996       7.509
9        1     0.7   0.6   0.5   0.9                                          3.236

Table 4.24 Example of discriminant analysis.

Class   Sample    x1    x2
A          1      79   157
A          2      77   123
A          3      97   123
A          4     113   139
A          5      76    72
A          6      96    88
A          7      76   148
A          8      65   151
A          9      32    88
B         10     128   104
B         11      65    35
B         12      77    86
B         13     193   109
B         14      93    84
B         15     112    76
B         16     149   122
B         17      98    74
B         18      94    97
B         19     111    93

 

 

 

 


[Figure 4.29 Graph of data in Table 4.24: class A is indicated by diamonds and class B by circles (x1 on the horizontal axis, x2 on the vertical axis, with the 19 samples labelled by number)]

corner corresponds mainly to group A and the bottom right-hand corner to group B, no single measurement is able to discriminate; there is a region in the centre where both classes are represented, and it is not possible to draw a line that unambiguously distinguishes the two classes.

The calculation of the linear discriminant function is presented in Table 4.25 and the values are plotted in Figure 4.30. It can be seen that objects 5, 6, 12 and 18 are not easy to classify. The centroid of each class along the linear discriminant function can be calculated, and the distance from these centroids used for classification; however, this would result in a diagram comparable to Figure 4.27, losing information obtained by taking two measurements.

The class distances using both variables and the Mahalanobis method are presented in Table 4.26. The predicted class for each object is the one whose centroid it is closest to. Objects 5, 6, 12 and 18 are still misclassified, making a %CC of 79 %. However, it is not always a good idea to make hard and fast deductions, as discussed above, as in certain situations an object could belong to two groups simultaneously (e.g. a compound having two functionalities), or the quality of the analytical data may be insufficient for classification. The class distance plot is presented in Figure 4.31. The data are better spread out compared with Figure 4.29 and there are no objects in the top right-hand corner; the misclassified objects are in the bottom left-hand corner. The boundaries for each class can be calculated using statistical considerations, and are normally available from most packages. Depending on the aim of the analysis, it is possible to select samples that are approximately equally far from the centroids of both classes and either reject them or subject them to more measurements.
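The calculations in Tables 4.25 and 4.26 can be reproduced in a few lines. This sketch follows the conventions stated above (population variance–covariance matrix within each class, weighted by N − 1 in the pooled matrix); small differences from the printed values are due to rounding:

```python
import numpy as np

# Data of Table 4.24: samples 1-9 are class A, samples 10-19 class B
X = np.array([
    [79, 157], [77, 123], [97, 123], [113, 139], [76, 72], [96, 88],
    [76, 148], [65, 151], [32, 88],
    [128, 104], [65, 35], [77, 86], [193, 109], [93, 84], [112, 76],
    [149, 122], [98, 74], [94, 97], [111, 93],
], dtype=float)
y = np.array(["A"] * 9 + ["B"] * 10)

def pop_cov(Z):
    """Population (divide-by-N) variance-covariance matrix."""
    c = Z - Z.mean(axis=0)
    return c.T @ c / len(Z)

A, B = X[y == "A"], X[y == "B"]
CA, CB = pop_cov(A), pop_cov(B)
CAB = ((len(A) - 1) * CA + (len(B) - 1) * CB) / (len(A) + len(B) - 2)

# linear discriminant function fi = (centroid A - centroid B).CAB^-1.xi'
w = (A.mean(axis=0) - B.mean(axis=0)) @ np.linalg.inv(CAB)
f = X @ w
print(np.round(w, 4))      # [-0.0765  0.0909]
print(round(f[0], 2))      # 8.23, the value for sample 1

# Mahalanobis distance to each class centroid, and nearest-centroid prediction
def class_distance(x, Z, C):
    d = x - Z.mean(axis=0)
    return np.sqrt(d @ np.linalg.inv(C) @ d)

pred = np.array(["A" if class_distance(x, A, CA) < class_distance(x, B, CB)
                 else "B" for x in X])
print((np.where(pred != y)[0] + 1).tolist())   # [5, 6, 12, 18] misclassified
print(round(100 * np.mean(pred == y)))         # 79 (%CC)
```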


Table 4.25 Calculation of discriminant function for data in Table 4.24.

                             Class A                Class B
Covariance matrix       466.22   142.22        1250.2    588.1
                        142.22   870.67         588.1    512.8
Centroid                    79   121             112      88

(x̄A − x̄B)              (−33, 33)

CAB                     881.27   378.28
                        378.28   681.21

CAB⁻¹                    0.00149  −0.00083
                        −0.0008    0.00198

(x̄A − x̄B)·CAB⁻¹        (−0.0765, 0.0909)

Linear discriminant function:

Sample     fi         Sample     fi
1          8.23       11        −1.79
2          5.29       12         1.93
3          3.77       13        −4.85
4          4.00       14         0.525
5          0.73       15        −1.661
6          0.66       16        −0.306
7          7.646      17        −0.77
8          8.76       18         1.63
9          5.556      19        −0.03
10        −0.336

 

 

 

 

[Figure 4.30 Linear discriminant function (the 19 samples plotted by number along a single axis, from approximately −6 to 10)]

Whereas the results in this section could probably be obtained fairly easily by inspecting the original data, numerical values of class membership have been obtained which can be converted into probabilities, assuming that the measurement error is normally distributed. In most real situations, there will be a much larger number of measurements, and discrimination (e.g. by spectroscopy) is not easy to visualise without further data analysis. Statistics such as %CC can readily be obtained from the data, and it is also possible to classify unknowns or validation samples as discussed in Section 4.5.1 by this means. Many chemometricians use the Mahalanobis distance as defined above, but the normal Euclidean distance or a wide range of other measures can also be employed, if justified by the data, just as in cluster analysis.
