
Brereton Chemometrics


PATTERN RECOGNITION


 

 

we recommend a single approach. Note that although some steps are common, it is normal to use different criteria when using cross-validation in multivariate calibration as described in Chapter 5, Section 5.6.2. Do not be surprised if different packages provide what appear to be different numerical answers for the estimation of similar parameters – always try to understand what the developer of the software has intended; normally extensive documentation is available.

There are various ways of interpreting these two errors numerically, but a common approach is to compare the PRESS error using a + 1 PCs with the RSS using a PCs. If the former error is significantly larger, then the extra PC is modelling only noise, and so is not significant. Sometimes this is defined mathematically by computing the ratio $\mathrm{PRESS}_a / \mathrm{RSS}_{a-1}$; if this exceeds 1, use a − 1 PCs in the model. If the errors are close in size, it is safest to continue checking further components, and normally there will be a sharp difference when sufficient components have been computed. Often PRESS will start to increase after the optimum number of components has been calculated.
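The ratio criterion can be sketched in a few lines of Python. This is only an illustration: the RSS and PRESS values are copied from Table 4.9(b), the ratio follows that table's column heading (PRESS for a PCs divided by RSS for a − 1 PCs), and the function names are hypothetical.

```python
# RSS for 1..8 PCs and PRESS for 1..7 PCs, copied from Table 4.9(b).
rss = [10110.9, 2786.3, 377.7, 241.7, 123.9, 51.1, 15.0, 0.0]
press = [12412.1, 4551.5, 830.9, 802.2, 683.7, 499.7, 336.2]


def press_rss_ratios(press, rss):
    """Return {a: PRESS_a / RSS_(a-1)} for a = 2, 3, ... (1-based PC numbers)."""
    return {a: press[a - 1] / rss[a - 2] for a in range(2, len(press) + 1)}


def choose_n_pcs(press, rss):
    """Keep a - 1 PCs for the first a whose ratio exceeds 1."""
    for a, ratio in press_rss_ratios(press, rss).items():
        if ratio > 1.0:
            return a - 1
    return len(press)  # no ratio exceeded 1: keep every PC that was tested


ratios = press_rss_ratios(press, rss)
n_pcs = choose_n_pcs(press, rss)
```

With these numbers the first ratio above 1 occurs at the fourth PC, so the rule retains three PCs, in line with the discussion of the simulated dataset.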

It is easiest to understand the principle using a small numerical example. Because the datasets of case studies 1 and 2 are rather large, a simulation will be introduced.

Table 4.7 is for a dataset consisting of 10 objects and eight measurements. In Table 4.8(a) the scores and loadings for eight PCs (the number is limited by the measurements) using only samples 2–10 are presented. Table 4.8(b) illustrates the calculation of the sum of square cross-validated error for sample 1 as an increasing number of PCs is calculated. In Table 4.9(a) these errors are summarised for all samples. In Table 4.9(b) eigenvalues are calculated, together with the residual sum of squares as an increasing number of PCs is computed, for both autopredictive and cross-validated models. The latter can be obtained by summing the rows in Table 4.9(a). RSS decreases continuously, whereas PRESS levels off. This information can be illustrated graphically (see Figure 4.8): the vertical scale is usually presented logarithmically, which takes into account the very high first eigenvalues, usual in cases where the data are uncentred, so that the first eigenvalue is mainly one of size and can appear (falsely) to dominate the data. Using the criterion above, the PRESS value of the fourth PC is greater than the RSS of the third PC, so an optimal model would appear to consist of three PCs. A simple graphical approach, taking the optimum number of PCs to be where the graph of PRESS levels off or increases, would likewise suggest that there are three PCs in the model. Sometimes PRESS values increase after the optimum number of components has been calculated, but this is not so in this example.
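The calculation can be sketched with NumPy as follows. This is a minimal sketch, not the author's own code: it assumes uncentred PCA computed by singular value decomposition, and it assumes that a left-out object is predicted by projecting it onto the loadings obtained from the remaining objects. The function names are hypothetical; the data are those of Table 4.7.

```python
import numpy as np

# Table 4.7: 10 objects (rows) by 8 measurements A-H (columns).
X = np.array([
    [89.821, 59.760, 68.502, 48.099, 56.296, 95.478, 71.116, 95.701],
    [97.599, 88.842, 95.203, 71.796, 97.880, 113.122, 72.172, 92.310],
    [91.043, 79.551, 104.336, 55.900, 107.807, 91.229, 60.906, 97.735],
    [30.015, 22.517, 60.330, 21.886, 53.049, 23.127, 12.067, 37.204],
    [37.438, 38.294, 50.967, 29.938, 60.807, 31.974, 17.472, 35.718],
    [83.442, 48.037, 59.176, 47.027, 43.554, 84.609, 67.567, 81.807],
    [71.200, 47.990, 86.850, 35.600, 86.857, 57.643, 38.631, 67.779],
    [37.969, 15.468, 33.195, 12.294, 32.042, 25.887, 27.050, 37.399],
    [34.604, 68.132, 63.888, 48.687, 86.538, 63.560, 35.904, 40.778],
    [74.856, 36.043, 61.235, 37.381, 53.980, 64.714, 48.673, 73.166],
])


def loadings(data, n_pcs):
    """First n_pcs loadings (as rows) of an uncentred PCA, via SVD."""
    _, _, vt = np.linalg.svd(data, full_matrices=False)
    return vt[:n_pcs]


def sq_error(x, p):
    """Sum of squared residuals after projecting x onto the loadings p."""
    x_hat = (x @ p.T) @ p  # predicted scores, then predicted measurements
    return float(np.sum((x - x_hat) ** 2))


def rss(data, n_pcs):
    """Autopredictive error: the model is built from all objects."""
    p = loadings(data, n_pcs)
    return sum(sq_error(x, p) for x in data)


def loo_error(data, i, n_pcs):
    """Cross-validated error for object i: the model excludes object i."""
    p = loadings(np.delete(data, i, axis=0), n_pcs)
    return sq_error(data[i], p)


def press(data, n_pcs):
    """Sum of the leave-one-out errors over all objects."""
    return sum(loo_error(data, i, n_pcs) for i in range(len(data)))


for a in range(1, 8):
    print(a, round(rss(X, a), 1), round(press(X, a), 1))
```

Under these assumptions the printed values should agree, up to rounding, with the RSS and PRESS columns of Table 4.9(b), and `loo_error(X, 0, a)` should agree with the last column of Table 4.8(b).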

There are, of course, many other modifications of cross-validation, and two of the most common are listed below.

1. Instead of removing one object at a time, remove a block of objects, for example four objects, and then cycle round so that each object is part of a group. This can speed up the cross-validation algorithm. However, with modern fast computers, this enhancement is less needed.

2. Remove portions of the data rather than individual samples. Statisticians have developed a number of approaches, and some traditional chemometrics software uses this method. This involves removing a certain number of measurements and replacing them by guesses, e.g. the standard deviation of the column, performing PCA and then determining how well these measurements are predicted. If too many PCs have been employed the measurements are not predicted well.
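The first modification only changes which objects are left out in each cycle. A minimal sketch, assuming consecutive blocks are acceptable (the helper name `block_folds` is hypothetical):

```python
def block_folds(n_objects, block_size):
    """Split object indices 0..n_objects-1 into consecutive blocks to leave out in turn."""
    return [list(range(start, min(start + block_size, n_objects)))
            for start in range(0, n_objects, block_size)]


# Ten objects in blocks of four: the final block holds the two remaining objects.
folds = block_folds(10, 4)

for left_out in folds:
    kept = [i for i in range(10) if i not in left_out]
    # Build the PCA model on `kept`, then predict the objects in `left_out`.
```

Each object is left out exactly once, so the PRESS calculation is unchanged apart from the number of models that have to be built.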


CHEMOMETRICS

 

 

 

 

 

 

 

 

Table 4.7 Cross-validation example.

Object        A        B        C        D        E        F        G        H
 1       89.821   59.760   68.502   48.099   56.296   95.478   71.116   95.701
 2       97.599   88.842   95.203   71.796   97.880  113.122   72.172   92.310
 3       91.043   79.551  104.336   55.900  107.807   91.229   60.906   97.735
 4       30.015   22.517   60.330   21.886   53.049   23.127   12.067   37.204
 5       37.438   38.294   50.967   29.938   60.807   31.974   17.472   35.718
 6       83.442   48.037   59.176   47.027   43.554   84.609   67.567   81.807
 7       71.200   47.990   86.850   35.600   86.857   57.643   38.631   67.779
 8       37.969   15.468   33.195   12.294   32.042   25.887   27.050   37.399
 9       34.604   68.132   63.888   48.687   86.538   63.560   35.904   40.778
10       74.856   36.043   61.235   37.381   53.980   64.714   48.673   73.166

Table 4.8 Calculation of cross-validated error for sample 1.

(a) Scores and loadings for first 8 PCs on 9 samples, excluding sample 1. Minus signs are not preserved in this copy of the table, so the scores and loadings below are magnitudes only.

Scores

Sample      PC1      PC2      PC3      PC4      PC5      PC6      PC7      PC8
 2       259.25     9.63    20.36     2.29     3.80     0.04     2.13     0.03
 3       248.37     8.48     5.08     3.38     1.92     5.81     0.53     0.46
 4        96.43    24.99    20.08     8.34     2.97     0.12     0.33     0.29
 5       109.79    23.52     3.19     0.38     5.57     0.38     3.54     1.41
 6       181.87    46.76     4.34     2.51     2.44     0.30     0.65     1.63
 7       180.04    16.41    20.74     2.09     1.57     1.91     3.55     0.16
 8        80.31     8.27    13.88     5.92     2.75     2.54     0.60     1.17
 9       157.45    34.71    27.41     1.10     4.03     2.69     0.80     0.46
10       161.67    23.85    12.29     0.32     1.12     2.19     2.14     2.63

Loadings

Variable    PC1      PC2      PC3      PC4      PC5      PC6      PC7      PC8
A         0.379    0.384    0.338    0.198    0.703    0.123    0.136    0.167
B         0.309    0.213    0.523    0.201    0.147    0.604    0.050    0.396
C         0.407    0.322    0.406    0.516    0.233    0.037    0.404    0.286
D         0.247    0.021    0.339    0.569    0.228    0.323    0.574    0.118
E         0.412    0.633    0.068    0.457    0.007    0.326    0.166    0.289
F         0.388    0.274    0.431    0.157    0.064    0.054    0.450    0.595
G         0.263    0.378    0.152    0.313    0.541    0.405    0.012    0.459
H         0.381    0.291    0.346    0.011    0.286    0.491    0.506    0.267

(b) Predictions for sample 1

Predicted scores (magnitudes)

    PC1      PC2      PC3      PC4      PC5      PC6      PC7
207.655   43.985    4.453    1.055    4.665    6.632    0.329

Predictions

PCs in model       A        B        C        D        E        F        G        H    Sum of square error
PC1           78.702   64.124   84.419   51.361   85.480   80.624   54.634   79.109   2025.934
PC2           95.607   54.750   70.255   50.454   57.648   92.684   71.238   91.909     91.235
PC3           94.102   57.078   68.449   51.964   57.346   94.602   71.916   90.369     71.405
PC4           94.310   57.290   67.905   51.364   57.828   94.436   72.245   90.380     70.292
PC5           91.032   56.605   68.992   50.301   57.796   94.734   74.767   91.716     48.528
PC6           90.216   60.610   69.237   48.160   55.634   94.372   72.078   94.972      4.540
PC7           90.171   60.593   69.104   48.349   55.688   94.224   72.082   95.138      4.432

 


 

 

 

 

 

 

 

 

 

Table 4.9 Calculation of RSS and PRESS.

(a) Summary of cross-validated sum of square errors

Object        1        2        3        4        5        6        7        8        9       10
PC1      2025.9    681.1    494.5   1344.6    842.0   2185.2   1184.2    297.1   2704.0    653.5
PC2        91.2    673.0    118.1    651.5     66.5     67.4    675.4    269.8   1655.4    283.1
PC3        71.4     91.6     72.7    160.1     56.7     49.5     52.5     64.6    171.6     40.3
PC4        70.3     89.1     69.7    159.0     56.5     36.2     51.4     62.1    168.5     39.3
PC5        48.5     59.4     55.5    157.4     46.7     36.1     39.4     49.9    160.8     29.9
PC6         4.5     40.8      8.8    154.5     39.5     19.5     38.2     18.9    148.4     26.5
PC7         4.4      0.1      2.1    115.2     30.5     18.5     27.6     10.0    105.1     22.6

(b) RSS and PRESS calculations

PC    Eigenvalues        RSS      PRESS    PRESS_a/RSS_{a−1}
1        316522.1    10110.9    12412.1
2          7324.6     2786.3     4551.5    0.450
3          2408.7      377.7      830.9    0.298
4           136.0      241.7      802.2    2.124
5           117.7      123.9      683.7    2.829
6            72.9       51.1      499.7    4.031
7            36.1       15.0      336.2    6.586
8            15.0        0.0        n/a

 

 

Figure 4.8 Graph of log of PRESS (top) and RSS (bottom) for dataset in Table 4.7 [vertical axis: residual error; horizontal axis: component, 1–7]


As in the case of most chemometric methods, there are innumerable variations on the theme, and it is important to check each author's approach in detail. However, the ‘leave one sample out at a time’ method described above is popular, and relatively easy to implement and understand.

4.3.4 Factor Analysis

Statisticians do not always distinguish between factor analysis and principal components analysis, but for chemists factors often have a physical significance, whereas PCs are simply abstract entities. However, it is possible to relate PCs to chemical information, such as elution profiles and spectra in HPLC–DAD by

$$\hat{\mathbf{X}} = \mathbf{T} \cdot \mathbf{P} = \hat{\mathbf{C}} \cdot \hat{\mathbf{S}}$$

The conversion from ‘abstract’ to ‘chemical’ factors is sometimes called a rotation or transformation and will be discussed in more detail in Chapter 6, and is illustrated in Figure 4.9. Note that factor analysis is by no means restricted to chromatography. An example is the pH titration profile of a number of species containing different numbers of protons together with their spectra. Each equilibrium species has a pH titration profile and a characteristic spectrum.
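The idea of a rotation can be illustrated numerically. The sketch below is an assumption-laden toy example, not a method from this chapter: the profiles `C` and spectra `S` are random, noise-free and known in advance, which is never the case in practice, and the rotation matrix `R` is found by least squares purely to show that the abstract and chemical factorisations reproduce the same data.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.random((20, 2))   # hypothetical elution profiles: 20 time points, 2 compounds
S = rng.random((2, 6))    # hypothetical spectra: 2 compounds, 6 wavelengths
X = C @ S                 # noise-free two-way data matrix

# Abstract factors from (uncentred) PCA.
u, s, vt = np.linalg.svd(X, full_matrices=False)
T = u[:, :2] * s[:2]      # scores
P = vt[:2]                # loadings

# Rotation R chosen so that T R reproduces the known profiles.
R = np.linalg.lstsq(T, C, rcond=None)[0]
C_hat = T @ R
S_hat = np.linalg.inv(R) @ P
```

Because R multiplied by its inverse is the identity matrix, `C_hat @ S_hat` equals `T @ P`; the difficult part in real applications is finding a suitable R when the profiles are unknown.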

Figure 4.9 Relationship between PCA and factor analysis in coupled chromatography [diagram: PCA decomposes the chromatogram into scores and loadings; a transformation relates the scores to elution profiles and the loadings to spectra]


Factor analysis is often called by a number of alternative names, such as ‘rotation’ or ‘transformation’, but it is a procedure used to relate the abstract PCs to meaningful chemical factors; the influence of Malinowski in the 1980s introduced this terminology into chemometrics.

4.3.5 Graphical Representation of Scores and Loadings

Many revolutions in chemistry relate to the graphical presentation of information. For example, fundamental to the modern chemist’s way of thinking is the ability to draw structures on paper in a convenient and meaningful manner. Years of debate preceded the general acceptance of the Kekulé structure for benzene: today’s organic chemist can write down and understand complex structures of natural products without the need to plough through pages of numbers of orbital densities and bond lengths. Yet, underlying these representations are quantum mechanical probabilities, so the ability to convert from numbers to a simple diagram has allowed a large community to think clearly about chemical reactions.

So with statistical data, and modern computers, it is easy to convert from numbers to graphs. Many modern multivariate statisticians think geometrically as much as numerically, and concepts such as principal components are often treated as objects in an imaginary space rather than mathematical entities. The algebra of multidimensional space is the same as that of multivariate statistics. Older texts, of course, were written before the days of modern computing, so the ability to produce graphs was more limited. However, now it is possible to obtain a large number of graphs rapidly using simple software, and much is even possible using Excel. There are many ways of visualising PCs. Below we will look primarily at graphs of the first two PCs, for simplicity.

4.3.5.1 Scores Plots

One of the simplest plots is that of the score of one PC against the other. Figure 4.10 illustrates the PC plot of the first two PCs obtained from case study 1, corresponding to plotting a graph of the first two columns of Table 4.3. The horizontal axis represents the scores for the first PC and the vertical axis those for the second PC. This ‘picture’ can be interpreted as follows:

• the linear regions of the graph represent regions of the chromatogram where there are pure compounds;

• the curved portion represents a region of co-elution;

• the closer to the origin, the lower the intensity.

Hence the PC plot suggests that the region between 6 and 10 s (approximately) is one of co-elution. The reason why this method works is that the spectrum over the chromatogram changes with elution time. During co-elution the spectral appearance changes most, and PCA uses this information.
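The reason why pure regions appear as straight lines can be checked on simulated data. The sketch below is not case study 1: it assumes two hypothetical compounds with triangular elution profiles (so that the pure regions are exactly pure), arbitrary six-point spectra and no noise.

```python
import numpy as np

t = np.arange(30)
c_a = np.clip(1 - np.abs(t - 10) / 8, 0, None)   # compound A elutes at points 3-17
c_b = np.clip(1 - np.abs(t - 18) / 8, 0, None)   # compound B elutes at points 11-25
s_a = np.array([0.9, 0.7, 0.4, 0.2, 0.1, 0.3])   # hypothetical spectrum of A
s_b = np.array([0.1, 0.3, 0.6, 0.9, 0.5, 0.2])   # hypothetical spectrum of B
X = np.outer(c_a, s_a) + np.outer(c_b, s_b)

# Scores of the first two PCs (uncentred PCA via SVD).
u, s, vt = np.linalg.svd(X, full_matrices=False)
scores = u[:, :2] * s[:2]


def angle_spread(points):
    """Range of directions (radians) of the points as seen from the origin."""
    angles = np.arctan2(points[:, 1], points[:, 0])
    return float(angles.max() - angles.min())


spread_pure_a = angle_spread(scores[3:11])    # only A present
spread_pure_b = angle_spread(scores[18:26])   # only B present
spread_mixed = angle_spread(scores[11:18])    # co-elution
```

In each pure region every spectrum is a multiple of one pure spectrum, so the points lie on a single line through the origin and the spread of directions is essentially zero; during co-elution the direction changes from one point to the next, which produces the curved portion of the plot.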

How can these graphs help?

• The pure regions can inform us about the spectra of the pure compounds.

• The shape of the PC plot informs us of the amount of overlap and quality of chromatography.

• The number of bends in a PC plot can provide information about the number of different compounds in a complex multipeak cluster.

Figure 4.10 Scores of PC2 (vertical axis) versus PC1 (horizontal axis) for case study 1 [points are labelled with their datapoint numbers]

In cases where there is a meaningful sequential order to a dataset, as in spectroscopy or chromatography, but also, for example, where objects are related in time or pH, it is possible to plot the scores against sample number (see Figure 4.11). From this it appears that the first PC primarily relates to the magnitude of the measurements, whereas the second discriminates between the two components in the mixture, being positive for the fastest eluting component and negative for the slowest. Note that the appearance and interpretation of such plots depend crucially on data scaling, as will be discussed in Section 4.3.6. Plots of this kind will be described in more detail in Chapter 6, Section 6.2, in the context of evolutionary signals.

The scores plot for the first two PCs of case study 2 has already been presented in Figure 4.4. Unlike case study 1, there is no sequential significance in the order of the chromatographic columns. However, several deductions are possible.

• Closely clustering objects, such as the three Inertsil columns, behave very similarly.

• Objects that are diametrically opposed are ‘negatively correlated’, for example Kromasil C8 and Purospher. This means that a parameter that has a high value for Purospher is likely to have a low value for Kromasil C8 and vice versa. This would suggest that each column has a different purpose.

Scores plots can be used to answer many different questions about the relationship between objects and more examples are given in the problems at the end of this chapter.

Figure 4.11 Scores of the first two PCs of case study 1 versus sample number [vertical axis: PC scores; horizontal axis: datapoint; one curve each for PC1 and PC2]

4.3.5.2 Loadings Plots

It is not only the scores, however, that are of interest, but also sometimes the loadings. Exactly the same principles apply in that the value of the loadings at one PC can be plotted against that at the other PC. The result for the first two PCs for case study 1 is shown in Figure 4.12. This figure looks complicated, which is because both spectra overlap and absorb at similar wavelengths. The pure spectra are presented in Figure 4.13. Now we can understand a little more about these graphs.

We can see that the top of the scores plot corresponds to the direction for the fastest eluting compound (=A), whereas the bottom corresponds to that for the slowest eluting compound (=B) (see Figure 4.10). Similar interpretation can be obtained from the loadings plots. Wavelengths in the bottom half of the graph correspond mainly to B, for example 301 and 225 nm. In Figure 4.13, these wavelengths are indicated and represent the maximum ratio of the spectral intensities of B to A. In contrast, high wavelengths, above 325 nm, belong to A, and are displayed in the top half of the graph. The characteristic peak for A at 244 nm is also obvious in the loadings plot.

Further interpretation is possible, but it can easily be seen that the loadings plots provide detailed information about which wavelengths are most associated with which compound. For complex multicomponent clusters or spectra of mixtures, this information can be very valuable, especially if the pure components are not available.

The loadings plot for case study 2 is especially interesting, and is presented in Figure 4.14. What can we say about the tests?

Figure 4.12 Loadings plot of PC2 (vertical axis) against PC1 (horizontal axis) for case study 1 [points are labelled by wavelength in nm]

Figure 4.13 Pure spectra of compounds in case study 1 [spectra of A and B against wavelength (nm), 220–340 nm, with 225 and 301 nm marked]

Figure 4.14 Loadings plots for the first two (standardised) PCs of case study 2 [points are labelled by test compound and parameter: k, As, N and N(df)]

• The k loadings are very closely clustered, suggesting that this parameter does not vary much according to compound or column. As, N and N(df) show more discrimination. N and N(df) are very closely correlated.

• As and N are almost diametrically opposed, suggesting that they measure opposite properties, i.e. a high As corresponds to a low N [or N(df)] value.

• Some parameters are in the middle of the loadings plots, such as NN. These behave atypically and are probably not useful indicators of column performance.

• Most loadings are on an approximate circle. This is because standardisation is used, and suggests that we are probably correct in keeping only two principal components.

• The order of the compounds for both As and N, reading clockwise around the circle, is very similar, with P, D and N at one extreme and Q and C at the other extreme. This suggests that behaviour is grouped according to chemical structure, and also that it is possible to reduce the number of test compounds by selecting one compound in each group.

These conclusions can provide very valuable experimental clues as to which tests are most useful. For example, it might be impracticable to perform large numbers of tests, so can we omit some compounds? Should we measure all these parameters, or are some of them useful and some not? Are some measurements misleading, and not typical of the overall pattern?

Loadings plots can be used to answer a lot of questions about the data, and are a very flexible facility available in almost all chemometrics software.


4.3.5.3 Extensions

In many cases, more than two significant principal components characterise the data, but the concepts above can still be employed, except that many more possible graphs can be computed. For example, if four significant components are calculated, we can produce six possible graphs, one for each possible combination of PCs, for example PC 4 versus 2, or PC 3 versus 1, and so on. If there are A significant PCs, there will be $\sum_{a=1}^{A-1} a = A(A-1)/2$ possible PC plots of one PC against another. Each graph could reveal interesting trends. It is also possible to produce three-dimensional PC plots, the axes of which consist of three PCs (normally the first three), and so visualise relationships between and clusters of variables in three-dimensional space.
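The count of possible two-dimensional plots is simply the number of ways of choosing two PCs from A. A short sketch (the function name is hypothetical) that enumerates the pairs:

```python
from itertools import combinations


def pc_plot_pairs(n_pcs):
    """All pairs (i, j) with i < j of PC numbers that can be plotted against each other."""
    return list(combinations(range(1, n_pcs + 1), 2))


pairs = pc_plot_pairs(4)   # four significant PCs
```

For four PCs the list contains six pairs, including (2, 4) and (1, 3), as in the example in the text.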

4.3.6 Preprocessing

All chemometric methods are influenced by the method for data preprocessing, that is, preparing information prior to application of mathematical algorithms. Preprocessing is one of the first steps in data preparation, and an understanding of it is essential for correct interpretation of the output of multivariate data packages; here it is illustrated with reference to PCA. It is often called scaling, and the most appropriate choice can relate to the chemical or physical aim of the analysis. Scaling is normally performed prior to PCA, but in this chapter it is introduced afterwards, as it is hard to understand how preprocessing influences the resultant models without first appreciating the main concepts of PCA.

4.3.6.1 Example

As an example, consider a data matrix consisting of 10 rows (labelled from 1 to 10) and eight columns (labelled from A to H), as in Table 4.10. This could represent a portion of a two-way HPLC–DAD data matrix, the elution profile of which is given in Figure 4.15, but similar principles apply to all multivariate data matrices. We choose a small example rather than case study 1 for this purpose, in order to be able to demonstrate all the steps numerically. The calculations are illustrated with reference to the first two PCs, but similar ideas are applicable when more components are computed.

4.3.6.2 Raw Data

The resultant principal components scores and loadings plots are given in Figure 4.16. Several conclusions are possible.

• There are probably two main compounds, one of which has a region of purity between points 1 and 3, and the other between points 8 and 10.

• Measurements (e.g. spectral wavelengths) A, B, G and H correspond mainly to the first chemical component, whereas measurements D and E correspond mainly to the second chemical component.

PCA has been performed directly on the raw data, something statisticians in other disciplines very rarely do. It is important to be very careful when using for chemical
