

wavelength selection, so enabling inverse models to be produced, but there is no real advantage over classical least squares in these situations.

5.4 Principal Components Regression

MLR-based methods have the disadvantage that all significant components must be known. PCA-based methods do not require details of the spectra or concentrations of all the compounds in a mixture; it is important to make a sensible estimate of how many significant components characterise the mixture, but not necessarily to know their characteristics.

Principal components are primarily abstract mathematical entities and further details are described in Chapter 4. In multivariate calibration the aim is to convert these to compound concentrations. PCR uses regression (sometimes also called transformation or rotation) to convert PC scores to concentrations. This process is often loosely called factor analysis, although terminology differs according to author and discipline. Note that although the chosen example in this chapter involves calibrating concentrations to spectral absorbances, it is equally possible, for example, to calibrate the property of a material to its structural features, or the activity of a drug to molecular parameters.

5.4.1 Regression

If cn is a vector containing the known concentration of compound n in the spectra (25 in this instance), then the PC scores matrix, T, can be related as follows:

$$c_n \approx T \cdot r_n$$

where rn is a column vector whose length equals the number of PCs retained, sometimes called a rotation or transformation vector. Ideally, the length of rn should be equal to the number of compounds in the mixture (= 10 in this case). However, noise, spectral similarities and correlations between concentrations often make it hard to provide an exact estimate of the number of significant components; this topic has been introduced in Section 4.3.3 of Chapter 4. We will assume, for the purpose of this section, that 10 PCs are employed in the model.

The scores of the first 10 PCs are presented in Table 5.10, using raw data. The transformation vector can be obtained by using the pseudo-inverse of T:

$$r_n = (T' \cdot T)^{-1} \cdot T' \cdot c_n$$

Note that the matrix (T′·T) is actually a diagonal matrix, whose elements consist of the 10 eigenvalues of the PCs, so each element of r can be expressed as a summation:

$$r_{na} = \frac{\sum_{i=1}^{I} t_{ia}\, c_{in}}{g_a}, \qquad a = 1, \ldots, A$$
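As a minimal illustration (not code from the book: the NumPy route and the random placeholder numbers standing in for Table 5.10 are assumptions), the transformation vector can be obtained either from the pseudo-inverse or, because T′·T is diagonal for PCA scores, element by element from the eigenvalues:

```python
import numpy as np

# Placeholder data standing in for the case study: I = 25 spectra, A = 10 retained PCs
rng = np.random.default_rng(0)
T = rng.normal(size=(25, 10))      # PC scores matrix (illustrative values only)
c_n = rng.random(25)               # known concentrations of compound n

# Pseudo-inverse form: r_n = (T'T)^-1 T' c_n
r_n = np.linalg.solve(T.T @ T, T.T @ c_n)

# Element-wise form: for genuine PCA scores T'T is diagonal, so each element is
# sum_i t_ia * c_in divided by the eigenvalue g_a = sum_i t_ia^2
g = np.sum(T ** 2, axis=0)
r_n_elementwise = (T.T @ c_n) / g  # coincides with r_n only when the scores are orthogonal
```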


Table 5.10 Scores of the first 10 PCs for the PAH case study (one row per spectrum 1-25, one column per PC 1-10).

2.757  0.008  0.038  0.008  0.026  0.016  0.012  0.004  0.006  0.006
3.652  0.063  0.238  0.006  0.021  0.000  0.018  0.005  0.009  0.013
2.855  0.022  0.113  0.049  0.187  0.039  0.053  0.004  0.007  0.003
2.666  0.267  0.040  0.007  0.073  0.067  0.002  0.002  0.013  0.006
3.140  0.029  0.006  0.153  0.111  0.015  0.030  0.022  0.014  0.006
3.437  0.041  0.090  0.034  0.027  0.018  0.010  0.014  0.032  0.006
1.974  0.161  0.296  0.107  0.090  0.010  0.003  0.037  0.004  0.008
2.966  0.129  0.161  0.147  0.043  0.016  0.006  0.010  0.013  0.035
2.545  0.054  0.143  0.080  0.074  0.073  0.013  0.008  0.025  0.006
3.017  0.425  0.159  0.002  0.096  0.049  0.010  0.013  0.018  0.022
2.005  0.371  0.003  0.120  0.093  0.032  0.015  0.025  0.003  0.003
1.648  0.239  0.020  0.123  0.090  0.051  0.017  0.021  0.009  0.007
1.884  0.215  0.020  0.167  0.024  0.041  0.041  0.007  0.017  0.001
1.666  0.065  0.126  0.070  0.007  0.005  0.016  0.036  0.025  0.009
2.572  0.085  0.028  0.095  0.184  0.046  0.045  0.016  0.013  0.006
2.532  0.262  0.126  0.047  0.084  0.076  0.004  0.005  0.017  0.010
2.171  0.014  0.028  0.166  0.080  0.008  0.018  0.007  0.003  0.007
1.900  0.020  0.027  0.015  0.006  0.018  0.005  0.030  0.029  0.002
3.174  0.114  0.312  0.059  0.014  0.066  0.009  0.016  0.011  0.007
2.610  0.204  0.037  0.036  0.069  0.041  0.020  0.005  0.012  0.033
2.567  0.119  0.155  0.090  0.017  0.050  0.017  0.023  0.013  0.005
2.389  0.445  0.190  0.045  0.091  0.026  0.026  0.021  0.006  0.015
3.201  0.282  0.043  0.062  0.066  0.015  0.026  0.032  0.015  0.009
3.537  0.182  0.071  0.166  0.094  0.069  0.026  0.013  0.016  0.021
3.343  0.113  0.252  0.012  0.086  0.019  0.031  0.048  0.018  0.004

Table 5.11 Vector r for pyrene.

0.166  0.470  0.624  0.168  1.899  1.307  1.121  0.964  3.106  0.020

or even as a product of vectors:

$$r_{na} = \frac{t_a' \cdot c_n}{g_a}, \qquad a = 1, \ldots, A$$

We remind the reader of the main notation:

n refers to compound number (e.g. pyrene = 1);

a to PC number (e.g. 10 significant components, not necessarily equal to the number of compounds in a series of samples);

i to sample number (=1–25 in this case).


This vector for pyrene using 10 PCs is presented in Table 5.11. If the concentrations of some or all the compounds are known, PCR can be extended simply by replacing the vector cn with a matrix C, each column corresponding to a compound in the mixture, so that

$$C \approx T \cdot R$$

and

$$R = (T' \cdot T)^{-1} \cdot T' \cdot C$$

The number of PCs must be at least equal to the number of compounds of interest in the mixture. R has dimensions A × N .
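A brief sketch of this matrix form (the NumPy code and the placeholder arrays below are illustrative assumptions, not the case-study values):

```python
import numpy as np

rng = np.random.default_rng(1)
T = rng.normal(size=(25, 10))          # PC scores: I = 25 samples, A = 10 components
C = rng.random((25, 10))               # known concentrations: N = 10 compounds

# R = (T'T)^-1 T' C, one column per compound; R has dimensions A x N
R = np.linalg.solve(T.T @ T, T.T @ C)
C_hat = T @ R                          # concentration estimates for all compounds at once
```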

If the number of PCs and the number of significant compounds are equal, so that, in this example, T and C are 25 × 10 matrices, then R is a square matrix of dimensions N × N and

$$X = T \cdot P = T \cdot R \cdot R^{-1} \cdot P = \hat{C} \cdot \hat{S}$$

Hence, by calculating R⁻¹·P, it is possible to determine the estimated spectra of each individual component without knowing this information in advance, and by calculating T·R concentration estimates can be obtained. Table 5.12 provides the concentration estimates using PCR with 10 significant components.

Table 5.12 Concentration estimates for the PAHs using PCR and 10 components (uncentred). PAH concentration (mg l^-1).

Spectrum No.  Py     Ace    Anth   Acy    Chry   Benz   Fluora  Fluore  Nap    Phen
1             0.505  0.113  0.198  0.131  0.375  1.716  0.128   0.618   0.094  0.445
2             0.467  0.120  0.286  0.113  0.455  2.686  0.137   0.381   0.168  0.782
3             0.161  0.178  0.296  0.174  0.558  1.647  0.094   0.836   0.162  0.161
4             0.682  0.177  0.231  0.165  0.354  1.119  0.123   0.720   0.049  0.740
5             0.810  0.128  0.297  0.156  0.221  2.154  0.159   0.316   0.189  0.482
6             0.575  0.170  0.159  0.107  0.428  2.240  0.072   0.942   0.146  1.000
7             0.782  0.152  0.104  0.152  0.470  0.454  0.162   0.477   0.169  0.220
8             0.401  0.111  0.192  0.170  0.097  2.153  0.182   1.014   0.062  0.322
9             0.284  0.084  0.237  0.106  0.429  1.668  0.166   0.241   0.022  0.331
10            0.578  0.197  0.023  0.157  0.321  2.700  0.077   0.194   0.090  0.300
11            0.609  0.075  0.194  0.103  0.550  0.460  0.080   0.472   0.038  0.656
12            0.185  0.172  0.183  0.147  0.083  0.558  0.086   0.381   0.101  0.701
13            0.555  0.103  0.241  0.092  0.084  1.104  0.007   0.576   0.156  0.475
14            0.461  0.167  0.067  0.089  0.212  0.624  0.111   0.812   0.114  0.304
15            0.770  0.019  0.076  0.089  0.115  1.669  0.178   0.393   0.068  0.884
16            0.109  0.033  0.101  0.040  0.349  2.189  0.108   0.352   0.190  0.376
17            0.178  0.102  0.073  0.086  0.481  1.057  0.112   0.805   0.073  0.486
18            0.271  0.067  0.145  0.104  0.221  1.077  0.183   0.273   0.142  0.250
19            0.186  0.135  0.217  0.101  0.253  2.618  0.071   0.369   0.036  0.919
20            0.406  0.109  0.111  0.145  0.534  1.126  0.111   0.306   0.220  0.973
21            0.665  0.110  0.152  0.130  0.284  1.541  0.044   0.720   0.165  0.614
22            0.315  0.112  0.258  0.092  0.336  0.501  0.205   0.981   0.162  1.009
23            0.327  0.179  0.126  0.115  0.126  2.610  0.160   0.847   0.161  0.537
24            0.766  0.075  0.168  0.029  0.525  2.692  0.121   1.139   0.124  0.383
25            0.333  0.110  0.053  0.210  0.539  2.151  0.135   0.709   0.086  0.738
E%            10.27  36.24  15.76  42.06  9.05   4.24   31.99   24.77   21.11  16.19


The percentage mean square error of prediction (equal to the square root of the sum of squares of the prediction errors divided by the number of degrees of freedom, 15 (= 25 - 10), to account for the number of components in the model, rather than by 25) for all 10 compounds is also presented in Table 5.12. In most cases it is slightly better than using MLR; there are certainly fewer very large errors. However, the major advantage is that the prediction using PCR is the same whether only one or all 10 compounds are included in the model. In this it differs radically from MLR; the estimates in Table 5.9 are much worse than those in Table 5.8, for example. The first main task when using PCR is to determine how many significant components are necessary to model the data.
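The PCR calculation described above (estimated concentrations Ĉ = T·R, estimated spectra Ŝ = R⁻¹·P, and the percentage error using the 25 - 10 degrees-of-freedom divisor) can be sketched end to end as follows; the SVD route to uncentred PCA and the random placeholder data are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((25, 27))               # mixture spectra (placeholder data)
C = rng.random((25, 10))               # known concentrations of the 10 compounds

# Uncentred PCA via SVD, keeping A = 10 components, so that X is approximated by T.P
U, s, Vt = np.linalg.svd(X, full_matrices=False)
A = 10
T, P = U[:, :A] * s[:A], Vt[:A, :]

R = np.linalg.solve(T.T @ T, T.T @ C)  # square 10 x 10 transformation matrix
C_hat = T @ R                          # estimated concentrations
S_hat = np.linalg.solve(R, P)          # estimated spectra, R^-1 . P

# Percentage error for each compound, dividing by 25 - 10 degrees of freedom
E = np.sqrt(((C - C_hat) ** 2).sum(axis=0) / (25 - A))
E_percent = 100 * E / C.mean(axis=0)
```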

5.4.2 Quality of Prediction

A key issue in calibration is to determine how well the data have been modelled. We have used only one indicator above, but it is important to appreciate that there are many other potential statistics.

5.4.2.1 Modelling the c Block

Most indicators look at how well the concentration is predicted, that is, the c block of data (called the y block by some authors).

The simplest method is to determine the sum of squares of the residuals between the true and predicted concentrations:

$$S_c = \sum_{i=1}^{I} (c_{in} - \hat{c}_{in})^2$$

where

$$\hat{c}_{in} = \sum_{a=1}^{A} t_{ia}\, r_{an}$$

for compound n using a principal components. The larger this error, the worse the prediction; note that the error decreases as more components are calculated.

Often the error is reported as a root mean square error:

$$E = \sqrt{\frac{\sum_{i=1}^{I} (c_{in} - \hat{c}_{in})^2}{I - a}}$$

If the data are centred, a further degree of freedom is lost, so the sum of squared residuals is divided by I - a - 1.

This error can also be presented as a percentage error:

$$E\% = 100E/\bar{c}_n$$

where $\bar{c}_n$ is the mean concentration in the original units. Sometimes the percentage of the standard deviation is calculated instead, but in this text we will compute errors as a percentage of the mean unless specifically stated otherwise.
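A short sketch of these c block error measures for a single compound (the predictions below are random placeholders; only the divisor conventions follow the text):

```python
import numpy as np

rng = np.random.default_rng(3)
I, a = 25, 10
c_true = rng.random(I)                        # known concentrations of one compound
c_hat = c_true + 0.05 * rng.normal(size=I)    # placeholder predictions from an a-PC model

S_c = np.sum((c_true - c_hat) ** 2)           # sum of squared residuals

E_uncentred = np.sqrt(S_c / (I - a))          # divisor I - a for uncentred data
E_centred = np.sqrt(S_c / (I - a - 1))        # one further degree of freedom lost
E_percent = 100 * E_uncentred / c_true.mean() # error as a percentage of the mean
```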


5.4.2.2 Modelling the x Block

It is also possible to report errors in terms of quality of modelling of spectra (or chromatograms), often called the x block error.

The quality of modelling of the spectra using PCA (the x variance) can likewise be calculated as follows:

$$S_x = \sum_{i=1}^{I}\sum_{j=1}^{J} (x_{ij} - \hat{x}_{ij})^2$$

where

$$\hat{x}_{ij} = \sum_{a=1}^{A} t_{ia}\, p_{aj}$$

 

 

However, this error can also be expressed in terms of eigenvalues or scores, so that

$$S_x = \sum_{i=1}^{I}\sum_{j=1}^{J} x_{ij}^2 - \sum_{a=1}^{A} g_a = \sum_{i=1}^{I}\sum_{j=1}^{J} x_{ij}^2 - \sum_{a=1}^{A}\sum_{i=1}^{I} t_{ia}^2$$

for A principal components. These can be converted to root mean square errors as above:

$$E = \sqrt{S_x / (I \cdot J)}$$

Note that many people divide by I·J (= 25 × 27 = 675 in our case) rather than the more strictly correct I·J - a (adjusting for degrees of freedom), because I·J is very large relative to a, and we will adopt this convention.

The percentage root mean square error may be defined (for uncentred data) by

$$E\% = 100E/\bar{x}$$

where $\bar{x}$ is the mean of the original measurements.

Note that if x is centred, the divisor is often given by

$$\sqrt{\frac{\sum_{i=1}^{I}\sum_{j=1}^{J} (x_{ij} - \bar{x}_j)^2}{I \cdot J}}$$

where $\bar{x}_j$ is the average of the measurements of variable j over all samples. Obviously there are several other ways of defining this error: if you try to follow a paper or a package, read very carefully the documents provided by the authors, and if there is no documentation, do not trust the answers.
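The corresponding x block quantities might be computed as follows (again random placeholder data; the SVD-based uncentred PCA is only a convenient way of generating T and P for the sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
I, J, A = 25, 27, 10
X = rng.random((I, J))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
T, P = U[:, :A] * s[:A], Vt[:A, :]
X_hat = T @ P

# Residual sum of squares, directly and via the eigenvalues g_a = sum_i t_ia^2
S_x_residual = np.sum((X - X_hat) ** 2)
g = np.sum(T ** 2, axis=0)
S_x_eigen = np.sum(X ** 2) - np.sum(g)        # identical for orthogonal PCA scores

E = np.sqrt(S_x_residual / (I * J))           # dividing by I.J rather than I.J - a
E_percent_uncentred = 100 * E / X.mean()

# For column-centred data, the divisor of E% is the root mean square deviation
# of the measurements about their column means
centred_divisor = np.sqrt(np.sum((X - X.mean(axis=0)) ** 2) / (I * J))
```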

Note that the x error depends only on the number of PCs, no matter how many compounds are being modelled, but the error in concentration estimates depends also on the specific compound, there being a different percentage error for each compound in the mixture. For 0 PCs, the estimates of the PCs and concentrations are simply 0 or, if mean-centred, the mean. The root mean square errors for the concentration estimates of pyrene and for the spectra, as increasing numbers of PCs are calculated, are plotted in Figure 5.11, using a logarithmic scale for the error. Although the x error graph declines steeply, which might falsely suggest that only a small number of PCs


are required for the model, the c error graph exhibits a much gentler decline. Sometimes these graphs are presented either as percentage variance remaining (or explained by each PC) or as eigenvalues.

Figure 5.11 Root mean square errors of estimation of pyrene using uncentred PCR: (a) x block; (b) c block (RMS error, logarithmic scale, against component number)

5.5 Partial Least Squares

PLS is often presented as the major regression technique for multivariate data. In fact its use is not always justified by the data, and the originators of the method were well aware of this, but, that being said, in some applications PLS has been spectacularly successful. In some areas such as QSAR, or even biometrics and psychometrics,


PLS is an invaluable tool, because the underlying factors have little or no physical meaning, so a linearly additive model in which each underlying factor can be interpreted chemically is not anticipated. In spectroscopy or chromatography we usually expect linear additivity, and this is especially important for chemical instrumental data, and under such circumstances simpler methods such as MLR are often useful provided that there is a fairly full knowledge of the system. However, PLS is always an important tool when there is partial knowledge of the data, a well-known example being the measurement of protein in wheat by NIR spectroscopy. A model can be obtained from a series of wheat samples, and PLS will use typical features in this dataset to establish a relationship to the known amount of protein. PLS models can be very robust provided that future samples contain similar features to the original data, but the predictions are essentially statistical. Another example is the determination of vitamin C in orange juices using spectroscopy: a very reliable PLS model could be obtained using orange juices from a particular region of Spain, but what if some Brazilian orange juice is included? There is no guarantee that the model will perform well on the new data, as there may be different spectral features, so it is always important to be aware of the limitations of the method, particularly to remember that the use of PLS cannot compensate for poorly designed experiments or inadequate experimental data.

An important feature of PLS is that it takes into account errors in both the concentration estimates and the spectra. A method such as PCR assumes that the concentration estimates are error free. Much traditional statistics rests on this assumption, that all the errors lie in the measured variables (the spectra). If in medicine it is decided to determine the concentration of a compound in the urine of patients as a function of age, it is assumed that age can be estimated exactly, the statistical variation being in the concentration of the compound and the nature of the urine sample. Yet in chemistry there are often significant errors in sample preparation, for example in the accuracy of weighings and dilutions, and so the independent variable in itself also contains errors. Classical and inverse calibration force the user to choose which variable contains the error, whereas PLS assumes that it is equally distributed in both the x and c blocks.

5.5.1 PLS1

The most widespread approach is often called PLS1. Although there are several algorithms, the main ones being due to Wold and Martens, the overall principles are fairly straightforward. Instead of modelling exclusively the x variables, two sets of models

are obtained as follows:

X = T .P + E

c = T .q + f

where q has analogies to a loadings vector, although it is not normalised. These matrices are represented in Figure 5.12. The product of T and P approximates to the spectral data and the product of T and q to the true concentrations; the common link is T. An important feature of PLS is that it is possible to obtain a scores matrix that is common to both the concentrations (c) and measurements (x). Note that T and P for PLS are different from T and P obtained in PCA, and unique sets of scores and loadings are obtained for each compound in the dataset. Hence if there are 10 compounds

of interest, there will be 10 sets of T, P and q. In this way PLS differs from PCR, in which there is only one set of T and P, the PCA step taking no account of the c block. It is important to recognise that there are several algorithms for PLS available in the literature, and although the predictions of c are the same in each case, the scores and loadings are not. In this book and the associated Excel software, we use the algorithm of Appendix A.2.2. Although the scores are orthogonal (as in PCA), the loadings are not (an important difference from PCA), and, furthermore, the loadings are not normalised, so the sum of squares of each p vector does not equal one. If you are using a commercial software package, it is important to check exactly what constraints and assumptions the authors make about the scores and loadings in PLS.

Figure 5.12 Principles of PLS1: X (I × J) is decomposed as T (I × A) . P (A × J) + E, and c (I × 1) as T (I × A) . q (A × 1) + f

Additionally, the analogue of $g_a$, the eigenvalue of a PC, involves multiplying the sums of squares of both $t_a$ and $p_a$ together, so we define the magnitude of a PLS component as

$$g_a = \left( \sum_{i=1}^{I} t_{ia}^2 \right) \left( \sum_{j=1}^{J} p_{aj}^2 \right)$$

This will have the property that the sum of the values of ga over all nonzero components adds up to the sum of squares of the original (preprocessed) data. Note that, in contrast to PCA, the size of successive values of ga does not necessarily decrease as each component is calculated. This is because PLS does not model only the x data; it is a compromise between x and c block regression.

There are a number of alternative ways of presenting the PLS regression equations in the literature, all mathematically equivalent. In the models above, we obtain three arrays T, P and q. Some authors calculate a normalised vector, w, proportional to q, and the second equation becomes

c = T .B.w + f


where B is a diagonal matrix. Analogously, it is also possible to define the PCA decomposition of x as a product of three arrays (the SVD method, which is used in Matlab, is a common alternative to NIPALS), but the models used in this chapter have the simplicity of using a single scores matrix for both blocks of data, and of modelling each dataset using two matrices.

For a dataset consisting of 25 spectra observed at 27 wavelengths, for which eight PLS components are calculated, there will be

a T matrix of dimensions 25 × 8;

a P matrix of dimensions 8 × 27;

an E matrix of dimensions 25 × 27;

a q vector of dimensions 8 × 1;

an f vector of dimensions 25 × 1.

Note that there will be 10 separate sets of these arrays, in the case discussed in this chapter, one for each compound in the mixture, and that the T matrix will be compound dependent, which differs from PCR.

Each successive PLS component approximates both the concentration and spectral data better. For each PLS component, there will be a

spectral scores vector t;

spectral loadings vector p;

concentration loadings scalar q.

In most implementations of PLS it is conventional to centre both the x and c data, by subtracting the mean of each column, before analysis. In fact, there is no general scientific need to do this. Many spectroscopists and chromatographers perform PCA uncentred, but many early applications of PLS (e.g. outside chemistry) were of such a nature that centring the data was appropriate. Many of the historical developments in PLS as used for multivariate calibration in chemistry relate to applications in NIR spectroscopy, where there are specific spectroscopic problems, such as baseline effects, which in turn favour centring. However, as generally applied in many branches of chemistry, uncentred PLS is perfectly acceptable. Below, though, we use the most widespread implementation (involving centring) for the sake of compatibility with the most common computational implementations of the method.
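The algorithm used in this book is the one listed in Appendix A.2.2; since that listing is not reproduced here, the sketch below implements a generic NIPALS-type PLS1 instead (the function name and the random placeholder data are assumptions), which follows the centring convention just described but need not return exactly the same scores and loadings as the book's algorithm:

```python
import numpy as np

def pls1(X, c, n_components):
    """Generic NIPALS-type PLS1; not necessarily identical to Appendix A.2.2."""
    X = np.asarray(X, dtype=float).copy()
    c = np.asarray(c, dtype=float).copy()
    T, P, q = [], [], []
    for _ in range(n_components):
        w = X.T @ c
        w /= np.linalg.norm(w)             # normalised weight vector
        t = X @ w                          # scores for this component
        tt = t @ t
        p = (X.T @ t) / tt                 # x loadings (not normalised)
        q_a = (c @ t) / tt                 # scalar c loading
        X -= np.outer(t, p)                # deflate the x block
        c -= t * q_a                       # deflate the c block
        T.append(t); P.append(p); q.append(q_a)
    return np.array(T).T, np.array(P), np.array(q)   # T: I x A, P: A x J, q: length A

# Placeholder data: 25 spectra at 27 wavelengths, one compound's concentrations
rng = np.random.default_rng(5)
X = rng.random((25, 27))
c = rng.random(25)

# Centre both blocks, as in the most widespread implementation
x_mean, c_mean = X.mean(axis=0), c.mean()
T, P, q = pls1(X - x_mean, c - c_mean, n_components=8)

c_hat = T @ q + c_mean                     # predicted concentrations (training data)
X_hat = T @ P + x_mean                     # reconstructed spectra

# Magnitude of each PLS component, g_a = (sum_i t_ia^2) * (sum_j p_aj^2)
g = (T ** 2).sum(axis=0) * (P ** 2).sum(axis=1)
```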

For a given compound, the remaining percentage error in the x matrix for a PLS components can be expressed in a variety of ways, as discussed in Section 5.4.2.2. Note that there are slight differences between authors that take into account the number of degrees of freedom left in the model. The predicted measurements simply involve calculating X̂ = T·P and adding on the column means where appropriate; error indicators in the x block can be defined similarly to those used in PCR (see Section 5.4.2.2). The only difference is that each compound generates a separate scores matrix, unlike PCR, where there is a single scores matrix for all compounds in the mixture, and so the behaviour of the x block residuals will differ according to compound.

The concentration of compound n is predicted by

$$\hat{c}_{in} = \sum_{a=1}^{A} t_{ian}\, q_{an} + \bar{c}_n$$


Table 5.13 Calculation of concentration estimates for pyrene using two PLS components.

          Component 1 (q = 0.222)    Component 2 (q = 0.779)
Scores    ti1q      Scores    ti2q      ti1q + ti2q    Estimated concentration (ti1q + ti2q + c̄)
0.088     0.020     0.052     0.050     0.058          0.514
0.532     0.118     0.139     0.133     0.014          0.470
0.041     0.009     0.169     0.162     0.117          0.339
0.143     0.032     0.334     0.319     0.281          0.737
0.391     0.087     0.226     0.216     0.255          0.711
0.457     0.102     0.002     0.002     0.100          0.556
0.232     0.052     0.388     0.371     0.238          0.694
0.191     0.042     0.008     0.007     0.037          0.493
0.117     0.026     0.148     0.142     0.137          0.319
0.189     0.042     0.136     0.130     0.059          0.397
0.250     0.055     0.333     0.319     0.193          0.649
0.621     0.138     0.046     0.044     0.173          0.283
0.412     0.092     0.105     0.101     0.013          0.443
0.575     0.128     0.004     0.004     0.125          0.331
0.076     0.017     0.264     0.253     0.214          0.670
0.264     0.059     0.485     0.464     0.420          0.036
0.358     0.080     0.173     0.165     0.209          0.247
0.485     0.108     0.117     0.112     0.195          0.261
0.162     0.036     0.356     0.340     0.229          0.227
0.008     0.002     0.105     0.100     0.080          0.536
0.038     0.008     0.209     0.200     0.164          0.620
0.148     0.033     0.080     0.076     0.026          0.482
0.197     0.044     0.329     0.315     0.201          0.255
0.518     0.115     0.041     0.039     0.085          0.541
0.432     0.096     0.050     0.048     0.133          0.589

or, in matrix terms,

$$\hat{c}_n - \bar{c}_n = T_n \cdot q_n$$

where $\bar{c}_n$ is a vector whose elements are all equal to the average concentration of compound n. Hence the scores of each PLS component are proportional to the contribution of the component to the concentration estimate. The calculation of the concentration estimates for pyrene using two PLS components is presented in Table 5.13.
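A sketch of how the columns of Table 5.13 are built up (the scores here are random placeholders; the two q values are those quoted in the table, and the mean concentration of 0.456 is inferred from the differences between the last two columns of the table):

```python
import numpy as np

rng = np.random.default_rng(6)
T = rng.normal(size=(25, 2))           # placeholder scores t_i1, t_i2 for pyrene
q = np.array([0.222, 0.779])           # q values quoted in Table 5.13
c_mean = 0.456                         # mean pyrene concentration implied by the table

contrib = T * q                        # columns t_i1*q_1 and t_i2*q_2
running = contrib.cumsum(axis=1)       # t_i1*q_1, then t_i1*q_1 + t_i2*q_2
c_hat = running[:, -1] + c_mean        # estimated concentration for each spectrum
```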

The mean square error in the concentration estimate is defined just as in PCR. It is also possible to define this error in various different ways using t and q. In the case of the c block estimates, it is usual to divide the sum of squares by I - A - 1. These error terms have been discussed in greater detail in Section 5.4.2. The x block is usually mean centred, and so to obtain a percentage error most people divide by the standard deviation, whereas for the c block the estimates are generally expressed in the original concentration units, so we will retain the convention of dividing by the mean concentration unless there is a specific reason for another approach. As in all areas of chemometrics, each group and software developer has their own favourite way of calculating parameters, so it is essential never to accept output from a package blindly.

The calculation of x block error is presented for the case of pyrene. Table 5.14 gives the magnitudes of the first 15 PLS components, defined as the product of the sum of squares for t and p of each component. The total sum of squares of the
