

wavelength selection, so enabling inverse models to be produced, but there is no real advantage over classical least squares in these situations.

5.4 Principal Components Regression

MLR-based methods have the disadvantage that all significant components must be known. PCA-based methods do not require details of the spectra or concentrations of all the compounds in a mixture; it is important to make a sensible estimate of how many significant components characterise the mixture, but not necessarily to know their characteristics.

Principal components are primarily abstract mathematical entities and further details are described in Chapter 4. In multivariate calibration the aim is to convert these to compound concentrations. PCR uses regression (sometimes also called transformation or rotation) to convert PC scores to concentrations. This process is often loosely called factor analysis, although terminology differs according to author and discipline. Note that although the chosen example in this chapter involves calibrating concentrations to spectral absorbances, it is equally possible, for example, to calibrate the property of a material to its structural features, or the activity of a drug to molecular parameters.

5.4.1 Regression

If cn is a vector containing the known concentration of compound n in the spectra (25 in this instance), then the PC scores matrix, T, can be related as follows:

$$c_n \approx T \cdot r_n$$

where rn is a column vector whose length equals the number of PCs retained, sometimes called a rotation or transformation vector. Ideally, the length of rn should be equal to the number of compounds in the mixture (= 10 in this case). However, noise, spectral similarities and correlations between concentrations often make it hard to provide an exact estimate of the number of significant components; this topic has been introduced in Section 4.3.3 of Chapter 4. We will assume, for the purpose of this section, that 10 PCs are employed in the model.

The scores of the first 10 PCs are presented in Table 5.10, using raw data. The transformation vector can be obtained by using the pseudo-inverse of T:

$$r_n = (T' \cdot T)^{-1} \cdot T' \cdot c_n$$

Note that the matrix (T′·T) is actually a diagonal matrix, whose elements consist of the 10 eigenvalues of the PCs, so each element of r can be expressed as a summation:

$$r_{na} = \frac{\sum_{i=1}^{I} t_{ia}\, c_{in}}{g_a}, \qquad a = 1, \ldots, A$$
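As a minimal illustration (not code from the book: the NumPy route and the random placeholder numbers standing in for Table 5.10 are assumptions), the transformation vector can be obtained either from the pseudo-inverse or, because T′·T is diagonal for PCA scores, element by element from the eigenvalues:

```python
import numpy as np

# Placeholder data standing in for the case study: I = 25 spectra, A = 10 retained PCs
rng = np.random.default_rng(0)
T = rng.normal(size=(25, 10))      # PC scores matrix (illustrative values only)
c_n = rng.random(25)               # known concentrations of compound n

# Pseudo-inverse form: r_n = (T'T)^-1 T' c_n
r_n = np.linalg.solve(T.T @ T, T.T @ c_n)

# Element-wise form: for genuine PCA scores T'T is diagonal, so each element is
# sum_i t_ia * c_in divided by the eigenvalue g_a = sum_i t_ia^2
g = np.sum(T ** 2, axis=0)
r_n_elementwise = (T.T @ c_n) / g  # coincides with r_n only when the scores are orthogonal
```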


Table 5.10 Scores of the first 10 PCs for the PAH case study (one row per spectrum 1-25, one column per PC 1-10).

2.757  0.008  0.038  0.008  0.026  0.016  0.012  0.004  0.006  0.006
3.652  0.063  0.238  0.006  0.021  0.000  0.018  0.005  0.009  0.013
2.855  0.022  0.113  0.049  0.187  0.039  0.053  0.004  0.007  0.003
2.666  0.267  0.040  0.007  0.073  0.067  0.002  0.002  0.013  0.006
3.140  0.029  0.006  0.153  0.111  0.015  0.030  0.022  0.014  0.006
3.437  0.041  0.090  0.034  0.027  0.018  0.010  0.014  0.032  0.006
1.974  0.161  0.296  0.107  0.090  0.010  0.003  0.037  0.004  0.008
2.966  0.129  0.161  0.147  0.043  0.016  0.006  0.010  0.013  0.035
2.545  0.054  0.143  0.080  0.074  0.073  0.013  0.008  0.025  0.006
3.017  0.425  0.159  0.002  0.096  0.049  0.010  0.013  0.018  0.022
2.005  0.371  0.003  0.120  0.093  0.032  0.015  0.025  0.003  0.003
1.648  0.239  0.020  0.123  0.090  0.051  0.017  0.021  0.009  0.007
1.884  0.215  0.020  0.167  0.024  0.041  0.041  0.007  0.017  0.001
1.666  0.065  0.126  0.070  0.007  0.005  0.016  0.036  0.025  0.009
2.572  0.085  0.028  0.095  0.184  0.046  0.045  0.016  0.013  0.006
2.532  0.262  0.126  0.047  0.084  0.076  0.004  0.005  0.017  0.010
2.171  0.014  0.028  0.166  0.080  0.008  0.018  0.007  0.003  0.007
1.900  0.020  0.027  0.015  0.006  0.018  0.005  0.030  0.029  0.002
3.174  0.114  0.312  0.059  0.014  0.066  0.009  0.016  0.011  0.007
2.610  0.204  0.037  0.036  0.069  0.041  0.020  0.005  0.012  0.033
2.567  0.119  0.155  0.090  0.017  0.050  0.017  0.023  0.013  0.005
2.389  0.445  0.190  0.045  0.091  0.026  0.026  0.021  0.006  0.015
3.201  0.282  0.043  0.062  0.066  0.015  0.026  0.032  0.015  0.009
3.537  0.182  0.071  0.166  0.094  0.069  0.026  0.013  0.016  0.021
3.343  0.113  0.252  0.012  0.086  0.019  0.031  0.048  0.018  0.004

Table 5.11 Vector r for pyrene.

0.166  0.470  0.624  0.168  1.899  1.307  1.121  0.964  3.106  0.020

or even as a product of vectors:

$$r_{na} = \frac{t_a' \cdot c_n}{g_a}, \qquad a = 1, \ldots, A$$

We remind the reader of the main notation:

n refers to compound number (e.g. pyrene = 1);

a to PC number (e.g. 10 significant components, not necessarily equal to the number of compounds in a series of samples);

i to sample number (=1–25 in this case).


This vector for pyrene using 10 PCs is presented in Table 5.11. If the concentrations of some or all the compounds are known, PCR can be extended simply by replacing the vector cn with a matrix C, each column corresponding to a compound in the mixture, so that

$$C \approx T \cdot R$$

and

$$R = (T' \cdot T)^{-1} \cdot T' \cdot C$$

The number of PCs must be at least equal to the number of compounds of interest in the mixture. R has dimensions A × N .
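A brief sketch of this matrix form (the NumPy code and the placeholder arrays below are illustrative assumptions, not the case-study values):

```python
import numpy as np

rng = np.random.default_rng(1)
T = rng.normal(size=(25, 10))          # PC scores: I = 25 samples, A = 10 components
C = rng.random((25, 10))               # known concentrations: N = 10 compounds

# R = (T'T)^-1 T' C, one column per compound; R has dimensions A x N
R = np.linalg.solve(T.T @ T, T.T @ C)
C_hat = T @ R                          # concentration estimates for all compounds at once
```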

If the number of PCs and the number of significant compounds are equal, so that, in this example, T and C are 25 × 10 matrices, then R is a square matrix of dimensions N × N and

$$X = T \cdot P = T \cdot R \cdot R^{-1} \cdot P = \hat{C} \cdot \hat{S}$$

Hence, by calculating R⁻¹·P, it is possible to determine the estimated spectra of each individual component without knowing this information in advance, and by calculating T·R concentration estimates can be obtained. Table 5.12 provides the concentration estimates using PCR with 10 significant components.

Table 5.12 Concentration estimates for the PAHs using PCR and 10 components (uncentred). PAH concentration (mg l^-1).

Spectrum No.  Py     Ace    Anth   Acy    Chry   Benz   Fluora  Fluore  Nap    Phen
1             0.505  0.113  0.198  0.131  0.375  1.716  0.128   0.618   0.094  0.445
2             0.467  0.120  0.286  0.113  0.455  2.686  0.137   0.381   0.168  0.782
3             0.161  0.178  0.296  0.174  0.558  1.647  0.094   0.836   0.162  0.161
4             0.682  0.177  0.231  0.165  0.354  1.119  0.123   0.720   0.049  0.740
5             0.810  0.128  0.297  0.156  0.221  2.154  0.159   0.316   0.189  0.482
6             0.575  0.170  0.159  0.107  0.428  2.240  0.072   0.942   0.146  1.000
7             0.782  0.152  0.104  0.152  0.470  0.454  0.162   0.477   0.169  0.220
8             0.401  0.111  0.192  0.170  0.097  2.153  0.182   1.014   0.062  0.322
9             0.284  0.084  0.237  0.106  0.429  1.668  0.166   0.241   0.022  0.331
10            0.578  0.197  0.023  0.157  0.321  2.700  0.077   0.194   0.090  0.300
11            0.609  0.075  0.194  0.103  0.550  0.460  0.080   0.472   0.038  0.656
12            0.185  0.172  0.183  0.147  0.083  0.558  0.086   0.381   0.101  0.701
13            0.555  0.103  0.241  0.092  0.084  1.104  0.007   0.576   0.156  0.475
14            0.461  0.167  0.067  0.089  0.212  0.624  0.111   0.812   0.114  0.304
15            0.770  0.019  0.076  0.089  0.115  1.669  0.178   0.393   0.068  0.884
16            0.109  0.033  0.101  0.040  0.349  2.189  0.108   0.352   0.190  0.376
17            0.178  0.102  0.073  0.086  0.481  1.057  0.112   0.805   0.073  0.486
18            0.271  0.067  0.145  0.104  0.221  1.077  0.183   0.273   0.142  0.250
19            0.186  0.135  0.217  0.101  0.253  2.618  0.071   0.369   0.036  0.919
20            0.406  0.109  0.111  0.145  0.534  1.126  0.111   0.306   0.220  0.973
21            0.665  0.110  0.152  0.130  0.284  1.541  0.044   0.720   0.165  0.614
22            0.315  0.112  0.258  0.092  0.336  0.501  0.205   0.981   0.162  1.009
23            0.327  0.179  0.126  0.115  0.126  2.610  0.160   0.847   0.161  0.537
24            0.766  0.075  0.168  0.029  0.525  2.692  0.121   1.139   0.124  0.383
25            0.333  0.110  0.053  0.210  0.539  2.151  0.135   0.709   0.086  0.738
E%            10.27  36.24  15.76  42.06  9.05   4.24   31.99   24.77   21.11  16.19


The percentage mean square error of prediction (equal to the square root of the sum of squares of the prediction errors divided by the number of degrees of freedom, 15 (= 25 - 10), to account for the number of components in the model, rather than by 25) for all 10 compounds is also presented in Table 5.12. In most cases it is slightly better than using MLR; there are certainly fewer very large errors. However, the major advantage is that the prediction using PCR is the same whether only one or all 10 compounds are included in the model. In this it differs radically from MLR; the estimates in Table 5.9 are much worse than those in Table 5.8, for example. The first main task when using PCR is to determine how many significant components are necessary to model the data.
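The PCR calculation described above (estimated concentrations Ĉ = T·R, estimated spectra Ŝ = R⁻¹·P, and the percentage error using the 25 - 10 degrees-of-freedom divisor) can be sketched end to end as follows; the SVD route to uncentred PCA and the random placeholder data are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((25, 27))               # mixture spectra (placeholder data)
C = rng.random((25, 10))               # known concentrations of the 10 compounds

# Uncentred PCA via SVD, keeping A = 10 components, so that X is approximated by T.P
U, s, Vt = np.linalg.svd(X, full_matrices=False)
A = 10
T, P = U[:, :A] * s[:A], Vt[:A, :]

R = np.linalg.solve(T.T @ T, T.T @ C)  # square 10 x 10 transformation matrix
C_hat = T @ R                          # estimated concentrations
S_hat = np.linalg.solve(R, P)          # estimated spectra, R^-1 . P

# Percentage error for each compound, dividing by 25 - 10 degrees of freedom
E = np.sqrt(((C - C_hat) ** 2).sum(axis=0) / (25 - A))
E_percent = 100 * E / C.mean(axis=0)
```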

5.4.2 Quality of Prediction

A key issue in calibration is to determine how well the data have been modelled. We have used only one indicator above, but it is important to appreciate that there are many other potential statistics.

5.4.2.1 Modelling the c Block

Most indicators look at how well the concentration is predicted, that is, the c block of data (called the y block by some authors).

The simplest method is to determine the sum of squares of the residuals between the true and predicted concentrations:

$$S_c = \sum_{i=1}^{I} (c_{in} - \hat{c}_{in})^2$$

where

$$\hat{c}_{in} = \sum_{a=1}^{A} t_{ia}\, r_{an}$$

for compound n using a principal components. The larger this error, the worse the prediction; note that the error decreases as more components are calculated.

Often the error is reported as a root mean square error:

$$E = \sqrt{\frac{\sum_{i=1}^{I} (c_{in} - \hat{c}_{in})^2}{I - a}}$$

If the data are centred, a further degree of freedom is lost, so the sum of squared residuals is divided by I - a - 1.

This error can also be presented as a percentage error:

$$E\% = 100E/\bar{c}_n$$

where $\bar{c}_n$ is the mean concentration in the original units. Sometimes the percentage of the standard deviation is calculated instead, but in this text we will compute errors as a percentage of the mean unless specifically stated otherwise.
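A short sketch of these c block error measures for a single compound (the predictions below are random placeholders; only the divisor conventions follow the text):

```python
import numpy as np

rng = np.random.default_rng(3)
I, a = 25, 10
c_true = rng.random(I)                        # known concentrations of one compound
c_hat = c_true + 0.05 * rng.normal(size=I)    # placeholder predictions from an a-PC model

S_c = np.sum((c_true - c_hat) ** 2)           # sum of squared residuals

E_uncentred = np.sqrt(S_c / (I - a))          # divisor I - a for uncentred data
E_centred = np.sqrt(S_c / (I - a - 1))        # one further degree of freedom lost
E_percent = 100 * E_uncentred / c_true.mean() # error as a percentage of the mean
```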


5.4.2.2 Modelling the x Block

It is also possible to report errors in terms of quality of modelling of spectra (or chromatograms), often called the x block error.

The quality of modelling of the spectra using PCA (the x variance) can likewise be calculated as follows:

$$S_x = \sum_{i=1}^{I}\sum_{j=1}^{J} (x_{ij} - \hat{x}_{ij})^2$$

where

$$\hat{x}_{ij} = \sum_{a=1}^{A} t_{ia}\, p_{aj}$$

 

 

However, this error can also be expressed in terms of eigenvalues or scores, so that

$$S_x = \sum_{i=1}^{I}\sum_{j=1}^{J} x_{ij}^2 - \sum_{a=1}^{A} g_a = \sum_{i=1}^{I}\sum_{j=1}^{J} x_{ij}^2 - \sum_{a=1}^{A}\sum_{i=1}^{I} t_{ia}^2$$

for A principal components. These can be converted to root mean square errors as above:

$$E = \sqrt{S_x / (I \cdot J)}$$

Note that many people divide by I·J (= 25 × 27 = 675 in our case) rather than the more strictly correct I·J - a (adjusting for degrees of freedom), because I·J is very large relative to a, and we will adopt this convention.

The percentage root mean square error may be defined (for uncentred data) by

$$E\% = 100E/\bar{x}$$

where $\bar{x}$ is the mean of the original measurements.

Note that if x is centred, the divisor is often given by

$$\sqrt{\frac{\sum_{i=1}^{I}\sum_{j=1}^{J} (x_{ij} - \bar{x}_j)^2}{I \cdot J}}$$

where $\bar{x}_j$ is the average of the measurements of variable j over all samples. Obviously there are several other ways of defining this error: if you try to follow a paper or a package, read very carefully the documents provided by the authors, and if there is no documentation, do not trust the answers.
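The corresponding x block quantities might be computed as follows (again random placeholder data; the SVD-based uncentred PCA is only a convenient way of generating T and P for the sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
I, J, A = 25, 27, 10
X = rng.random((I, J))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
T, P = U[:, :A] * s[:A], Vt[:A, :]
X_hat = T @ P

# Residual sum of squares, directly and via the eigenvalues g_a = sum_i t_ia^2
S_x_residual = np.sum((X - X_hat) ** 2)
g = np.sum(T ** 2, axis=0)
S_x_eigen = np.sum(X ** 2) - np.sum(g)        # identical for orthogonal PCA scores

E = np.sqrt(S_x_residual / (I * J))           # dividing by I.J rather than I.J - a
E_percent_uncentred = 100 * E / X.mean()

# For column-centred data, the divisor of E% is the root mean square deviation
# of the measurements about their column means
centred_divisor = np.sqrt(np.sum((X - X.mean(axis=0)) ** 2) / (I * J))
```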

Note that the x error depends only on the number of PCs, no matter how many compounds are being modelled, but the error in concentration estimates depends also on the specific compound, there being a different percentage error for each compound in the mixture. For 0 PCs, the estimates of the PCs and concentrations are simply 0 or, if mean-centred, the mean. The root mean square errors for the concentration estimates of pyrene and for the spectra, as increasing numbers of PCs are calculated, are plotted in Figure 5.11, using a logarithmic scale for the error. Although the x error graph declines steeply, which might falsely suggest that only a small number of PCs


are required for the model, the c error graph exhibits a much gentler decline. Sometimes these graphs are presented either as percentage variance remaining (or explained by each PC) or as eigenvalues.

Figure 5.11 Root mean square errors of estimation of pyrene using uncentred PCR: (a) x block; (b) c block (RMS error, logarithmic scale, against component number)

5.5 Partial Least Squares

PLS is often presented as the major regression technique for multivariate data. In fact its use is not always justified by the data, and the originators of the method were well aware of this, but, that being said, in some applications PLS has been spectacularly successful. In some areas such as QSAR, or even biometrics and psychometrics,


PLS is an invaluable tool, because the underlying factors have little or no physical meaning, so a linearly additive model in which each underlying factor can be interpreted chemically is not anticipated. In spectroscopy or chromatography we usually expect linear additivity, and this is especially important for chemical instrumental data, and under such circumstances simpler methods such as MLR are often useful provided that there is a fairly full knowledge of the system. However, PLS is always an important tool when there is partial knowledge of the data, a well-known example being the measurement of protein in wheat by NIR spectroscopy. A model can be obtained from a series of wheat samples, and PLS will use typical features in this dataset to establish a relationship to the known amount of protein. PLS models can be very robust provided that future samples contain similar features to the original data, but the predictions are essentially statistical. Another example is the determination of vitamin C in orange juices using spectroscopy: a very reliable PLS model could be obtained using orange juices from a particular region of Spain, but what if some Brazilian orange juice is included? There is no guarantee that the model will perform well on the new data, as there may be different spectral features, so it is always important to be aware of the limitations of the method, particularly to remember that the use of PLS cannot compensate for poorly designed experiments or inadequate experimental data.

An important feature of PLS is that it takes into account errors in both the concentration estimates and the spectra. A method such as PCR assumes that the concentration estimates are error free. Much traditional statistics rests on this assumption, that all the errors lie in the measured variables (the spectra). If in medicine it is decided to determine the concentration of a compound in the urine of patients as a function of age, it is assumed that age can be estimated exactly, the statistical variation being in the concentration of the compound and the nature of the urine sample. Yet in chemistry there are often significant errors in sample preparation, for example in the accuracy of weighings and dilutions, and so the independent variable in itself also contains errors. Classical and inverse calibration force the user to choose which variable contains the error, whereas PLS assumes that it is equally distributed in both the x and c blocks.

5.5.1 PLS1

The most widespread approach is often called PLS1. Although there are several algorithms, the main ones being due to Wold and Martens, the overall principles are fairly straightforward. Instead of modelling exclusively the x variables, two sets of models

are obtained as follows:

X = T .P + E

c = T .q + f

where q has analogies to a loadings vector, although it is not normalised. These matrices are represented in Figure 5.12. The product of T and P approximates to the spectral data and the product of T and q to the true concentrations; the common link is T. An important feature of PLS is that it is possible to obtain a scores matrix that is common to both the concentrations (c) and measurements (x). Note that T and P for PLS are different from T and P obtained in PCA, and unique sets of scores and loadings are obtained for each compound in the dataset. Hence if there are 10 compounds

of interest, there will be 10 sets of T, P and q. In this way PLS differs from PCR, in which there is only one set of T and P, the PCA step taking no account of the c block. It is important to recognise that there are several algorithms for PLS available in the literature, and although the predictions of c are the same in each case, the scores and loadings are not. In this book and the associated Excel software, we use the algorithm of Appendix A.2.2. Although the scores are orthogonal (as in PCA), the loadings are not (an important difference from PCA), and, furthermore, the loadings are not normalised, so the sum of squares of each p vector does not equal one. If you are using a commercial software package, it is important to check exactly what constraints and assumptions the authors make about the scores and loadings in PLS.

Figure 5.12 Principles of PLS1: X (I × J) is decomposed as T (I × A) . P (A × J) + E, and c (I × 1) as T (I × A) . q (A × 1) + f

Additionally, the analogue of $g_a$, the eigenvalue of a PC, involves multiplying the sums of squares of both $t_a$ and $p_a$ together, so we define the magnitude of a PLS component as

$$g_a = \left( \sum_{i=1}^{I} t_{ia}^2 \right) \left( \sum_{j=1}^{J} p_{aj}^2 \right)$$

This will have the property that the sum of the values of ga over all nonzero components adds up to the sum of squares of the original (preprocessed) data. Note that, in contrast to PCA, the size of successive values of ga does not necessarily decrease as each component is calculated. This is because PLS does not model only the x data; it is a compromise between x and c block regression.

There are a number of alternative ways of presenting the PLS regression equations in the literature, all mathematically equivalent. In the models above, we obtain three arrays T, P and q. Some authors calculate a normalised vector, w, proportional to q, and the second equation becomes

c = T .B.w + f


where B is a diagonal matrix. Analogously, it is also possible to define the PCA decomposition of x as a product of three arrays (the SVD method, which is used in Matlab, is a common alternative to NIPALS), but the models used in this chapter have the simplicity of using a single scores matrix for both blocks of data, and of modelling each dataset using two matrices.

For a dataset consisting of 25 spectra observed at 27 wavelengths, for which eight PLS components are calculated, there will be

a T matrix of dimensions 25 × 8;

a P matrix of dimensions 8 × 27;

an E matrix of dimensions 25 × 27;

a q vector of dimensions 8 × 1;

an f vector of dimensions 25 × 1.

Note that there will be 10 separate sets of these arrays, in the case discussed in this chapter, one for each compound in the mixture, and that the T matrix will be compound dependent, which differs from PCR.

Each successive PLS component approximates both the concentration and spectral data better. For each PLS component, there will be a

spectral scores vector t;

spectral loadings vector p;

concentration loadings scalar q.

In most implementations of PLS it is conventional to centre both the x and c data, by subtracting the mean of each column, before analysis. In fact, there is no general scientific need to do this. Many spectroscopists and chromatographers perform PCA uncentred, but many early applications of PLS (e.g. outside chemistry) were of such a nature that centring the data was appropriate. Many of the historical developments in PLS as used for multivariate calibration in chemistry relate to applications in NIR spectroscopy, where there are specific spectroscopic problems, such as baseline effects, which in turn favour centring. However, as generally applied in many branches of chemistry, uncentred PLS is perfectly acceptable. Below, though, we use the most widespread implementation (involving centring) for the sake of compatibility with the most common computational implementations of the method.
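The algorithm used in this book is the one listed in Appendix A.2.2; since that listing is not reproduced here, the sketch below implements a generic NIPALS-type PLS1 instead (the function name and the random placeholder data are assumptions), which follows the centring convention just described but need not return exactly the same scores and loadings as the book's algorithm:

```python
import numpy as np

def pls1(X, c, n_components):
    """Generic NIPALS-type PLS1; not necessarily identical to Appendix A.2.2."""
    X = np.asarray(X, dtype=float).copy()
    c = np.asarray(c, dtype=float).copy()
    T, P, q = [], [], []
    for _ in range(n_components):
        w = X.T @ c
        w /= np.linalg.norm(w)             # normalised weight vector
        t = X @ w                          # scores for this component
        tt = t @ t
        p = (X.T @ t) / tt                 # x loadings (not normalised)
        q_a = (c @ t) / tt                 # scalar c loading
        X -= np.outer(t, p)                # deflate the x block
        c -= t * q_a                       # deflate the c block
        T.append(t); P.append(p); q.append(q_a)
    return np.array(T).T, np.array(P), np.array(q)   # T: I x A, P: A x J, q: length A

# Placeholder data: 25 spectra at 27 wavelengths, one compound's concentrations
rng = np.random.default_rng(5)
X = rng.random((25, 27))
c = rng.random(25)

# Centre both blocks, as in the most widespread implementation
x_mean, c_mean = X.mean(axis=0), c.mean()
T, P, q = pls1(X - x_mean, c - c_mean, n_components=8)

c_hat = T @ q + c_mean                     # predicted concentrations (training data)
X_hat = T @ P + x_mean                     # reconstructed spectra

# Magnitude of each PLS component, g_a = (sum_i t_ia^2) * (sum_j p_aj^2)
g = (T ** 2).sum(axis=0) * (P ** 2).sum(axis=1)
```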

For a given compound, the remaining percentage error in the x matrix for a PLS components can be expressed in a variety of ways, as discussed in Section 5.4.2.2. Note that there are slight differences between authors that take into account the number of degrees of freedom left in the model. The predicted measurements simply involve calculating X̂ = T·P and adding on the column means where appropriate; error indicators in the x block can be defined similarly to those used in PCR (see Section 5.4.2.2). The only difference is that each compound generates a separate scores matrix, unlike PCR, where there is a single scores matrix for all compounds in the mixture, and so the behaviour of the x block residuals will differ according to compound.

The concentration of compound n is predicted by

$$\hat{c}_{in} = \sum_{a=1}^{A} t_{ian}\, q_{an} + \bar{c}_n$$


Table 5.13 Calculation of concentration estimates for pyrene using two PLS components.

          Component 1 (q = 0.222)    Component 2 (q = 0.779)
Scores    ti1q      Scores    ti2q      ti1q + ti2q    Estimated concentration (ti1q + ti2q + c̄)
0.088     0.020     0.052     0.050     0.058          0.514
0.532     0.118     0.139     0.133     0.014          0.470
0.041     0.009     0.169     0.162     0.117          0.339
0.143     0.032     0.334     0.319     0.281          0.737
0.391     0.087     0.226     0.216     0.255          0.711
0.457     0.102     0.002     0.002     0.100          0.556
0.232     0.052     0.388     0.371     0.238          0.694
0.191     0.042     0.008     0.007     0.037          0.493
0.117     0.026     0.148     0.142     0.137          0.319
0.189     0.042     0.136     0.130     0.059          0.397
0.250     0.055     0.333     0.319     0.193          0.649
0.621     0.138     0.046     0.044     0.173          0.283
0.412     0.092     0.105     0.101     0.013          0.443
0.575     0.128     0.004     0.004     0.125          0.331
0.076     0.017     0.264     0.253     0.214          0.670
0.264     0.059     0.485     0.464     0.420          0.036
0.358     0.080     0.173     0.165     0.209          0.247
0.485     0.108     0.117     0.112     0.195          0.261
0.162     0.036     0.356     0.340     0.229          0.227
0.008     0.002     0.105     0.100     0.080          0.536
0.038     0.008     0.209     0.200     0.164          0.620
0.148     0.033     0.080     0.076     0.026          0.482
0.197     0.044     0.329     0.315     0.201          0.255
0.518     0.115     0.041     0.039     0.085          0.541
0.432     0.096     0.050     0.048     0.133          0.589

or, in matrix terms,

$$\hat{c}_n - \bar{c}_n = T_n \cdot q_n$$

where $\bar{c}_n$ is a vector whose elements are all equal to the average concentration of compound n. Hence the scores of each PLS component are proportional to the contribution of the component to the concentration estimate. The calculation of the concentration estimates for pyrene using two PLS components is presented in Table 5.13.
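A sketch of how the columns of Table 5.13 are built up (the scores here are random placeholders; the two q values are those quoted in the table, and the mean concentration of 0.456 is inferred from the differences between the last two columns of the table):

```python
import numpy as np

rng = np.random.default_rng(6)
T = rng.normal(size=(25, 2))           # placeholder scores t_i1, t_i2 for pyrene
q = np.array([0.222, 0.779])           # q values quoted in Table 5.13
c_mean = 0.456                         # mean pyrene concentration implied by the table

contrib = T * q                        # columns t_i1*q_1 and t_i2*q_2
running = contrib.cumsum(axis=1)       # t_i1*q_1, then t_i1*q_1 + t_i2*q_2
c_hat = running[:, -1] + c_mean        # estimated concentration for each spectrum
```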

The mean square error in the concentration estimate is defined just as in PCR. It is also possible to define this error in various different ways using t and q. In the case of the c block estimates, it is usual to divide the sum of squares by I - A - 1. These error terms have been discussed in greater detail in Section 5.4.2. The x block is usually mean centred, and so to obtain a percentage error most people divide by the standard deviation, whereas for the c block the estimates are generally expressed in the original concentration units, so we will retain the convention of dividing by the mean concentration unless there is a specific reason for another approach. As in all areas of chemometrics, each group and software developer has their own favourite way of calculating parameters, so it is essential never to accept output from a package blindly.

The calculation of x block error is presented for the case of pyrene. Table 5.14 gives the magnitudes of the first 15 PLS components, defined as the product of the sum of squares for t and p of each component. The total sum of squares of the
