EVOLUTIONARY SIGNALS
(continued from p. 401)

0.159  0.281  0.431  0.192  0.488  0.335  0.196  0.404  0.356  0.265
0.076  0.341  0.629  0.294  0.507  0.442  0.252  0.592  0.352  0.196
0.138  0.581  0.883  0.351  0.771  0.714  0.366  0.805  0.548  0.220
0.223  0.794  1.198  0.543  0.968  0.993  0.494  1.239  0.766  0.216
0.367  0.865  1.439  0.562  1.118  1.130  0.578  1.488  0.837  0.220
0.310  0.995  1.505  0.572  1.188  1.222  0.558  1.550  0.958  0.276
0.355  0.895  1.413  0.509  1.113  1.108  0.664  1.423  0.914  0.308
0.284  0.723  1.255  0.501  0.957  0.951  0.520  1.194  0.778  0.219
0.350  0.593  0.948  0.478  0.738  0.793  0.459  0.904  0.648  0.177
0.383  0.409  0.674  0.454  0.555  0.629  0.469  0.684  0.573  0.126
0.488  0.220  0.620  0.509  0.494  0.554  0.580  0.528  0.574  0.165
0.695  0.200  0.492  0.551  0.346  0.454  0.695  0.426  0.584  0.177
0.877  0.220  0.569  0.565  0.477  0.582  0.747  0.346  0.685  0.168
0.785  0.230  0.486  0.724  0.346  0.601  0.810  0.370  0.748  0.147
0.773  0.204  0.435  0.544  0.321  0.442  0.764  0.239  0.587  0.152
0.604  0.141  0.417  0.504  0.373  0.458  0.540  0.183  0.504  0.073
0.493  0.083  0.302  0.359  0.151  0.246  0.449  0.218  0.392  0.110
0.291  0.050  0.096  0.257  0.034  0.199  0.238  0.142  0.271  0.018
0.204  0.034  0.126  0.097  0.092  0.095  0.215  0.050  0.145  0.034
The aim of this problem is to explore different approaches to signal resolution using a variety of common chemometric methods.
1. Plot a graph of the sum of intensities at each point in time. Verify that it looks as if there are three peaks in the data.
2. Calculate the derivative of the spectrum, scaled at each point in time to a constant sum, at each wavelength as follows.
a. Rescale the spectrum at each point in time by dividing by the total intensity at that point in time, so that the total intensity at each point in time equals 1.
b. Then calculate the smoothed five-point quadratic Savitzky–Golay first derivatives, as presented in Chapter 3, Table 3.6, independently for each of the 10 wavelengths. A table consisting of derivatives at 26 times and 10 wavelengths should be obtained.
c. Superimpose the 10 graphs of derivatives at each wavelength.
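Questions 2 and 3 can be sketched with NumPy and SciPy. The matrix below is a simulated stand-in (the real values come from the table above); `scipy.signal.savgol_filter` with a five-point window and quadratic polynomial reproduces the Savitzky–Golay first-derivative filter of Table 3.6.

```python
import numpy as np
from scipy.signal import savgol_filter

# Simulated stand-in for the 26 x 10 table (times x wavelengths).
rng = np.random.default_rng(0)
X = np.abs(rng.normal(1.0, 0.3, size=(26, 10)))

# a. Scale each point in time to a constant (unit) total intensity.
X_scaled = X / X.sum(axis=1, keepdims=True)

# b. Five-point quadratic Savitzky-Golay first derivative, computed
#    independently down each wavelength (column).
D = savgol_filter(X_scaled, window_length=5, polyorder=2, deriv=1, axis=0)
print(D.shape)   # (26, 10)

# Question 3: mean absolute derivative over the 10 wavelengths at each time.
purity = np.abs(D).mean(axis=1)
```

With the real data, minima of `purity` point at the purest time points.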
3. Summarise the change in derivative with time by calculating the mean of the absolute value of the derivative over all 10 wavelengths at each point in time. Plot a graph of this, and explain why a value close to zero indicates a good pure, or composition-1, point in time. Show that this suggests that points 6, 17 and 26 are good estimates of pure spectra for each component.
4. The concentration profiles of each component can be estimated using MLR as follows.
a. Obtain estimates of the spectra of each pure component at the three points of highest purity, to give an estimated spectral matrix Ŝ.
b. Using MLR, calculate Ĉ = X.Ŝ′.(Ŝ.Ŝ′)⁻¹.
c. Plot a graph of the predicted concentration profiles.
5. An alternative method is PCR. Perform uncentred PCA on the raw data matrix X and verify that there are approximately three components.
6. Using estimates of each pure component given in question 4(a), perform PCR as follows.
a. Using regression, find the matrix R for which Ŝ = R.P, where P is the loadings matrix obtained in question 5; keep three PCs only.
b. Estimate the elution profiles of all three peaks, since Ĉ ≈ T.R⁻¹.
c. Plot these graphically.
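The PCR route (questions 5 and 6) can be sketched on simulated data; the pure-point row indices used here are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.abs(rng.normal(size=(26, 3))) @ np.abs(rng.normal(size=(3, 10)))

# Question 5: uncentred PCA via SVD; keep three PCs.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
T, P = U[:, :3] * s[:3], Vt[:3, :]   # scores (26 x 3), loadings (3 x 10)

# Stand-in pure-spectrum estimates (rows chosen arbitrarily here).
S_hat = X[[0, 12, 25], :]

R = S_hat @ P.T                 # a. S_hat = R.P; P has orthonormal rows
C_hat = T @ np.linalg.inv(R)    # b. C ~= T.R^-1
print(np.allclose(C_hat @ S_hat, X))   # True for an exactly rank-3 X
```

Because P has orthonormal rows, regressing Ŝ onto P reduces to Ŝ.P′; with noisy data the final `allclose` becomes approximate rather than exact.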
Problem 6.5 Titration of Three Spectroscopically Active Compounds with pH
Section 6.2.2 Section 6.3.3 Section 6.4.1.2
The data in the table on page 405 represent the spectra of a mixture of three spectroscopically active species recorded at 25 wavelengths over 36 different values of pH.
1. Perform PCA on the raw uncentred data, and obtain the scores and loadings for the first three PCs.
2. Plot a graph of the loadings of the first PC and superimpose this on the graph of the average spectrum over all the observed pHs, scaling the two graphs so that they are of approximately similar size. Comment on why the first PC is not very useful for discriminating between the compounds.
3. Calculate the logarithm of the correlation coefficient between each successive spectrum, and plot this against pH (there will be 35 numbers; plot the logarithm of the correlation between the spectra at pH 2.15 and 2.24 against the lower pH). Show how this is consistent with there being three different spectroscopic species in the mixture. On the basis of three components, are there pure regions for each component, and over which pH ranges are these?
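The successive-correlation calculation in question 3 can be sketched on a simulated stand-in for the 36 × 25 table; the smooth drift makes neighbouring spectra highly correlated, as real titration spectra are.

```python
import numpy as np

# Simulated stand-in: 36 pH values x 25 wavelengths, drifting smoothly.
rng = np.random.default_rng(6)
base = np.abs(rng.normal(1.0, 0.3, size=25))
drift = np.abs(rng.normal(1.0, 0.3, size=25))
X = np.array([base + (i / 35.0) * drift + 0.01 * rng.normal(size=25)
              for i in range(36)])

# Correlation between each successive pair of spectra: 35 numbers,
# plotted against the lower pH of each pair.
r = np.array([np.corrcoef(X[i], X[i + 1])[0, 1] for i in range(35)])
log_r = np.log10(r)
print(log_r.shape)   # (35,)
```

Dips in `log_r` mark pH regions where the spectrum changes fastest, i.e. where species interconvert.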
4. Centre the data and produce three scores plots: PC2 vs PC1, PC3 vs PC1 and PC3 vs PC2. Label each point with its pH (Excel users will have to adapt the macro provided). Comment on these plots, especially in the light of the correlation graph in question 3.
5. Normalise the scores of the first two PCs obtained in question 4 by dividing by the square root of the sum of squares at each pH. Plot the graph of the normalised scores of PC2 vs PC1, labelling each point as in question 4, and comment.
6. Using the information above, choose one pH which best represents the spectra for each of the three compounds (there may be several answers to this, but they should not differ by a great deal). Plot the spectra of each pure compound, superimposed on one another.
7. Using the guesses of the spectra for each compound in question 6, perform MLR to obtain estimated profiles for each species by Ĉ = X.Ŝ′.(Ŝ.Ŝ′)⁻¹. Plot a graph of the pH profiles of each species.
Problem 6.6 Resolution of Mid-infrared Spectra of a Three-component Mixture
Section 6.2.2 Section 6.2.3.1 Section 6.4.1.2
The table on page 406 represents seven spectra consisting of different mixtures of three compounds, 1,2,3-trimethylbenzene, 1,3,5-trimethylbenzene and toluene, whose mid-infrared spectra have been recorded at 16 cm⁻¹ intervals between 528 and 2000 cm⁻¹; you will need to reorganise the data as a matrix of dimensions 7 × 93.
1. Scale the data so that the sum of the spectral intensities at each wavelength equals 1 (note that this differs from the usual method, which is along the rows, and is a way of putting equal weight on each wavelength). Perform PCA, without further preprocessing, and produce a plot of the loadings of PC2 vs PC1.
2. Wavelengths of low intensity are not very useful. Identify those wavelengths for which the sum over all seven spectra is greater than 10 % of the largest such sum, and label these in the graph in question 1.
3. Comment on the appearance of the graph in question 2, and suggest three wavelengths that are typical of each of the compounds.
4. Using the three wavelengths selected in question 3, obtain a 7 × 3 matrix of relative concentrations in each of the spectra and call this Ĉ.
5. Calling the original data X, obtain the estimated spectra for each compound by Ŝ = (Ĉ′.Ĉ)⁻¹.Ĉ′.X and plot these graphically.
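The left pseudo-inverse in question 5 can be sketched on simulated data; when X = C.S holds exactly, the estimate recovers S.

```python
import numpy as np

# Simulated: 7 spectra, 3 components, 93 wavelengths, X = C.S exactly.
rng = np.random.default_rng(3)
C_hat = np.abs(rng.normal(size=(7, 3)))
S_true = np.abs(rng.normal(size=(3, 93)))
X = C_hat @ S_true

# S_hat = (C'.C)^-1 . C' . X
S_hat = np.linalg.inv(C_hat.T @ C_hat) @ C_hat.T @ X
print(np.allclose(S_hat, S_true))   # True
```

With the real mixture spectra the fit is only approximate, and each row of `S_hat` is one estimated pure spectrum to plot.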
Chemometrics: Data Analysis for the Laboratory and Chemical Plant.
Richard G. Brereton
Copyright © 2003 John Wiley & Sons, Ltd.
ISBNs: 0-471-48977-8 (HB); 0-471-48978-6 (PB)
Appendices
A.1 Vectors and Matrices
A.1.1 Notation and Definitions
A single number is often called a scalar, and is represented by italics, e.g. x.
A vector consists of a row or column of numbers and is represented by bold lower case italics, e.g. x. For example, x = (3 −11 9 0) is a row vector and

    ( 5.6 )
y = ( 2.8 )
    ( 1.9 )

is a column vector.
A matrix is a two-dimensional array of numbers and is represented by bold upper case italics, e.g. X. For example,

X = (  3  12  8 )
    ( −2  14  1 )

is a matrix.
The dimensions of a matrix are normally presented with the number of rows first and the number of columns second, and vectors can be considered as matrices with one dimension equal to 1, so that x above has dimensions 1 × 4 and X has dimensions 2 × 3.
A square matrix is one where the number of columns equals the number of rows. For example,

Y = ( −7   4   −1 )
    ( 11   3    6 )
    (  2  −4  −12 )

is a square matrix.
An identity matrix is a square matrix whose elements are equal to 1 in the diagonal and 0 elsewhere, and is often denoted by I. For example,

I = ( 1  0 )
    ( 0  1 )

is an identity matrix.
The individual elements of a matrix are often referenced as scalars, with subscripts referring to the row and column; hence, in the matrix above, y21 = 11, which is the element in row 2 and column 1. Optionally, a comma can be placed between the subscripts for clarity; this is useful if one of the dimensions exceeds 9.
A.1.2 Matrix and Vector Operations
A.1.2.1 Addition and Subtraction
Addition and subtraction are the most straightforward operations. The matrices (or vectors) must have the same dimensions, and the operation is performed element by element. Hence
( 8  4 )   ( 11   3 )   ( 19   7 )
( 9  7 ) + (  0  −7 ) = (  9   0 )
( 2  4 )   ( −5   6 )   ( −3  10 )
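The same element-by-element rule in code, with small illustrative matrices:

```python
import numpy as np

# Matrices must share dimensions; addition is element by element.
A = np.array([[8, 4], [9, 7]])
B = np.array([[11, 3], [0, -7]])
print(A + B)   # [[19  7]
               #  [ 9  0]]
```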
A.1.2.2 Transpose
Transposing a matrix involves swapping the columns and rows around, and may be denoted by a right-hand-side superscript (′). For example, if
Z = ( 3.1  0.2  6.1  4.8 )
    ( 9.2  3.8  2.0  5.1 )

then

Z′ = ( 3.1  9.2 )
     ( 0.2  3.8 )
     ( 6.1  2.0 )
     ( 4.8  5.1 )
Some authors use a superscript T instead.
A.1.2.3 Multiplication
Matrix and vector multiplication using the ‘dot’ product is denoted by the symbol ‘.’ between matrices. It is only possible to multiply two matrices together if the number of columns of the first matrix equals the number of rows of the second matrix. The number of rows of the product will equal the number of rows of the first matrix, and the number of columns equal the number of columns of the second matrix. Hence a 3 × 2 matrix when multiplied by a 2 × 4 matrix will give a 3 × 4 matrix.
Multiplication of matrices is not commutative; that is, generally A.B ≠ B.A even if the second product is allowable. Matrix multiplication can be expressed in the form of summations. For arrays with more than two dimensions (e.g. tensors), conventional symbolism can be awkward and it is probably easier to think in terms of summations.
If matrix A has dimensions I × J and matrix B has dimensions J × K, then the product C of dimensions I × K has elements defined by
       J
cik = Σ aij bjk
      j=1
Hence

( 9  3 )                      ( 54  93  123  42 )
( 1  7 ) · ( 6  10  11  3 ) = (  6  17   67  38 )
( 2  5 )   ( 0   1   8  5 )   ( 12  25   62  31 )
To illustrate this, the element in the second row and second column of the product is given by 17 = 1 × 10 + 7 × 1.
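The summation formula and the worked product above can be checked with an explicit triple loop:

```python
import numpy as np

A = np.array([[9, 3], [1, 7], [2, 5]])            # 3 x 2
B = np.array([[6, 10, 11, 3], [0, 1, 8, 5]])      # 2 x 4

# c_ik = sum over j of a_ij * b_jk
C = np.zeros((3, 4))
for i in range(3):
    for k in range(4):
        C[i, k] = sum(A[i, j] * B[j, k] for j in range(2))

print(C[1, 1])                    # 17.0 = 1*10 + 7*1
print(np.array_equal(C, A @ B))   # True
```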
When several matrices are multiplied together, it is normal to take any two neighbouring matrices, multiply them together and then multiply this product with another neighbouring matrix. It does not matter in what order this is done: A.B.C = (A.B).C = A.(B.C), so matrix multiplication is associative. Matrix multiplication is also distributive, that is, A.(B + C) = A.B + A.C.
A.1.2.4 Inverse
Most square matrices have inverses, defined as the matrix which, when multiplied with the original matrix, gives the identity matrix; the inverse is represented by a −1 right-hand-side superscript, so that D.D⁻¹ = I. Note that some square matrices do not have inverses: this is caused by correlations in the original matrix; such matrices are called singular matrices.
A.1.2.5 Pseudo-inverse
In several sections of this text we use the idea of a pseudo-inverse. If matrices are not square, it is not possible to calculate an inverse, but the concept of a pseudo-inverse
exists and is employed in regression analysis.
If A = B.C then B′.A = B′.B.C, so (B′.B)⁻¹.B′.A = C, and (B′.B)⁻¹.B′ is said to be the left pseudo-inverse of B.
Equivalently, A.C′ = B.C.C′, so A.C′.(C.C′)⁻¹ = B, and C′.(C.C′)⁻¹ is said to be the right pseudo-inverse of C.
In regression, the equation A ≈ B.C is an approximation; for example, A may represent a series of spectra that are approximately equal to the product of two matrices such as scores and loadings matrices, hence this approach is important to obtain the best fit model for C knowing A and B or for B knowing A and C.
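A sketch of the left pseudo-inverse used as a regression estimator, on simulated data (the names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.normal(size=(20, 3))
C_true = rng.normal(size=(3, 8))
A = B @ C_true + 0.01 * rng.normal(size=(20, 8))   # A ~= B.C

# Least-squares estimate of C via the left pseudo-inverse of B.
C_fit = np.linalg.inv(B.T @ B) @ B.T @ A
# numpy.linalg.pinv computes the same pseudo-inverse more stably.
print(np.allclose(C_fit, np.linalg.pinv(B) @ A))   # True
```

In practice `np.linalg.lstsq` or `pinv` is preferred over forming (B′.B)⁻¹ explicitly, for numerical stability.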
A.1.2.6 Trace and Determinant
Other properties of square matrices sometimes encountered are the trace, which is the sum of the diagonal elements, and the determinant, which relates to the size of the matrix. A determinant of 0 indicates a matrix without an inverse. A very small determinant often suggests that the data are highly correlated, or arise from a poor experimental design, resulting in unreliable predictions. If the dimensions of a matrix are large and the magnitudes of the measurements small (e.g. 10⁻³), it is possible to obtain a determinant close to zero even though the matrix has an inverse; a solution to this problem is to multiply each measurement by a number such as 10³ and then remember to readjust the magnitude of the numbers in the resultant calculations to take account of this later.
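Both properties, and the rescaling trick, can be illustrated numerically on a 3 × 3 example matrix:

```python
import numpy as np

# Trace and determinant of a 3 x 3 example matrix.
Y = np.array([[-7.0, 4.0, -1.0],
              [11.0, 3.0, 6.0],
              [2.0, -4.0, -12.0]])
print(np.trace(Y))          # -16.0, the sum of the diagonal
print(np.linalg.det(Y))     # nonzero, so Y has an inverse

# Small measurement magnitudes shrink the determinant sharply:
# det(k.Y) = k^n det(Y) for an n x n matrix, here n = 3.
print(np.linalg.det(1e-3 * Y) / 1e-9)   # recovers det(Y)
```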
A.1.2.7 Vector Length
An interesting property that chemometricians sometimes use is that the product of the transpose of a column vector with itself equals the sum of squares of the elements of the vector, so that x′.x = Σxi². The length of a vector is given by √(x′.x) = √(Σxi²), or
the square root of the sum of squares of its elements. This can be visualised in geometry as the length of the line from the origin to the point in space indicated by the vector.
A.2 Algorithms
There are many different descriptions of the various algorithms in the literature. This Appendix describes one algorithm for each of four regression methods.
A.2.1 Principal Components Analysis
NIPALS is a common, iterative algorithm often used for PCA. Some authors use another method called SVD (singular value decomposition). The main difference is that NIPALS extracts components one at a time, and can be stopped after the desired number of PCs has been obtained. In the case of large datasets with, for example, 200 variables (e.g. in spectroscopy), this can be very useful and reduce the amount of effort required. The steps are as follows.
Initialisation
1.Take a matrix Z and, if required, preprocess (e.g. mean centre or standardise) to give the matrix X which is used for PCA.
New Principal Component
2. Take a column of this matrix (often the column with the greatest sum of squares) as the first guess of the scores of the first principal component; call it initial t̂.
Iteration for each Principal Component
3. Calculate

unnorm p̂ = initial t̂′.X / Σ(initial t̂)²
4. Normalise the guess of the loadings, so that

p̂ = unnorm p̂ / √(Σ(unnorm p̂)²)
5. Now calculate a new guess of the scores:
new tˆ = X.pˆ
Check for Convergence
6. Check whether this new guess differs from the previous one; a simple approach is to look at the size of the sum of square differences between the old and new scores, i.e. Σ(initial t̂ − new t̂)². If this is small, the PC has been extracted: set the scores (t) and loadings (p) for the current PC to t̂ and p̂. Otherwise, return to step 3, substituting the initial scores by the new scores.
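The steps above can be collected into a minimal NIPALS sketch. The deflation step (subtracting t.p′ from X before extracting the next PC) is standard NIPALS but lies beyond this excerpt; the simulated matrix and the comparison with SVD are illustrative.

```python
import numpy as np

def nipals(X, n_components, tol=1e-12, max_iter=500):
    """Extract principal components one at a time from a preprocessed X."""
    X = X.astype(float).copy()
    scores, loadings = [], []
    for _ in range(n_components):
        # 2. Initial guess: the column with the greatest sum of squares.
        t = X[:, np.argmax((X ** 2).sum(axis=0))].copy()
        for _ in range(max_iter):
            p = (t @ X) / (t @ t)        # 3. unnormalised loadings
            p = p / np.sqrt(p @ p)       # 4. normalise
            t_new = X @ p                # 5. new guess of the scores
            converged = ((t - t_new) ** 2).sum() < tol   # 6. convergence
            t = t_new
            if converged:
                break
        scores.append(t)
        loadings.append(p)
        X = X - np.outer(t, p)           # deflate before the next PC
    return np.array(scores).T, np.array(loadings)

rng = np.random.default_rng(5)
X = rng.normal(size=(15, 6))
Xc = X - X.mean(axis=0)                  # 1. mean centre
T, P = nipals(Xc, 3)
print(T.shape, P.shape)                  # (15, 3) (3, 6)
```

The column norms of T should match the first three singular values of the centred matrix, which is a quick way to check the implementation against SVD.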