Brereton Chemometrics
EVOLUTIONARY SIGNALS
Figure 6.12
Scores plot of dataset A, each row summed to a constant total, PC2 versus PC1: (a) entire dataset; (b) expansion of region datapoints 5 to 19; (c) performing the scaling and then PCA exclusively over points 5 to 19
A (fastest) and B, and another between compounds B and C (slowest). Some important features are of interest. The first is that there are now three main directions in the graph, but the direction due to B is unlikely to represent the pure compound, and probably the line would need to be extended further along the top right-hand corner. However, it looks likely that there is only a small or negligible region where the three components co-elute, otherwise the graph could not easily be characterised by two straight lines. The trends are clearer in three dimensions [Figure 6.13(b)]. Note that the point at time 5 is probably influenced by noise.
Summing each row to a constant total is not the only method of dealing with individual rows or spectra. Two variations below can be employed.
1. Selective summation to constant total. This allows each portion of a row to be scaled to a constant total, for example it might be interesting to scale the wavelengths 200–300, 400–500 and 500–600 nm each to 1. Or perhaps the wavelengths 200–300 nm are more diagnostic than the others, so why not scale these to a total of 5, and the others to a total of 1? Sometimes more than one type of measurement can be used to study an evolutionary process, such as UV/vis and MS, and each data block could be scaled to a constant total. When doing selective summation it is important to consider very carefully the consequences of preprocessing.
2. Scaling to a base peak. In some forms of measurement, such as mass spectrometry (e.g. LC–MS or GC–MS), it is possible to select a base peak and scale to this; for
6.2.3.2 Scaling the Columns
In many cases it is useful to scale along the columns, e.g. each wavelength or mass number or spectral frequency. This can be used to put all the variables on a similar scale.
Mean centring, which involves subtracting the mean of each column, is the simplest method. Many PC packages do this automatically, but in signal analysis it is often inappropriate, because the interest is in variability above the baseline rather than around an average.
Standardisation is a common technique that has already been discussed (Chapter 4, Section 4.3.6.4) and is sometimes called autoscaling. It can be mathematically described by
$$
{}^{\text{stand}}x_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{\sum_{i=1}^{I} (x_{ij} - \bar{x}_j)^2 / I}}
$$
where there are I points in time and $\bar{x}_j$ is the average of variable j. Note that it is conventional to divide by I rather than I − 1 in this application; when doing the calculations, check whether the package defaults to the ‘population’ or the ‘sample’ standard deviation. Matlab users should be particularly careful when performing this scaling. Standardisation can be useful, for example, in mass spectrometry, where the variation of an intense peak (such as a molecular ion of isomers) is no more significant than that of a much less intense peak, such as a significant fragment ion. However, standardisation will also emphasise variables that are pure noise, and if there are, for example, 200 mass numbers of which 180 correspond to noise, this could substantially degrade the analysis.
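As a minimal sketch of the standardisation described above, the following Python code mean-centres each column and divides by the population standard deviation (divisor I). The function name `standardise` and the small matrix `X` are illustrative assumptions, not taken from the text. In Python's standard library the corresponding choice is between `statistics.pstdev` (population) and `statistics.stdev` (sample); here the divisor is written out explicitly so that it is visible.

```python
import math


def standardise(X):
    """Column-standardise a matrix given as a list of rows.

    Each column is mean-centred and divided by its population
    standard deviation (divisor I, not I - 1).
    """
    n_rows = len(X)
    n_cols = len(X[0])
    means = [sum(row[j] for row in X) / n_rows for j in range(n_cols)]
    sds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n_rows)
        for j in range(n_cols)
    ]
    # A constant column has zero standard deviation and cannot be standardised.
    if any(sd == 0 for sd in sds):
        raise ValueError("cannot standardise a constant column")
    return [[(row[j] - means[j]) / sds[j] for j in range(n_cols)] for row in X]


# Hypothetical values: one intense variable and one weak variable.
X = [[100.0, 1.0], [60.0, 3.0], [20.0, 2.0]]
Z = standardise(X)
```

After this scaling every column has zero mean and a sum of squares equal to I, so the intense and the weak variable carry equal weight. The same property is what inflates pure-noise columns.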
The most dramatic change is normally to the loadings plot. Figure 6.14 illustrates this for dataset B. The scores plot hardly changes in appearance. The loadings plot, however, has changed considerably and is much clearer and more spread out than in Figure 6.6.
Standardisation is most useful if the magnitudes of the variables are very different, as might occur in LC–MS. Table 6.3 is of dataset C, which consists of 25 points in time and eight measurements, making a 25 × 8 data matrix. As can be seen, the magnitude of the measurements is different, with variable H having a maximum of 100, but others being much smaller. We assume that the variables are not in a particular sequence, or are not best represented sequentially, so the loadings graphs will consist of a series of points that are not joined up. Figure 6.15 is of the raw profile together with scores and loadings plots. The scores plot suggests that there are two components in the mixture, but the loadings are not very well distinguished and are dominated by variable H. Standardisation (Figure 6.16) largely retains the pattern in the scores plot but the loadings change radically in appearance, and in this case fall approximately on a circle because there are two main components in the mixture. The variables corresponding most to each pure component fall at the ends of the circle. It is important to recognise that this pattern is an approximation and will only happen if there are two main components, otherwise the loadings will fall on to the surface of a sphere (if three PCs are employed and there are three compounds in the mixture) and so on. However, standardisation can have a remarkable influence on the appearance of loadings plots.
Figure 6.14
Scores and loadings of PC2 versus PC1 after dataset B has been standardised
Table 6.3 Two-way dataset C.

        A        B        C        D        E        F        G        H
 1    0.407    0.149    0.121    0.552   −0.464    0.970    0.389   −0.629
 2    0.093   −0.062    0.084   −0.015   −0.049    0.178    0.478    1.073
 3    0.044    0.809    0.874    0.138    0.529   −1.180    0.040    1.454
 4   −0.073    0.307   −0.205    0.518    1.314    2.053    0.658    7.371
 5    1.461    1.359   −0.272    1.087    2.801    0.321    0.080   20.763
 6    1.591    4.580    0.207    2.381    5.736    3.334    2.155   41.393
 7    4.058    7.030    0.280    2.016    9.001    4.651    3.663   67.949
 8    4.082    8.492    0.304    4.180   11.916    5.705    4.360   92.152
 9    5.839   10.469    0.529    3.764   12.184    6.808    3.739  105.228
10    5.688   10.525    1.573    5.193   12.100    5.720    5.621  106.111
11    3.883   10.111    2.936    4.802   10.026    5.292    7.061   99.404
12    3.630    9.139    2.356    4.739    9.257    4.478    7.530   92.409
13    2.279    8.052    3.196    3.777    9.926    3.228   10.012   92.727
14    2.206    7.952    4.229    5.118    8.629    1.869    9.403   86.828
15    1.403    5.906    2.867    4.229    7.804    1.234    8.774   73.230
16    1.380    5.523    1.720    2.529    4.845    2.249    6.621   52.831
17    0.991    2.820    0.825    1.986    2.790    1.229    3.571   31.438
18    0.160    0.993    0.715    0.591    1.594    0.880    1.662   15.701
19    0.562   −0.018   −0.348   −0.290    0.567    0.070    1.257    6.528
20    0.590   −0.308   −0.715    0.490    0.384    0.595    0.409    2.657
21    0.309    0.371   −0.394    0.077   −0.517    0.434   −0.250    0.551
22   −0.132   −0.081   −0.861   −0.279   −0.622   −0.640    1.166    0.079
23    0.371    0.342   −0.226    0.374   −0.284    0.177   −0.751   −0.197
24   −0.215   −0.577   −0.297    0.834    0.720   −0.248    0.470   −1.053
25   −0.051    0.608   −0.070   −0.087   −0.068   −0.537   −0.208    0.601
Sometimes weighting by the standard deviation can be performed without centring, so that

$$
{}^{\text{scaled}}x_{ij} = \frac{x_{ij}}{\sqrt{\sum_{i=1}^{I} (x_{ij} - \bar{x}_j)^2 / I}}
$$
It is, of course, possible to use any weighting criterion for the columns, so that

$$
{}^{\text{scaled}}x_{ij} = {}^{j}w \, x_{ij}
$$

where ${}^{j}w$ is a weighting factor for column j. The weights may relate to noise content or standard deviations or significance of a variable. Fairly complex criteria can be employed. In the extreme case, if ${}^{j}w = 0$, this becomes a form of variable selection, which will be discussed in Section 6.2.4.
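The column weighting above can be sketched in a few lines of Python. The helper names `population_sd` and `weight_columns`, the data values and the choice of which column to treat as noise are all assumptions made for illustration. Weights equal to the reciprocal of the population standard deviation reproduce scaling by the standard deviation without centring, and a weight of zero removes a variable.

```python
import math


def population_sd(column):
    """Population standard deviation (divisor I) of a sequence of numbers."""
    mean = sum(column) / len(column)
    return math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))


def weight_columns(X, weights):
    """Multiply every element of column j by weights[j]."""
    if len(weights) != len(X[0]):
        raise ValueError("one weight is required per column")
    return [[w * v for w, v in zip(weights, row)] for row in X]


# Hypothetical data: three variables measured at three points in time.
X = [[100.0, 1.0, 0.2], [60.0, 3.0, -0.1], [20.0, 2.0, 0.1]]
columns = list(zip(*X))

# Weights of 1/sd scale by the standard deviation without mean-centring.
sd_weights = [1.0 / population_sd(col) for col in columns]
scaled = weight_columns(X, sd_weights)

# A zero weight on the third column (assumed here to be noise) acts as
# a simple form of variable selection.
selected = weight_columns(X, [1.0, 1.0, 0.0])
```

Because the data are not centred, the scaled columns keep their original sign and baseline, but each now has unit standard deviation.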
In rare and interesting cases it is possible to rank the size of the variables along each column. The suitability depends on the type of preprocessing performed first on the rows. However, a common method is to give the most intense reading in any column a value of I and the least intense 1. If the absolute values of each variable are not very meaningful, this procedure is an alternative that takes into account relative intensities. This procedure is exemplified by reference to the dataset C, and illustrated in Table 6.4.
1. Choose a region where the peaks elute, in this case from time 4 to 19 as suggested by the scores plot in Figure 6.15.
2. Scale the data in this region, so that each row is of a constant total.
3. Rank the data in each column, from 1 (low) to 16 (high).
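The three steps above can be sketched as follows. This is an illustration under stated assumptions: the function names, the small matrix `data` and the region limits are hypothetical, and ties are broken by row order, which the text does not specify. With dataset C the region would be times 4 to 19, giving ranks from 1 to 16.

```python
def row_scale(X, total=1.0):
    """Scale each row so that its elements sum to `total`.

    Assumes no row sums to zero; noisy baseline rows with negative
    values may violate this, which is one reason to select a region first.
    """
    return [[total * v / sum(row) for v in row] for row in X]


def rank_columns(X):
    """Replace each column by ranks from 1 (lowest) to I (highest).

    Ties are broken by row order, an arbitrary choice for this sketch.
    """
    n_rows, n_cols = len(X), len(X[0])
    ranked = [[0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        order = sorted(range(n_rows), key=lambda i: X[i][j])
        for rank, i in enumerate(order, start=1):
            ranked[i][j] = rank
    return ranked


# Hypothetical data: six points in time, three variables.
data = [
    [0.1, 0.2, 0.1],
    [1.0, 3.0, 0.5],
    [4.0, 6.0, 1.0],
    [3.0, 9.0, 6.0],
    [1.0, 2.0, 5.0],
    [0.2, 0.1, 0.1],
]

first, last = 2, 5                     # step 1: elution region, 1-based and inclusive
region = data[first - 1:last]
scaled_region = row_scale(region)      # step 2: each row sums to a constant total
ranks = rank_columns(scaled_region)    # step 3: rank within each column
```

The ranked matrix can then be passed to PCA in place of the raw intensities. Only the ordering within each column survives, so the absolute magnitude of a variable no longer influences the result.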
The PC scores and loadings plots are presented in Figure 6.17. Many similar conclusions can be deduced as in Figure 6.16. For example, the loadings arising from measurement C are close to the slowest eluting peak centred on times 14–16, whereas measurements A–F correspond mainly to the fastest eluting peak. When ranking variables it is unlikely that the resultant scores and loadings plots will fall on to a smooth geometric figure such as a circle or a line. However, this procedure can be useful for
Figure 6.15
Intensity profile (intensity versus datapoint) and unscaled scores and loadings of PC2 versus PC1 from the dataset in Table 6.3
Figure 6.16
Scores and loadings of PC2 versus PC1 after the data in Table 6.3 have been standardised
ion, but in order to study the fragmentation ions, a method such as standardisation described above is required to place equal significance on all the ions. Unfortunately, not only are perhaps 20 or so fragment ions increased in importance, but so are 200 or so ions that represent pure noise, so the data become worse, not better. Typically, out of 200–300 masses, there may be around 20 significant ones, and the aim of variable selection is to find these key measurements. However, too much variable reduction has the disadvantage that the dimensions of the multivariate matrices are reduced. It is important to find an optimum size as illustrated in Figure 6.18. What tricks can we use to remove irrelevant variables?