Brereton Chemometrics


352	CHEMOMETRICS

Figure 6.11 Scores corresponding to Figure 6.10(b) but normalised over three PCs and presented in three dimensions (axes: PC1, PC2, PC3)

involves summing each row to a constant total. Put mathematically:

$$ {}^{rs}x_{ij} = \frac{x_{ij}}{\sum_{j=1}^{J} x_{ij}} $$

Note that some people call this normalisation, but we will avoid that terminology, as this method is distinct from that in Section 6.2.2. The influence on PC scores plots has already been introduced (Chapter 4, Section 4.3.6.2) but will be examined in more detail in this chapter.

Figure 6.12(a) shows what happens if the rows of dataset A are first scaled to a constant total and PCA is then performed on the scaled data. At first glance this appears rather discouraging, but that is because the noise points have a disproportionate influence. These points contain largely nonsensical data, which is emphasised when scaling each point in time to the same total. An expansion of points 5–19 is slightly more encouraging [Figure 6.12(b)], but still not very good. Performing PCA only on points 5–19 (after scaling the rows as described above), however, provides a very clear picture of what is happening: all the points fall roughly on a straight line, with the purest points at the ends [Figure 6.12(c)]. Unlike normalising the scores after PCA (Section 6.2.2), where the data must fall exactly on a geometric figure such as a circle or sphere (dependent on the number of PCs chosen), the straight line is only approximate and depends on there being two components in the region of the data that has been chosen.
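The row scaling followed by PCA over a selected region can be sketched in Python with NumPy. This is a minimal illustration, not the calculation used for Figure 6.12: the data are synthetic (two overlapping Gaussian elution profiles with random spectra, not dataset A), the function names are hypothetical, and PCA is assumed to be uncentred and computed by SVD.

```python
import numpy as np

def scale_rows_to_constant_total(X, total=1.0):
    """Divide each row by its sum so that every row adds up to `total`.

    Rows whose sum is close to zero (pure noise) give unstable results,
    which is why a region containing real signal is selected first.
    """
    X = np.asarray(X, dtype=float)
    row_sums = X.sum(axis=1, keepdims=True)
    return total * X / row_sums

def pca_scores(X, n_components=2):
    """Uncentred PCA via SVD (an assumption); returns the scores U * S."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

# Synthetic two-component data: 25 points in time, 12 variables (hypothetical).
rng = np.random.default_rng(0)
t = np.arange(1, 26)
c1 = np.exp(-0.5 * ((t - 10) / 2.5) ** 2)   # faster eluting compound
c2 = np.exp(-0.5 * ((t - 14) / 2.5) ** 2)   # slower eluting compound
C = np.column_stack([c1, c2])
S = rng.uniform(0.1, 1.0, size=(2, 12))     # two "spectra"
X = C @ S + rng.normal(scale=0.005, size=(25, 12))

# Keep only points 5-19 (0-based rows 4..18), scale the rows, then do PCA.
region = slice(4, 19)
X_rs = scale_rows_to_constant_total(X[region])
scores = pca_scores(X_rs, n_components=2)
```

Plotting the second column of `scores` against the first should give points lying close to a straight line, under the assumption that only two components are present in the selected region.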

The corresponding scores plot for the first two PCs of dataset B, using points 5–20, is presented in Figure 6.13(a). There are now two linear regions, one between compounds

EVOLUTIONARY SIGNALS	353

Figure 6.12 Scores plot of dataset A, each row summed to a constant total, PC2 versus PC1: (a) entire dataset; (b) expansion of region datapoints 5 to 19; (c) performing the scaling and then PCA exclusively over points 5 to 19

A (fastest) and B, and another between compounds B and C (slowest). Some important features are of interest. The first is that there are now three main directions in the graph, but the direction due to B is unlikely to represent the pure compound, and probably the line would need to be extended further along the top right-hand corner. However, it looks likely that there is only a small or negligible region where the three components co-elute, otherwise the graph could not easily be characterised by two straight lines. The trends are clearer in three dimensions [Figure 6.13(b)]. Note that the point at time 5 is probably influenced by noise.

Summing each row to a constant total is not the only method of dealing with individual rows or spectra. Two variations below can be employed.

1. Selective summation to constant total. This allows each portion of a row to be scaled to a constant total, for example it might be interesting to scale the wavelengths 200–300, 400–500 and 500–600 nm each to 1. Or perhaps the wavelengths 200–300 nm are more diagnostic than the others, so why not scale these to a total of 5, and the others to a total of 1? Sometimes more than one type of measurement can be used to study an evolutionary process, such as UV/vis and MS, and each data block could be scaled to a constant total. When doing selective summation it is important to consider very carefully the consequences of preprocessing.

2. Scaling to a base peak. In some forms of measurement, such as mass spectrometry (e.g. LC–MS or GC–MS), it is possible to select a base peak and scale to this; for

Figure 6.13 Scores plot of dataset B with rows summed to a constant total between times 5 and 20 and three main directions (A, B and C) indicated: (a) two PCs; (b) three PCs

example, if the aim is to analyse the LC–MS results for two isomers, ratioing to the molecular ion can be performed, so that

$$ {}^{scaled}x_{ij} = \frac{x_{ij}}{x_{i(\text{molecular ion})}} $$

In certain cases the molecular ion can then be discarded. This method of preprocessing can be used to investigate how the ratio of fragment ions varies across a cluster.
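The two row-scaling variations above can be sketched as follows. This is an illustrative sketch only: the function names, the small example matrix and the choice of column blocks are hypothetical and are not taken from the text.

```python
import numpy as np

def selective_sum_scale(X, blocks, totals):
    """Scale each block of columns, row by row, to its own constant total.

    blocks: list of slices or index arrays selecting columns.
    totals: the target total for each block, in the same order.
    Columns that belong to no block are left unchanged.
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for cols, total in zip(blocks, totals):
        block_sum = X[:, cols].sum(axis=1, keepdims=True)
        out[:, cols] = total * X[:, cols] / block_sum
    return out

def scale_to_base_peak(X, base_col, drop_base=False):
    """Divide every row by its value in the base-peak column."""
    X = np.asarray(X, dtype=float)
    scaled = X / X[:, [base_col]]
    if drop_base:
        # After ratioing, the base column is all ones and can be discarded.
        scaled = np.delete(scaled, base_col, axis=1)
    return scaled

# Hypothetical data: 3 points in time, 5 variables; column 4 plays the
# role of the molecular ion.
X = np.array([[2.0, 6.0, 1.0, 3.0, 10.0],
              [1.0, 1.0, 2.0, 2.0, 4.0],
              [4.0, 4.0, 5.0, 5.0, 20.0]])

# Columns 0-1 scaled to a total of 5, columns 2-3 to a total of 1.
X_sel = selective_sum_scale(X, [slice(0, 2), slice(2, 4)], [5.0, 1.0])

# Ratio every variable to the molecular ion, then discard it.
X_bp = scale_to_base_peak(X, base_col=4, drop_base=True)
```

Giving one block a larger total than another weights it more heavily in any subsequent PCA, which is why the consequences of the chosen totals need careful thought.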


6.2.3.2 Scaling the Columns

In many cases it is useful to scale along the columns, e.g. each wavelength or mass number or spectral frequency. This can be used to put all the variables on a similar scale.

Mean centring, involving subtracting the mean of each column, is the simplest method. Many PC packages do this automatically, but in signal analysis it is often inappropriate, because the interest is in variability above the baseline rather than around an average.

Standardisation is a common technique that has already been discussed (Chapter 4, Section 4.3.6.4) and is sometimes called autoscaling. It can be mathematically described by

$$ {}^{stand}x_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{\sum_{i=1}^{I} (x_{ij} - \bar{x}_j)^2 / I}} $$

where there are I points in time and $\bar{x}_j$ is the average of variable j. Note that it is conventional to divide by I rather than I − 1 in this application; if doing the calculations, check whether the package defaults to the 'population' or the 'sample' standard deviation. Matlab users should be careful when performing this scaling. Standardisation can be useful, for example, in mass spectrometry, where the variation of an intense peak (such as a molecular ion of isomers) is no more significant than that of a much less intense peak, such as a significant fragment ion. However, standardisation will also emphasise variables that are pure noise, and if there are, for example, 200 mass numbers of which 180 correspond to noise, this could substantially degrade the analysis.
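A short NumPy sketch of standardisation with the population standard deviation is given here, together with an illustration of the noise problem. The data are synthetic and the function name is hypothetical. NumPy's `std` divides by I by default (`ddof=0`), whereas Matlab's `std` divides by I − 1 by default, so the divisor is stated explicitly.

```python
import numpy as np

def standardise(X):
    """Mean centre each column and divide by its population standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=0)   # ddof=0 divides by I, not I - 1
    return (X - mean) / std

# Synthetic example: 25 points in time, 200 variables, of which
# 20 carry signal and 180 are pure noise (numbers chosen to match the text).
rng = np.random.default_rng(1)
n_points = 25
profile = np.exp(-0.5 * ((np.arange(n_points) - 12) / 3.0) ** 2)
signal = np.outer(profile, rng.uniform(1.0, 100.0, size=20))
noise = rng.normal(scale=0.01, size=(n_points, 180))
X = np.hstack([signal, noise])

Z = standardise(X)

# Share of the total variance carried by the 180 noise columns,
# before and after standardisation.
raw_var = X.var(axis=0)
std_var = Z.var(axis=0)
raw_noise_share = raw_var[20:].sum() / raw_var.sum()
std_noise_share = std_var[20:].sum() / std_var.sum()
```

After standardisation every column has unit variance, so in this sketch the 180 noise columns account for 180/200 of the total variance, although they are negligible in the raw data.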

The most dramatic change is normally to the loadings plot. Figure 6.14 illustrates this for dataset B. The scores plot hardly changes in appearance. The loadings plot, however, has changed considerably and is much clearer and more spread out than in Figure 6.6.

Standardisation is most useful if the magnitudes of the variables are very different, as might occur in LC–MS. Table 6.3 is of dataset C, which consists of 25 points in time and eight measurements, making a 25 × 8 data matrix. As can be seen, the magnitude of the measurements is different, with variable H having a maximum of 100, but others being much smaller. We assume that the variables are not in a particular sequence, or are not best represented sequentially, so the loadings graphs will consist of a series of points that are not joined up. Figure 6.15 is of the raw profile together with scores and loadings plots. The scores plot suggests that there are two components in the mixture, but the loadings are not very well distinguished and are dominated by variable H. Standardisation (Figure 6.16) largely retains the pattern in the scores plot but the loadings change radically in appearance, and in this case fall approximately on a circle because there are two main components in the mixture. The variables corresponding most to each pure component fall at the ends of the circle. It is important to recognise that this pattern is an approximation and will only happen if there are two main components, otherwise the loadings will fall on to the surface of a sphere (if three PCs are employed and there are three compounds in the mixture) and so on. However, standardisation can have a remarkable influence on the appearance of loadings plots.

Figure 6.14 Scores and loadings of PC2 versus PC1 after dataset B has been standardised


Table 6.3 Two-way dataset C.

	A	B	C	D	E	F	G	H

1	0.407	0.149	0.121	0.552	−0.464	0.970	0.389	−0.629
2	0.093	−0.062	0.084	−0.015	−0.049	0.178	0.478	1.073
3	0.044	0.809	0.874	0.138	0.529	−1.180	0.040	1.454
4	−0.073	0.307	−0.205	0.518	1.314	2.053	0.658	7.371
5	1.461	1.359	−0.272	1.087	2.801	0.321	0.080	20.763
6	1.591	4.580	0.207	2.381	5.736	3.334	2.155	41.393
7	4.058	7.030	0.280	2.016	9.001	4.651	3.663	67.949
8	4.082	8.492	0.304	4.180	11.916	5.705	4.360	92.152
9	5.839	10.469	0.529	3.764	12.184	6.808	3.739	105.228
10	5.688	10.525	1.573	5.193	12.100	5.720	5.621	106.111
11	3.883	10.111	2.936	4.802	10.026	5.292	7.061	99.404
12	3.630	9.139	2.356	4.739	9.257	4.478	7.530	92.409
13	2.279	8.052	3.196	3.777	9.926	3.228	10.012	92.727
14	2.206	7.952	4.229	5.118	8.629	1.869	9.403	86.828
15	1.403	5.906	2.867	4.229	7.804	1.234	8.774	73.230
16	1.380	5.523	1.720	2.529	4.845	2.249	6.621	52.831
17	0.991	2.820	0.825	1.986	2.790	1.229	3.571	31.438
18	0.160	0.993	0.715	0.591	1.594	0.880	1.662	15.701
19	0.562	−0.018	−0.348	−0.290	0.567	0.070	1.257	6.528
20	0.590	−0.308	−0.715	0.490	0.384	0.595	0.409	2.657
21	0.309	0.371	−0.394	0.077	−0.517	0.434	−0.250	0.551
22	−0.132	−0.081	−0.861	−0.279	−0.622	−0.640	1.166	0.079
23	0.371	0.342	−0.226	0.374	−0.284	0.177	−0.751	−0.197
24	−0.215	−0.577	−0.297	0.834	0.720	−0.248	0.470	−1.053
25	−0.051	0.608	−0.070	−0.087	−0.068	−0.537	−0.208	0.601

Sometimes weighting by the standard deviation can be performed without centring, so that

$$ {}^{scaled}x_{ij} = \frac{x_{ij}}{\sqrt{\sum_{i=1}^{I} (x_{ij} - \bar{x}_j)^2 / I}} $$

It is, of course, possible to use any weighting criterion for the columns, so that

$$ {}^{scaled}x_{ij} = {}^{j}w \, x_{ij} $$

where w is a weighting factor. The weights may relate to noise content or standard deviations or significance of a variable. Fairly complex criteria can be employed. In the extreme if w = 0, this becomes a form of variable selection, which will be discussed in Section 6.2.4.
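Column weighting can be sketched in NumPy as a multiplication of each column by its own weight. The function names, the small matrix and the weights below are hypothetical and serve only as an illustration; scaling by the standard deviation without centring is treated as the special case where the weight is the reciprocal of the population standard deviation.

```python
import numpy as np

def weight_columns(X, w):
    """Multiply column j of X by the weight w[j]."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    return X * w   # broadcasting applies w[j] to every row of column j

def scale_by_std_without_centring(X):
    """Divide each column by its population standard deviation; no centring."""
    X = np.asarray(X, dtype=float)
    return weight_columns(X, 1.0 / X.std(axis=0, ddof=0))

# Hypothetical data: 3 points in time, 3 variables on different scales.
X = np.array([[1.0, 10.0, 0.1],
              [2.0, 30.0, 0.3],
              [3.0, 20.0, 0.2]])

X_sd = scale_by_std_without_centring(X)

# Arbitrary weights; a weight of zero removes the variable's influence,
# and dropping such columns amounts to variable selection.
w = np.array([1.0, 0.5, 0.0])
X_w = weight_columns(X, w)
X_kept = X_w[:, w != 0]
```

In this sketch the columns of `X_sd` have unit standard deviation but keep non-zero means, which is the difference from full standardisation.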

In rare and interesting cases it is possible to rank the size of the variables along each column. The suitability depends on the type of preprocessing performed first on the rows. However, a common method is to give the most intense reading in any column a value of I and the least intense 1. If the absolute values of each variable are not very meaningful, this procedure is an alternative that takes into account relative intensities. This procedure is exemplified by reference to the dataset C, and illustrated in Table 6.4.

1. Choose a region where the peaks elute, in this case from time 4 to 19 as suggested by the scores plot in Figure 6.15.

2. Scale the data in this region, so that each row is of a constant total.

3. Rank the data in each column, from 1 (low) to 16 (high).
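The three steps above can be sketched as follows. This is a sketch under stated assumptions: the data are synthetic (a 25 × 8 two-component mixture, not the values of Table 6.3), the function name is hypothetical, and ties are not treated specially, which is harmless for continuous noisy data.

```python
import numpy as np

def rank_columns(X):
    """Rank each column from 1 (least intense) to I (most intense)."""
    X = np.asarray(X, dtype=float)
    # Applying argsort twice converts values into 0-based ranks per column.
    return np.argsort(np.argsort(X, axis=0), axis=0) + 1

# Synthetic stand-in for a 25 x 8 dataset (not Table 6.3).
rng = np.random.default_rng(2)
t = np.arange(1, 26)
c1 = np.exp(-0.5 * ((t - 9) / 3.0) ** 2)
c2 = np.exp(-0.5 * ((t - 14) / 3.0) ** 2)
S = rng.uniform(0.1, 1.0, size=(2, 8))
X = np.column_stack([c1, c2]) @ S + rng.normal(scale=0.005, size=(25, 8))

# Step 1: choose the region where the peaks elute, times 4 to 19
# (0-based rows 3..18, i.e. 16 points).
X_region = X[3:19]

# Step 2: scale each row in the region to a constant total of 1.
X_rs = X_region / X_region.sum(axis=1, keepdims=True)

# Step 3: rank the data in each column, from 1 (low) to 16 (high).
ranks = rank_columns(X_rs)
```

The matrix `ranks` can then be passed to PCA in place of the scaled intensities.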

The PC scores and loadings plots are presented in Figure 6.17. Many similar conclusions can be deduced as in Figure 6.16. For example, the loadings arising from measurement C are close to the slowest eluting peak centred on times 14–16, whereas measurements A–F correspond mainly to the fastest eluting peak. When ranking variables it is unlikely that the resultant scores and loadings plots will fall on to a smooth geometric figure such as a circle or a line. However, this procedure can be useful for

Figure 6.15 Intensity profile (intensity versus datapoint) and unscaled scores and loadings of PC2 versus PC1 from dataset in Table 6.3

exploratory graphical analysis, especially if the dataset is fairly complex with several different compounds and also many measurements on different intensity scales.

It is, of course, possible to scale both the rows and columns simultaneously, first by scaling the rows and then the columns. Note that the reverse (scaling the columns first) is rarely useful and standardisation followed by summing to a constant total has no physical meaning.

6.2.4 Variable Selection

Variable selection has an important role throughout chemometrics, but will be described below in the context of coupled chromatography. This involves keeping only a portion of the original measurements, selecting only those such as wavelengths or masses that are most relevant to the underlying problem. There are a huge number of combinations of approaches limited only by the imagination of the chromatographer or spectroscopist. In this section we give only a brief summary of some of the main methods. Often several steps are combined.

Variable selection is particularly important in LC–MS and GC–MS. Raw data form what is sometimes called a sparse data matrix, in which the majority of data points are zero or represent noise. In fact, only a small percentage (perhaps 5 % or less) of the measurements are of any interest. The trouble with this is that if multivariate methods are applied to the raw data, often the results are nonsense, dominated by noise. Consider the case of performing LC–MS on two closely eluting isomers, whose fragment ions are of principal interest. The most intense peak might be the molecular

Figure 6.16 Scores and loadings of PC2 versus PC1 after the data in Table 6.3 have been standardised

ion, but in order to study the fragment ions, a method such as standardisation described above is required to place equal significance on all the ions. Unfortunately, not only are perhaps 20 or so fragment ions increased in importance, but so are 200 or so ions that represent pure noise, so the data become worse, not better. Typically, out of 200–300 masses, there may be around 20 significant ones, and the aim of variable selection is to find these key measurements. However, too much variable reduction has the disadvantage that the dimensions of the multivariate matrices are reduced. It is important to find an optimum size as illustrated in Figure 6.18. What tricks can we use to remove irrelevant variables?

