
Brereton Chemometrics


CHEMOMETRICS

 

 

Table 6.4 Method for ranking variables using the dataset in Table 6.4.

(a) Data between times 4 and 19, each row summed to a total of 1

Time   A      B      C      D      E      F      G      H
4      0.006  0.026  0.017  0.043  0.110  0.172  0.055  0.617
5      0.053  0.049  0.010  0.039  0.101  0.012  0.003  0.752
6      0.026  0.075  0.003  0.039  0.093  0.054  0.035  0.674
7      0.041  0.071  0.003  0.020  0.091  0.047  0.037  0.689
8      0.031  0.065  0.002  0.032  0.091  0.043  0.033  0.702
9      0.039  0.070  0.004  0.025  0.082  0.046  0.025  0.708
10     0.037  0.069  0.010  0.034  0.079  0.038  0.037  0.696
11     0.027  0.070  0.020  0.033  0.070  0.037  0.049  0.693
12     0.027  0.068  0.018  0.035  0.069  0.034  0.056  0.692
13     0.017  0.060  0.024  0.028  0.075  0.024  0.075  0.696
14     0.017  0.063  0.034  0.041  0.068  0.015  0.074  0.688
15     0.013  0.056  0.027  0.040  0.074  0.012  0.083  0.694
16     0.018  0.071  0.022  0.033  0.062  0.029  0.085  0.680
17     0.022  0.062  0.018  0.044  0.061  0.027  0.078  0.689
18     0.007  0.045  0.032  0.027  0.071  0.039  0.075  0.704
19     0.067  0.002  0.042  0.035  0.068  0.008  0.151  0.784

(b) Ranked data over these times

Time   A    B    C    D    E    F    G
4      1    2    2    15   16   16   8
5      15   4    3    12   15   2    1
6      8    16   6    11   14   15   4
7      14   15   5    2    13   14   6
8      11   9    4    6    12   12   3
9      13   13   7    3    11   13   2
10     12   11   8    9    10   10   5
11     9    12   11   8    6    9    7
12     10   10   9    10   5    8    9
13     4    6    13   5    9    5    12
14     5    8    16   14   4    4    10
15     3    5    14   13   8    3    14
16     6    14   12   7    2    7    15
17     7    7    10   16   1    6    13
18     2    3    15   4    7    11   11
19     16   1    1    1    3    1    16

Some simple methods, often used as an initial filter of irrelevant variables, are as follows; note that it is often important first to have performed baseline correction (Section 6.2.1).

1. Remove variables outside a given region; e.g. in mass spectrometry these may be at low or high m/z values, and in UV/vis spectroscopy there may be a significant wavelength range where there is no absorbance.

2. Sometimes it is possible to measure the noise content of the chromatograms for each variable simply by looking at the standard deviation of the noise region. The higher the noise, the less significant is the mass. This technique is useful in combination with other methods, often as a first step.
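The noise filter in point 2 can be sketched as follows. This is only a minimal illustration: it assumes the data are held as a list of rows (points in time) by columns (variables) and that a noise-only range of rows has already been identified by the analyst. The function name, the cut-off `max_noise` and the toy numbers are hypothetical and do not come from the text.

```python
from statistics import stdev


def noise_sd_per_variable(data, noise_rows):
    """Standard deviation of each column over rows assumed to contain only noise."""
    n_vars = len(data[0])
    return [stdev([data[i][j] for i in noise_rows]) for j in range(n_vars)]


# Toy matrix: 6 points in time x 3 variables (made-up numbers).
data = [
    [0.01, 0.20, 0.00],
    [-0.01, -0.30, 0.01],
    [0.00, 0.25, -0.01],
    [0.50, 0.40, 0.30],
    [0.90, 0.10, 0.60],
    [0.40, -0.20, 0.25],
]
noise_rows = range(0, 3)  # assumed noise-only region (first three rows)
max_noise = 0.05          # hypothetical cut-off chosen by the analyst

sd = noise_sd_per_variable(data, noise_rows)
# Keep only the variables whose noise standard deviation is below the cut-off.
kept = [j for j, s in enumerate(sd) if s <= max_noise]
print(kept)
```

In this toy example the second variable is rejected because it fluctuates strongly in the region where no compound is assumed to elute.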

EVOLUTIONARY SIGNALS

[Figure: two panels. One panel is a scores plot of PC2 versus PC1, with points labelled by time (4 to 19); the other is a loadings plot of PC2 versus PC1, with points labelled by variable (A to H).]

Figure 6.17 Scores and loadings of PC2 versus PC1 of the ranked data in Table 6.4
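The ranking step that turns part (a) of Table 6.4 into part (b) can be sketched as below, using column A of the table. The tie-breaking rule (equal values are ranked in order of appearance) is an assumption, since the text does not state one; the function name is hypothetical.

```python
def ordinal_ranks(values):
    """Rank values from 1 (smallest) to n (largest); ties keep their original order."""
    # sorted() is stable, so equal values stay in order of appearance.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


# Column A of Table 6.4(a), times 4 to 19.
col_a = [0.006, 0.053, 0.026, 0.041, 0.031, 0.039, 0.037, 0.027,
         0.027, 0.017, 0.017, 0.013, 0.018, 0.022, 0.007, 0.067]

ranks_a = ordinal_ranks(col_a)
print(ranks_a)
```

With this tie-breaking rule the result agrees with column A of Table 6.4(b).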


[Figure: schematic with three stages of variable reduction. Raw data: sparse data matrix, the majority is noise. Optimum size. Too few variables.]

Figure 6.18 Optimum size for variable reduction

Many methods then use simple functions of the data, choosing the variables with the highest values.

1. The very simplest is to order the variables according to their mean, $\bar{x}_j$, which is the average of column $j$. If all the measurements are on approximately the same scale, such as in many forms of spectroscopic detection, this is a good approach, but it is less useful if there remain significant background peaks or if there are dominant high intensity peaks, such as molecular ions, that are not necessarily very diagnostic.

2. A variant is to employ the variance $v_j$ (or standard deviation). Large peaks that do not vary much may have a small standard deviation. However, this depends crucially on determining a region of the data where compounds are present, and if noise regions are included, this method will often fail.

3. A compromise is to select peaks according to a criterion of variance over mean, $v_j / \bar{x}_j$. This may pick some less intense measurements that vary through interesting regions of the data. Intense peaks may still have a large variance but this might not be particularly significant relative to the average intensity. The problem with this approach, however, is that some measurements that are primarily noise could have a mean close to zero, so the ratio becomes large and they will be accidentally selected. To prevent this, first remove the noisy variables by another method and then, from those remaining, select the variables with the highest relative variance. Variables can have low noise but still be uninteresting if they correspond, for example, to solvent or base peaks.

4. A modification of the method in point 3 is to select peaks using a criterion of $v_j / (\bar{x}_j + e)$, where $e$ relates to the noise level. The advantage of this is that variables with low means are not accidentally selected. Of course, the value of $e$ must be carefully chosen.
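The four criteria above can be compared on a small made-up example. The sketch below assumes a rows-by-columns list of intensities and uses the population variance. The function names, the value of `e` and the toy data are hypothetical, chosen so that the third variable is noise with a mean close to zero.

```python
from statistics import mean, pvariance


def selection_scores(data, e):
    """Return one dict of criteria per variable (column) of a rows-by-columns list."""
    scores = []
    for j in range(len(data[0])):
        col = [row[j] for row in data]
        m = mean(col)
        v = pvariance(col)
        scores.append({
            "mean": m,                                             # point 1
            "variance": v,                                         # point 2
            "variance_over_mean": v / m if m != 0 else float("inf"),  # point 3
            "damped_ratio": v / (m + e),                           # point 4, assumes m + e > 0
        })
    return scores


def best(scores, criterion):
    """Index of the variable with the highest value of the given criterion."""
    return max(range(len(scores)), key=lambda j: scores[j][criterion])


# Columns: intense but nearly constant peak, moderate varying peak, noise around zero.
data = [
    [10.0, 0.5, 0.30],
    [10.2, 2.0, -0.30],
    [9.8, 3.5, 0.30],
    [10.0, 2.0, -0.29],
]
scores = selection_scores(data, e=0.5)  # e is a hypothetical noise-level constant
for criterion in ("mean", "variance", "variance_over_mean", "damped_ratio"):
    print(criterion, best(scores, criterion))
```

On these numbers the mean favours the intense constant peak, the plain ratio is captured by the noise variable, and the damped ratio returns to the genuinely varying peak, which is the behaviour described in points 3 and 4.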

There are no general guidelines as to how many variables should be selected; some people use statistical tests, others cut off the selection according to what appears sensible or manageable. The optimum method depends on the technique employed and the general features of a particular source of data.

There are numerous other approaches, for example looking at the smoothness of variables, correlations between successive points, and so on. In some cases, after selecting variables, contiguous variables can then be combined into a smaller number of very significant (and less noisy) measurements; this could be valuable in the case of LC–NMR or GC–IR, where neighbouring variables often correspond to peaks in a spectrum of a single compound in a mixture, but is unlikely to be valuable in HPLC–DAD, where there are often contiguous regions of a spectrum that correspond to different compounds. Sometimes features of variables, such as the change in relative intensity over a peak cluster, can also be taken into account; variables diagnostic for an individual compound are likely to vary in a smooth and predictable way, whereas those due to noise will vary in a random manner.

For each type of coupled chromatography (and indeed for any technique where chemometric methods are employed) there are specific methods for variable selection. In some cases such as LC–MS this is a crucial first step prior to further analysis, whereas in the case of HPLC–DAD it is often less essential, and omitted.

6.3 Determining Composition

After exploring data via PC plots, baseline correction, preprocessing, scaling and variable selection, as required, the next step is normally to look at the composition of different regions of the chromatograms. Most chemometric techniques try to identify pure variables that are associated with one specific component in a mixture. In chromatography these are usually regions in time where a single compound elutes, although they can also be measurements such as an m/z value characteristic of a single compound or a peak in IR spectroscopy. Below we will concentrate primarily on methods for determining pure variables in the chromatographic direction, but many can be modified fairly easily for spectroscopy. There is an enormous battery of techniques but below we summarise the main groups of approaches.

6.3.1 Composition

The concept of composition is an important one. There are many alternative ways of expressing the same idea; that of rank, which derives from matrices, is also popular: ideally the rank of a matrix equals the number of independent components or nonzero eigenvalues.

A region of composition 0 contains no compounds, one of composition 1 one compound, and so on. Composition 1 regions are also selective or pure regions. A complexity arises in that because of noise, a matrix over a region of composition 1 will not necessarily be described by only one PC, and it is important to try to identify how many PCs are significant and correspond to real information.

There are many cases of varying difficulty. Figure 6.19 illustrates four cases. Case (a) is the most studied and easiest, in which each peak has a composition 1 or selective region. Although not the hardest of problems, there is often considerable value in the application of chemometrics techniques in such a situation. For example, there may be a requirement for quantification in which the complete peak profile is required, including the area of each peak in the region of overlap. The spectra of the compounds might not be very clear and chemometrics can improve the quality. In complex peak clusters it might simply be important to identify how many compounds are present, which regions are pure, what the spectra in the selective regions are and whether it is necessary to improve the chromatography. Finally, this has potential in the area of automation of pulling out spectra from chromatograms containing several compounds where there is some overlap. Case (b) involves a peak cluster where one or more peaks do not have a selective region. Case (c) is of an embedded impurity peak and is surprisingly common. Many modern separations involve asymmetric peaks such as in case (d), and many conventional chemometrics methods fail under such circumstances.

To understand the problem, it is possible to produce a graph of ratios of intensities between the various components. For ideal noise-free peakshapes corresponding to the four cases above, these are presented in Figure 6.20. Note the use of a logarithmic scale, as the ideal ratios will vary over a large range. Case (a) corresponds to two Gaussian peaks (for more information about peakshapes, see Chapter 3, Section 3.2.1) and is straightforward. In case (b), the ratios of the first to second and of the second to third peaks are superimposed, and it can be seen that the rate of change is different for each pair of peaks; this relates to the different separation of the elution time maxima. Note that there is a huge dynamic range, which is due to noise-free simulations being used. Case (c) is typical of an embedded peak, showing a purity maximum for the smaller component in the centre of its elution. Finally, the ratio arising from case (d) is typical of tailing peaks; many multivariate methods cannot cope easily with this type of data. However, these graphs are of ideal situations and, in practice, small effects can be observed only if the data are of an appropriate quality. In reality, measurement noise and spectral similarity limit data quality. It is only realistic to detect two (or more) components if the ratios of intensities of the two peaks are within a certain range, for example no more than 50:1, as indicated by region a in Figure 6.21. Outside these limits, it is unlikely that a second component will be detected. In addition, when the intensity of the signal is sufficiently low (say 1% of the maximum, outside region b in Figure 6.21), the signal may be swamped by noise, and so no signal is detected. Region a would appear to be of composition 2, the remainder of region b composition 1 and the chromatogram outside region b composition 0. If noise levels are higher, these regions become narrower.

[Figure: four chromatographic peak clusters, panels (a) to (d).]

Figure 6.19 Different types of problems in chromatography
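The reasoning about regions a and b can be illustrated with two simulated noise-free Gaussian peaks. This is a sketch under stated assumptions rather than a method given in the text: the peak centres and widths are made up, the 50:1 ratio limit and the 1% intensity limit are the example values quoted above, and points inside region b but outside region a are taken to be of composition 1.

```python
import math


def gaussian(t, centre, width, height=1.0):
    """Ideal Gaussian peakshape evaluated at time t."""
    return height * math.exp(-((t - centre) ** 2) / (2 * width ** 2))


def apparent_composition(times, peak1, peak2, max_ratio=50.0, min_fraction=0.01):
    """Apparent composition (0, 1 or 2) at each time for two overlapping peaks."""
    p1 = [gaussian(t, *peak1) for t in times]
    p2 = [gaussian(t, *peak2) for t in times]
    total = [a + b for a, b in zip(p1, p2)]
    threshold = min_fraction * max(total)  # lower limit of region b
    result = []
    for a, b, s in zip(p1, p2, total):
        if s <= threshold:
            result.append(0)               # outside region b: signal swamped by noise
        elif a > 0 and b > 0 and 1 / max_ratio <= a / b <= max_ratio:
            result.append(2)               # inside region a: both components detectable
        else:
            result.append(1)               # inside b but outside a: one component
    return result


times = range(1, 31)
# Hypothetical peaks given as (centre, width) in datapoints.
comp = apparent_composition(times, peak1=(12.0, 2.0), peak2=(17.0, 2.0))
print(comp)
```

Lowering `max_ratio` or raising `min_fraction`, which mimics poorer data quality, narrows the composition 2 and composition 1 regions respectively.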

Below we indicate a number of approaches for the determination of composition.

6.3.2 Univariate Methods

By far the simplest are univariate approaches. It is important not to overcomplicate a problem if not justified by the data. Most conventional chromatography software contains methods for estimating ratios between peak intensities. If two spectra are sufficiently dissimilar then this method can work well. The measurements most diagnostic for each compound can be chosen by a number of means. For the data in Table 6.1 we can look at the loadings plot of Figure 6.4. At first glance it may appear that measurements C and G are most appropriate, but this is not so. The problem is that the most diagnostic wavelengths for one compound may correspond to zero or very low intensity for the other one. This would mean that there will be regions of the chromatogram where one number is close to zero or even negative (because of noise), leading to very large or negative ratios. Measurements that are characteristic of both compounds but exhibit distinctly different features in each case are better. Figure 6.22(a) plots the ratio of intensities of variables D to F. Initially this plot looks slightly discouraging, but that is because there are noise regions where almost any ratio could be obtained.

[Figure: four panels, (a) to (d), each plotting log(ratio); panel (b) shows two curves, labelled "First to second" and "Second to third".]

Figure 6.20 Ratios of peak intensities for the case studies (a)–(d) assuming ideal peakshapes and peaks detectable over an indefinite region

Cutting the region down to times 5–18 (within which range the intensities of both variables are positive) improves the situation. It is also helpful to consider the best way to display the graph, the reason being that a ratio of 2:1 is no more significant than a ratio of 1:2, yet using a linear scale there is an arbitrary asymmetry; for example, moving from 0 to 50% of compound one may change the ratio from 0.1 to 1, but moving on from 50% it could change from 1 to 10. To overcome this, either use a logarithmic scale [Figure 6.22(b)] or take the smaller of the ratios D:F and F:D [Figure 6.22(c)].
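The two symmetric ways of displaying a ratio can be written down directly. The sketch below assumes both intensities are strictly positive, which is why the noise regions have to be trimmed first; the function names are hypothetical.

```python
import math


def log_ratio(d, f):
    """Base-10 logarithm of the ratio d:f; swapping d and f only changes the sign."""
    return math.log10(d / f)


def smaller_ratio(d, f):
    """The smaller of d:f and f:d, which always lies between 0 and 1."""
    return min(d / f, f / d)


# A 2:1 ratio and a 1:2 ratio are treated symmetrically by both displays.
print(log_ratio(2.0, 1.0), log_ratio(1.0, 2.0))
print(smaller_ratio(2.0, 1.0), smaller_ratio(1.0, 2.0))
```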

The peak ratio plots suggest that there is a composition 2 region starting between times 9 and 10 and finishing between times 14 and 15. There is some ambiguity about the exact start and end, largely because there is noise imposed upon the data. In some cases peak ratio plots are very helpful, but they do depend on having adequate signal-to-noise ratios and finding suitable variables. If a spectrum is monitored over 200 wavelengths this may not be so easy, and approaches that use all the wavelengths may be more successful. In addition, good diagnostic measurements are required, noise regions have to be eliminated and also the graphs can become complicated if there are several compounds in a portion of the chromatogram. An ideal situation would be to calculate several peak ratios simultaneously, but this then suggests that multivariate methods, as described below, have an important role to play.

[Figure: chromatogram over datapoints 1 to 30 with regions a and b marked.]

Figure 6.21 Regions of chromatogram (a) in Figure 6.19. Region a is where the ratio of the two components is between 50:1 and 1:50 and region b where the overall intensity is more than 1% of the maximum intensity

Another simple trick is to sum the data to constant total at each point in time, as described above, so as to obtain values of

$$ {}^{rs}x_{ij} = \frac{x_{ij}}{\sum_{j=1}^{J} x_{ij}} $$

Provided that noise regions are discarded, the relative intensities of diagnostic wavelengths should change according to the composition of the data. Unlike using ratios of intensities, we are able to choose strongly associated wavelengths such as C and G, as an intensity of zero (or even a small negative number) will not unduly influence the appearance of the graph, given in Figure 6.23. The regions of composition 1 are somewhat flatter but influenced by noise, but where the relative intensity changes most is in the composition 2 region.
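Summing each row to a constant total is a one-line operation per point in time. The sketch below is a minimal illustration with made-up numbers, assuming a rows-by-columns list from which noise regions have already been discarded; the function name is hypothetical.

```python
def row_scale(data):
    """Divide every element by the sum of its row, so that each row totals 1."""
    scaled = []
    for row in data:
        total = sum(row)
        if total == 0:
            raise ValueError("row sums to zero; discard noise regions first")
        scaled.append([x / total for x in row])
    return scaled


# Toy data: 3 points in time x 2 variables (made-up numbers).
data = [
    [1.0, 3.0],
    [2.0, 2.0],
    [6.0, 2.0],
]
scaled = row_scale(data)
print(scaled)
```

Plotting one column of `scaled` against time gives the kind of relative intensity curve discussed above: flat where one compound dominates and changing where two compounds overlap.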

This approach is not ideal, but the graphs of Figures 6.22 and 6.23 are intuitively easy for the practising chromatographer (or spectroscopist) and result in the creation of a form of purity curve. The appearance of the curves can be enhanced by selecting variables that are least noisy and then calculating the relative intensity curves as a function of time for several (rather than just two) variables. Some will increase and others decrease according to whether the variable is most associated with the fastest or slowest eluting compound. Reversing the curve for one set of variables results in several superimposed purity curves, which can be averaged to give a good picture of changes over the chromatogram.

These methods can be extended to cases of embedded peaks, in which the purest point for the embedded peak does not correspond to a selective region; a weakness of using this method of ratios is that it is not always possible to determine whether a maximum (or minimum) in the purity curve is genuinely a consequence of a composition 1 region or simply the portion of the chromatogram where the concentration of one analyte is highest.

Such simple approaches can become rather messy when there are several compounds in a cluster, especially if the spectra are similar, but in favourable cases they are very effective.

 

[Figure: panels (a) and (b), each plotting the intensity ratio against datapoint; panel (b) uses a logarithmic scale.]

Figure 6.22 Ratio of intensity of measurements D to F for the data in Table 6.1. (a) Raw information; (b) logarithmic scale between points 5 and 18; (c) the minimum of the ratio of intensity D:F and F:D between points 5 and 18
