Brereton Chemometrics


CHEMOMETRICS

 

 

Figure 3.11 Two closely overlapping peaks together with their first and second derivatives

SIGNAL PROCESSING

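Table 3.6 Savitsky–Golay coefficients $c_{i+j}$ for derivatives.

First derivatives

              | Quadratic          | Cubic/quartic
Window size   |   5 |   7 |   9    |   5 |   7 |    9
j = −4        |     |     |  −4    |     |     |    86
j = −3        |     |  −3 |  −3    |     |  22 |  −142
j = −2        |  −2 |  −2 |  −2    |   1 | −67 |  −193
j = −1        |  −1 |  −1 |  −1    |  −8 | −58 |  −126
j = 0         |   0 |   0 |   0    |   0 |   0 |     0
j = 1         |   1 |   1 |   1    |   8 |  58 |   126
j = 2         |   2 |   2 |   2    |  −1 |  67 |   193
j = 3         |     |   3 |   3    |     | −22 |   142
j = 4         |     |     |   4    |     |     |   −86
Normalisation |  10 |  28 |  60    |  12 | 252 |  1188

Second derivatives

              | Quadratic/cubic    | Quartic/quintic
Window size   |   5 |   7 |   9    |   5 |    7 |      9
j = −4        |     |     |  28    |     |      |  −4158
j = −3        |     |   5 |   7    |     | −117 |  12243
j = −2        |   2 |   0 |  −8    |  −3 |  603 |   4983
j = −1        |  −1 |  −3 | −17    |  48 | −171 |  −6963
j = 0         |  −2 |  −4 | −20    | −90 | −630 | −12210
j = 1         |  −1 |  −3 | −17    |  48 | −171 |  −6963
j = 2         |   2 |   0 |  −8    |  −3 |  603 |   4983
j = 3         |     |   5 |   7    |     | −117 |  12243
j = 4         |     |     |  28    |     |      |  −4158
Normalisation |   7 |  42 | 462    |  36 | 1188 |  56628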

of convolution. The principles of convolution are straightforward. Two functions, f and g, are convoluted to give h if

 

\[ h_i = \sum_{j=-p}^{p} f_j \, g_{i+j} \]

Sometimes this operation is written using a convolution operator denoted by an asterisk, so that

\[ h(i) = f(i) * g(i) \]

This process of convolution is exactly equivalent to digital filtering, in the example above:

\[ x_{\mathrm{new}}(i) = x(i) * g(i) \]

where g(i) is a filter function. It is, of course, possible to convolute any two functions with each other, provided that each is of the same size. It is possible to visualise these filters graphically. Figure 3.12 illustrates the convolution function for a three point MA, a Hanning window and a five point Savitsky–Golay second derivative quadratic/cubic filter. The resulting spectrum is the convolution of such functions with the raw data.

Convolution is a convenient general mathematical way of dealing with a number of methods for signal enhancement.
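As an illustration, the filters named above can be applied with a few lines of Python. This is a minimal sketch, not the book's own code: the function name `apply_filter` and the test signal are hypothetical, the Hanning weights 0.25, 0.5, 0.25 are the usual three point form, and the second derivative weights are the five point quadratic/cubic Savitsky–Golay coefficients with normalisation 7. End points, where the window would run past the data, are simply left undefined.

```python
import numpy as np

def apply_filter(data, coeffs):
    """Compute h_i = sum_{j=-p}^{p} coeffs_j * data_{i+j} for interior points.

    The first and last p points cannot be filtered with a full window
    and are returned as NaN.
    """
    data = np.asarray(data, dtype=float)
    coeffs = np.asarray(coeffs, dtype=float)
    p = len(coeffs) // 2
    h = np.full(len(data), np.nan)
    for i in range(p, len(data) - p):
        h[i] = np.dot(coeffs, data[i - p:i + p + 1])
    return h

# The three filter functions of Figure 3.12
moving_average = np.array([1, 1, 1]) / 3
hanning = np.array([0.25, 0.5, 0.25])
sg_second_derivative = np.array([2, -1, -2, -1, 2]) / 7

# Hypothetical test signal: x_i = i^2, whose second derivative is 2 everywhere
x = np.arange(10, dtype=float) ** 2
smoothed = apply_filter(x, moving_average)
curvature = apply_filter(x, sg_second_derivative)
```

Because all three filters are symmetric, the same interior values are obtained from `np.convolve(x, coeffs, mode="same")`, which reverses the coefficients before summing.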

Figure 3.12 From top to bottom: a three point moving average, a Hanning window and a five point Savitsky–Golay quadratic derivative window

3.4 Correlograms and Time Series Analysis

Time series analysis has a long statistical vintage, with major early applications in economics and engineering. The aim is to study cyclical trends. In the methods in Section 3.3, we were mainly concerned with peaks arising from chromatography or spectroscopy or else processes such as occur in manufacturing. There were no underlying cyclical features. However, in certain circumstances features can reoccur at regular intervals. These could arise from a geological process, a manufacturing plant or environmental monitoring, the cyclic changes being due to season of the year, time of day or even hourly events.

The aim of time series analysis is to reveal mainly the cyclical trends in a dataset. These will be buried within noncyclical phenomena and also various sources of noise. In spectroscopy, where the noise distributions are well understood and primarily stationary, Fourier transforms are the method of choice. However, when studying natural processes, there are likely to be a much larger number of factors influencing the response, including often correlated (or ARMA) noise. Under such circumstances, time series analysis is preferable and can reveal even weak cyclicities. The disadvantage is that original intensities are lost, the resultant information being primarily about how strong the evidence is that a particular process reoccurs regularly. Most methods for time series analysis involve the calculation of a correlogram at some stage.

3.4.1 Auto-correlograms

The most basic calculation is that of an auto-correlogram. Consider the information depicted in Figure 3.13, which represents a process changing with time. It appears that there is some cyclicity but this is buried within the noise. The data are presented in Table 3.7.

Figure 3.13 A time series (intensity versus time)

Table 3.7 Data of Figure 3.13 together with the data lagged by five points in time.

i  | Data, l = 0 | Data, l = 5
1  |  2.768 |  0.262
2  |  4.431 |  1.744
3  |  0.811 |  5.740
4  |  0.538 |  4.832
5  |  0.577 |  5.308
6  |  0.262 |  3.166
7  |  1.744 |  0.812
8  |  5.740 |  0.776
9  |  4.832 |  0.379
10 |  5.308 |  0.987
11 |  3.166 |  2.747
12 |  0.812 |  5.480
13 |  0.776 |  3.911
14 |  0.379 | 10.200
15 |  0.987 |  3.601
16 |  2.747 |  2.718
17 |  5.480 |  2.413
18 |  3.911 |  3.008
19 | 10.200 |  3.231
20 |  3.601 |  4.190
21 |  2.718 |  3.167
22 |  2.413 |  3.066
23 |  3.008 |  0.825
24 |  3.231 |  1.338
25 |  4.190 |  3.276
26 |  3.167 |
27 |  3.066 |
28 |  0.825 |
29 |  1.338 |
30 |  3.276 |


A correlogram involves calculating the correlation coefficient between a time series and itself, shifted by a given number of datapoints called a ‘lag’. If there are I datapoints in the original time series, then the correlation coefficient for a lag of l points is calculated from I − l datapoints. Hence, in Table 3.7, there are 30 points in the original dataset, but only 25 points in the dataset for which l = 5. Point number 1 in the shifted dataset corresponds to point number 6 in the original dataset. The correlation coefficient for lag l is given by

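\[
r_l = \frac{\displaystyle \sum_{i=1}^{I-l} x_i x_{i+l} - \frac{1}{I-l} \sum_{i=1}^{I-l} x_i \sum_{i=l+1}^{I} x_i}
{\sqrt{\displaystyle \sum_{i=1}^{I-l} x_i^2 - \frac{1}{I-l} \Bigl( \sum_{i=1}^{I-l} x_i \Bigr)^2} \;
 \sqrt{\displaystyle \sum_{i=l+1}^{I} x_i^2 - \frac{1}{I-l} \Bigl( \sum_{i=l+1}^{I} x_i \Bigr)^2}}
\]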

Sometimes a simplified equation is employed:

 

 

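\[
r_l = \frac{\Bigl[ \displaystyle \sum_{i=1}^{I-l} x_i x_{i+l} - \frac{1}{I-l} \sum_{i=1}^{I-l} x_i \sum_{i=l+1}^{I} x_i \Bigr] \Big/ (I-l)}
{\Bigl[ \displaystyle \sum_{i=1}^{I} x_i^2 - \frac{1}{I} \Bigl( \sum_{i=1}^{I} x_i \Bigr)^2 \Bigr] \Big/ I}
\]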

The latter equation is easier for repetitive computations because the term at the bottom needs to be calculated only once, and such shortcuts were helpful prior to the computer age. However, using modern packages, it is not difficult to use the first equation, which will be employed in this text. It is important, though, always to understand and check different methods. In most cases there is little difference between the two calculations.

There are a number of properties of the correlogram:

1. for a lag of 0, the correlation coefficient is 1;

2. it is possible to have both negative and positive lags, but for an auto-correlogram, $r_l = r_{-l}$, and sometimes only one half of the correlogram is displayed;

3. the closer the correlation coefficient is to 1, the more similar are the two series; if a high correlation is observed for a large lag, this indicates cyclicity;

4. as the lag increases, the number of datapoints used to calculate the correlation coefficient decreases, and so $r_l$ becomes less informative and more dependent on noise. Large values of l are not advisable; a good compromise is to calculate the correlogram for values of l up to I/2, or half the points in the original series.

The resultant correlogram is presented in Figure 3.14. The cyclic pattern is now much clearer than in the original data. Note that the graph is symmetric about the origin, as expected, and the maximum lag used in this example equals 14 points.
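The calculation of an auto-correlogram can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the procedure used to produce Figure 3.14: the function names are hypothetical, the series is synthetic (a sine wave with a period of 10 points plus random noise, not the data of Table 3.7), and the simplified form is taken to be the lagged covariance divided by the variance of the whole series.

```python
import numpy as np

def autocorr_full(x, lag):
    """Correlation coefficient between x_1..x_{I-l} and x_{l+1}..x_I."""
    x = np.asarray(x, dtype=float)
    lag = abs(lag)                      # an auto-correlogram is symmetric in the lag
    n = len(x)
    return float(np.corrcoef(x[:n - lag], x[lag:])[0, 1])

def autocorr_simplified(x, lag):
    """Lagged covariance divided by the variance of the whole series (assumed form)."""
    x = np.asarray(x, dtype=float)
    lag = abs(lag)
    n = len(x)
    a, b = x[:n - lag], x[lag:]
    numerator = (np.sum(a * b) - a.sum() * b.sum() / (n - lag)) / (n - lag)
    denominator = (np.sum(x ** 2) - x.sum() ** 2 / n) / n   # computed once per series
    return float(numerator / denominator)

# Synthetic series: 30 points, cycle of 10 points buried in noise
rng = np.random.default_rng(0)
t = np.arange(30)
series = np.sin(2 * np.pi * t / 10) + 0.3 * rng.normal(size=t.size)

max_lag = len(series) // 2              # lags up to I/2, as recommended above
lags = np.arange(-max_lag, max_lag + 1)
correlogram = np.array([autocorr_full(series, l) for l in lags])
```

Plotting `correlogram` against `lags` gives a graph that is symmetric about zero, with maxima near multiples of the cycle length and minima near odd multiples of half the cycle length.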

An auto-correlogram emphasises only cyclical features. Sometimes there are noncyclical trends superimposed over the time series. Such situations regularly occur in economics. Consider trying to determine the factors relating to expenditure in a seaside resort. A cyclical factor will undoubtedly be seasonal, there being more business in the summer. However, other factors such as interest rates and exchange rates will also come into play and the information will be mixed up in the resultant statistics. Expenditure can also be divided into food, accommodation, clothes and so on. Each will be influenced to a different extent by seasonality. Correlograms specifically emphasise the cyclical causes of expenditure. In chemistry, they are most valuable when time dependent noise interferes with stationary noise, for example in a river where there may be specific types of pollutants or changes in chemicals that occur spasmodically but, once discharged, take time to dissipate.

Figure 3.14 Auto-correlogram of the data in Figure 3.13 (correlation versus lag)

The correlogram can be processed further either by Fourier transformation or smoothing functions, or a combination of both; these techniques are discussed in Sections 3.3 and 3.5. Sometimes the results can be represented in the form of probabilities, for example the chance that there really is a genuine underlying cyclical trend of a given frequency. Such calculations, though, make certain definitive assumptions about the underlying noise distributions and experimental error and cannot always be generalised.

3.4.2 Cross-correlograms

It is possible to extend these principles to the comparison of two independent time series. Consider measuring the levels of Ag and Ni in a river with time. Although each may show a cyclical trend, are there trends common to both metals? The cross-correlation function between x and y can be calculated for a lag of l:

\[ r_l = \frac{c_{xy,l}}{s_x s_y} \]

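where $c_{xy,l}$ is the covariance between the functions at lag l, given by

\[ c_{xy,l} = \sum_{i=1}^{I-l} (x_i - \bar{x})(y_{i+l} - \bar{y}) \big/ (I-l) \qquad \text{for } l \geq 0 \]

\[ c_{xy,l} = \sum_{i=1}^{I-|l|} (x_{i+|l|} - \bar{x})(y_i - \bar{y}) \big/ (I-|l|) \qquad \text{for } l < 0 \]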

 

and s corresponds to the appropriate standard deviations (see Appendix A.3.1.3 for more details about the covariance). Note that the average of x and y should strictly be recalculated according to the number of datapoints in the window but, in practice, provided that the window is not too small the overall average is acceptable.

The cross-correlogram is no longer symmetric about zero, so a negative lag does not give the same result as a positive lag. Table 3.8 is for two time series, 1 and 2. The raw time series and the corresponding cross-correlogram are presented in Figure 3.15. The raw time series appear to exhibit a long-term trend to increase, but it is not entirely obvious that there are common cyclical features. The correlogram suggests that both contain a cyclical trend of around eight datapoints, since the correlogram exhibits a strong minimum at l = ±8.
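A cross-correlogram along these lines can be sketched in Python. The sketch makes several assumptions that are not taken from the text: the function name `cross_correlogram` is hypothetical, the standard deviations are population values computed over the whole of each series, the overall means are used at every lag, and the two series are synthetic (the second is the first delayed by three points) rather than the data of Table 3.8.

```python
import numpy as np

def cross_correlogram(x, y, max_lag):
    """Return lags and r_l = c_xy,l / (s_x * s_y) for l = -max_lag..max_lag."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    dx, dy = x - x.mean(), y - y.mean()   # overall means, not recalculated per window
    sx, sy = x.std(), y.std()             # population standard deviations (assumption)
    lags = np.arange(-max_lag, max_lag + 1)
    r = np.empty(len(lags))
    for k, lag in enumerate(lags):
        m = n - abs(lag)                  # number of overlapping datapoints
        if lag >= 0:
            c = np.sum(dx[:m] * dy[lag:]) / m     # x_i paired with y_{i+l}
        else:
            c = np.sum(dx[-lag:] * dy[:m]) / m    # x_{i+|l|} paired with y_i
        r[k] = c / (sx * sy)
    return lags, r

# Synthetic example: y repeats x three points later
t = np.arange(48)
x = np.sin(2 * np.pi * t / 12)
y = np.sin(2 * np.pi * (t - 3) / 12)
lags, r = cross_correlogram(x, y, max_lag=8)
best_lag = int(lags[np.argmax(r)])
```

In this example the largest correlation occurs at a lag of +3, while the value at a lag of −3 is negative, which illustrates why both halves of a cross-correlogram need to be displayed.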

3.4.3 Multivariate Correlograms

In the real world there may be a large number of variables that change with time, for example the composition of a manufactured product. In a chemical plant the resultant material could depend on a huge number of factors such as the quality of the raw material, the performance of the apparatus and even the time of day, which could relate to who is on shift or small changes in power supplies. Instead of monitoring each factor individually, it is common to obtain an overall statistical indicator, often related to a principal component (see Chapter 4). The correlogram is then computed from this mathematical summary of the raw data rather than from the concentration of an individual constituent.
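One possible way of obtaining such a summary is sketched below, assuming that the indicator is the score on the first principal component of the mean-centred data. The function names, the use of a singular value decomposition and the synthetic four-variable dataset are all illustrative assumptions, not a procedure prescribed in the text.

```python
import numpy as np

def first_pc_scores(data):
    """Scores of each time point on the first principal component."""
    centred = data - data.mean(axis=0)            # mean-centre each variable
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    return u[:, 0] * s[0]

def autocorrelogram(x, max_lag):
    """Correlation of a series with itself for lags 0..max_lag."""
    n = len(x)
    return np.array([np.corrcoef(x[:n - l], x[l:])[0, 1]
                     for l in range(max_lag + 1)])

# Synthetic process: four measured variables sharing one 12 point cycle
rng = np.random.default_rng(1)
t = np.arange(60)
cycle = np.sin(2 * np.pi * t / 12)
loadings = np.array([1.0, 0.8, -0.6, 0.4])
data = np.outer(cycle, loadings) + 0.2 * rng.normal(size=(60, 4))

scores = first_pc_scores(data)
r = autocorrelogram(scores, max_lag=30)
```

The sign of a principal component is arbitrary, but this does not affect the auto-correlogram, since reversing the sign of a series leaves its correlation with a lagged copy of itself unchanged.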

Table 3.8 Two time series, for which the cross-correlogram is presented in Figure 3.15.

Time | Series 1 | Series 2 | Time | Series 1 | Series 2
1  | 2.768 | 1.061 | 16 | 3.739 | 2.032
2  | 2.583 | 1.876 | 17 | 4.192 | 2.485
3  | 0.116 | 0.824 | 18 | 1.256 | 0.549
4  | 0.110 | 1.598 | 19 | 2.656 | 3.363
5  | 0.278 | 1.985 | 20 | 1.564 | 3.271
6  | 2.089 | 2.796 | 21 | 3.698 | 5.405
7  | 1.306 | 0.599 | 22 | 2.922 | 3.629
8  | 2.743 | 1.036 | 23 | 4.136 | 3.429
9  | 4.197 | 2.490 | 24 | 4.488 | 2.780
10 | 5.154 | 4.447 | 25 | 5.731 | 4.024
11 | 3.015 | 3.722 | 26 | 4.559 | 3.852
12 | 1.747 | 3.454 | 27 | 4.103 | 4.810
13 | 0.254 | 1.961 | 28 | 2.488 | 4.195
14 | 1.196 | 1.903 | 29 | 2.588 | 4.295
15 | 3.298 | 2.591 | 30 | 3.625 | 4.332

Figure 3.15 Two time series (top) and their corresponding cross-correlogram (bottom; correlation versus lag)

3.5 Fourier Transform Techniques

The mathematics of Fourier transformation (FT) has been well established for two centuries, but early computational algorithms were first applied in the 1960s, a prime method being the Cooley–Tukey algorithm. Originally employed in physics and engineering, FT techniques are now essential tools of the chemist. Modern NMR, IR and X-ray spectroscopy, among others, depend on FT methods. FTs have been extended to two-dimensional time series, and a wide variety of modifications, for example phasing, resolution enhancement and applications to image analysis, have been developed over the past two decades.

3.5.1 Fourier Transforms

3.5.1.1 General Principles

The original literature on Fourier series and transforms involved applications to continuous datasets. However, in chemical instrumentation, data are not sampled continuously but at regular intervals in time, so all data are digitised. The discrete Fourier transform (DFT) is used to process such data and will be described below. It is important to recognise that DFTs have specific properties that distinguish them from continuous FTs.

DFTs involve transformation between two types of data. In FT-NMR the raw data are acquired at regular intervals in time, often called the time domain, or more specifically a free induction decay (FID). FT-NMR has been developed over the years because it is much quicker to obtain data than using conventional (continuous wave) methods. An entire spectrum can be sampled in a few seconds, rather than minutes, speeding up the procedure of data acquisition by one to two orders of magnitude. This has meant that it is possible to record spectra of small quantities of compounds or of natural abundance of isotopes such as 13C, now routine in modern chemical laboratories.

The trouble with this is that the time domain is not easy to interpret, and here arises the need for DFTs. Each peak in a spectrum can be described by three parameters, namely a height, width and position, as in Section 3.2.1. In addition, each peak has a shape; in NMR this is Lorentzian. A spectrum consists of a sum of peaks and is often referred to as the frequency domain. However, raw data, e.g. in NMR, are recorded in the time domain and each frequency domain peak corresponds to a time series characterised by

• an initial intensity;

• an oscillation rate; and

• a decay rate.

The time domain consists of a sum of time series, each corresponding to a peak in the spectrum. Superimposed on this time series is noise. Fourier transforms convert the time series to a recognisable spectrum as indicated in Figure 3.16. Each parameter in the time domain corresponds to a parameter in the frequency domain as indicated in Table 3.9.

• The faster the rate of oscillation in the time series, the further away the peak is from the origin in the spectrum.

• The faster the rate of decay in the time series, the broader is the peak in the spectrum.

• The higher the initial intensity in the time series, the greater is the area of the transformed peak.

Figure 3.16 Fourier transformation from a time domain to a frequency domain (initial intensity, oscillation frequency and decay rate in the time domain correspond to peak area, position and width in the frequency domain)

Table 3.9 Equivalence between parameters in the time domain and frequency domain.

Time domain           | Frequency domain
Initial intensity     | Peak area
Oscillation frequency | Peak position
Decay rate            | Peak width

The peakshape in the frequency domain relates to the decay curve (or mechanism) in the time domain. The time domain equivalent of a Lorentzian peak is

\[ f(t) = A \cos(\omega t) \, e^{-t/s} \]

where A is the initial height (corresponding to the area in the transform), ω is the oscillation frequency (corresponding to the position in the transform) and s is the decay rate (corresponding to the peak width in the transform). The key to the lineshape is the exponential decay mechanism, and it can be shown that a decaying exponential transforms into a Lorentzian. Each type of time series has an equivalent in peakshape in the frequency domain, and together these are called a Fourier pair. It can be shown that a Gaussian in the frequency domain corresponds to a Gaussian in the time domain, and an infinitely sharp spike in the frequency domain to a nondecaying signal in the time domain.
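The correspondence between the time domain parameters and the transformed peak can be checked numerically. The following is a minimal sketch under stated assumptions: the sampling interval, number of points, frequency of 100 Hz and the two values of s are hypothetical choices, ω is taken as 2π times the frequency in Hz, and the helper names are illustrative. Note that with the decay written as exp(−t/s), a smaller s means a faster decay.

```python
import numpy as np

def decaying_cosine(t, amplitude, freq_hz, s):
    """f(t) = A cos(omega t) exp(-t/s), with omega = 2 * pi * freq_hz."""
    return amplitude * np.cos(2 * np.pi * freq_hz * t) * np.exp(-t / s)

def real_spectrum(signal, dt):
    """Frequencies (Hz) and real part of the discrete Fourier transform."""
    freqs = np.fft.rfftfreq(len(signal), d=dt)
    return freqs, np.fft.rfft(signal).real * dt

def width_at_half_height(freqs, y):
    """Approximate peak width: span of the points at or above half the maximum."""
    above = freqs[y >= y.max() / 2]
    return above.max() - above.min()

dt = 0.001                                  # sampling interval in seconds (assumed)
t = np.arange(4096) * dt
freqs, slow = real_spectrum(decaying_cosine(t, 1.0, 100.0, 0.10), dt)
_, fast = real_spectrum(decaying_cosine(t, 1.0, 100.0, 0.05), dt)

peak_position = freqs[np.argmax(slow)]      # set by the oscillation frequency
slow_width = width_at_half_height(freqs, slow)
fast_width = width_at_half_height(freqs, fast)
```

Both peaks appear close to 100 Hz, and the signal that decays faster gives the broader peak, in line with Table 3.9. The widths are only resolved to within the spacing of the frequency axis, which here is 1/(4096 × 0.001 s), about 0.24 Hz.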

In real spectra, there will be several peaks, and the time series appear much more complex than in Figure 3.16, consisting of several superimposed curves, as exemplified in Figure 3.17. The beauty of Fourier transform spectroscopy is that all the peaks can

Figure 3.17 Typical time series consisting of several components
