PATTERN RECOGNITION | 241
Table 4.26 Mahalanobis distances from class centroids.

Object | Distance to class A | Distance to class B | True classification | Predicted classification
1      | 1.25 | 5.58 | A | A
2      | 0.13 | 3.49 | A | A
3      | 0.84 | 2.77 | A | A
4      | 1.60 | 3.29 | A | A
5      | 1.68 | 1.02 | A | B
6      | 1.54 | 0.67 | A | B
7      | 0.98 | 5.11 | A | A
8      | 1.36 | 5.70 | A | A
9      | 2.27 | 3.33 | A | A
10     | 2.53 | 0.71 | B | B
11     | 2.91 | 2.41 | B | B
12     | 1.20 | 1.37 | B | A
13     | 5.52 | 2.55 | B | B
14     | 1.57 | 0.63 | B | B
15     | 2.45 | 0.78 | B | B
16     | 3.32 | 1.50 | B | B
17     | 2.04 | 0.62 | B | B
18     | 1.21 | 1.25 | B | A
19     | 1.98 | 0.36 | B | B
Figure 4.31
Class distance plot: horizontal axis = distance to class A, vertical axis = distance to class B (objects labelled by number)
242 | CHEMOMETRICS
4.5.2.4 Extending the Methods
There are numerous extensions to the basic method in the section above, some of which are listed below.
• Probabilities of class membership and class boundaries or confidence intervals can be constructed, assuming a multivariate normal distribution.
• Discriminatory power of variables. This is described in Section 4.5.3 in the context of SIMCA, but similar parameters can be obtained for any method of discrimination. This can have practical consequences; for example, if variables are expensive to measure it is possible to reduce the expense by making only the most useful measurements.
• The method can be extended to any number of classes. It is easy to calculate the Mahalanobis distance to several classes and determine the most appropriate classification of an unknown sample simply by finding the smallest class distance. More discriminant functions are required, however, and the computation can become rather complicated.
• Instead of using raw data, it is possible to use the PCs of the data. This acts as a form of variable reduction, but also simplifies the distance measures, because the variance–covariance matrix will contain nonzero elements only on the diagonal. The expressions for Mahalanobis distance and linear discriminant functions simplify dramatically.
• One interesting modification involves the use of Bayesian classification functions. Such concepts have been introduced previously (see Chapter 3) in the context of signal analysis. The principle is that membership of each class has a predefined (prior) probability, and the measurements are primarily used to refine this. If we are performing a routine screen of blood samples for drugs, the prior probability that any one sample contains a given drug might be very low (less than 1 %). In other situations an expert might already have done preliminary tests and suspects that a sample belongs to a certain class, so the prior probability of class membership is much higher than in the population as a whole; this can be taken into account. Normal statistical methods assume that there is an equal chance of membership of each class, but the distance measure can be modified as follows:

d_iA² = (x_i − x̄_A) · C_A⁻¹ · (x_i − x̄_A)′ − 2 ln(q_A)

where q_A is the prior probability of membership of class A; a higher prior probability therefore reduces the effective distance to that class. The prior probabilities of membership of all relevant classes must add up to 1. It is even possible to use the Bayesian approach to refine class membership as more data are acquired. Screening will provide a probability of class membership, which can then be used as the prior probability for a more rigorous test, and so on.
• Several other functions, such as the quadratic and regularised discriminant functions, have been proposed and are suitable under certain circumstances.
It is always important to examine each specific paper and software package for details on the parameters that have been calculated. In many cases, however, fairly straightforward approaches suffice.
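The prior-adjusted Mahalanobis classifier outlined above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the book's own code: the two-class synthetic data, the class names and the priors are assumptions made for the example.

```python
import numpy as np

def mahalanobis_sq(x, centroid, cov):
    # Squared Mahalanobis distance (x - xbar) C^-1 (x - xbar)'
    d = x - centroid
    return float(d @ np.linalg.inv(cov) @ d)

def bayes_classify(x, classes, priors):
    # Smallest prior-adjusted squared distance d^2 - 2 ln(q) wins;
    # with equal priors this reduces to the ordinary Mahalanobis rule
    scores = {name: mahalanobis_sq(x, c, S) - 2.0 * np.log(priors[name])
              for name, (c, S) in classes.items()}
    return min(scores, key=scores.get), scores

# Synthetic two-class training data (for illustration only)
rng = np.random.default_rng(0)
A = rng.normal([0.0, 0.0], 1.0, (50, 2))
B = rng.normal([3.0, 3.0], 1.0, (50, 2))
classes = {"A": (A.mean(axis=0), np.cov(A, rowvar=False)),
           "B": (B.mean(axis=0), np.cov(B, rowvar=False))}

label, scores = bayes_classify(np.array([0.5, 0.2]), classes,
                               {"A": 0.5, "B": 0.5})
print(label)  # the point lies well inside class A
```

With equal priors the −2 ln(q) terms cancel when comparing classes; lowering the prior for one class penalises its distance and shifts borderline objects to the other class.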
4.5.3 SIMCA
The SIMCA method, first advocated by S. Wold in the early 1970s, is regarded by many as a form of soft modelling used in chemical pattern recognition. Although there are some differences from linear discriminant analysis as employed in traditional statistics, the distinction is not as radical as many would believe. However, SIMCA has an important role in the history of chemometrics, so it is important to understand the main steps of the method.
4.5.3.1 Principles
The acronym SIMCA stands for soft independent modelling of class analogy. The idea of soft modelling is illustrated in Figure 4.32. Two classes can overlap (and hence are ‘soft’), and there is no problem with an object belonging to both (or neither) class simultaneously. In most other areas of statistics, we insist that an object belongs to a discrete class, hence the concept of hard modelling. For example, a biologist trying to determine the sex of an animal from circumstantial evidence (e.g. urine samples) knows that the animal cannot belong to two sexes at the same time, and a forensic scientist trying to determine whether a banknote is forged or not knows
Figure 4.32
Two overlapping classes
that there can be only one true answer: if this appears not to be so, the problem lies with the quality of the evidence. The philosophy of soft modelling is that, in many situations in chemistry, it is entirely legitimate for an object to fit into more than one class simultaneously, for example a compound may have an ester and an alkene group, and so will exhibit spectroscopic characteristics of both functionalities, so a method that assumes that the answer must be either an ester or an alkene is unrealistic. In practice, though, it is possible to calculate class distances from discriminant analysis (Section 4.5.2.3) that are close to two or more groups.
Independent modelling of classes, however, is a more useful feature. After making a number of measurements on ketones and alkenes, we may decide to include amides in the model. Figure 4.33 represents the effect of this additional class, which can be added independently to the existing model without any changes. In contrast, using classical discriminant analysis the entire modelling procedure must be repeated if extra groups are added, since the pooled variance–covariance matrix must be recalculated.
4.5.3.2 Methodology
The main steps of SIMCA are as follows.
Principal Components Analysis
Each group is independently modelled using PCA. Note that each group could be described by a different number of PCs. Figure 4.34 represents two groups, each
Figure 4.33
Three overlapping classes
Figure 4.34
Two groups obtained from three measurements
characterised by three raw measurements, e.g. chromatographic peak heights or physical properties. However, one group falls mainly on a straight line, defined as the first principal component of the group, whereas the other falls roughly on a plane whose axes are defined by the first two principal components of this group. When we perform discriminant analysis we can also use PCs prior to classification, but the difference is that the PCs are of the entire dataset (which may consist of several groups) rather than of each group separately.
A number of methods have been proposed for determining how many PCs are best suited to describe a class; these have been described in Section 4.3.3. The original advocates of SIMCA used cross-validation, but there is no reason why one cannot ‘pick and mix’ various steps in different methods.
Class Distance
The class distance can be calculated as the geometric distance from the PC models; see the illustration in Figure 4.35. The unknown is much closer to the plane than to the line, and so is tentatively assigned to this class. A more elaborate approach is often employed in which each group is bounded by a region of space which represents 95 % confidence that a particular object belongs to the class. Hence geometric class distances can be converted to statistical probabilities.
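The geometric class distance can be sketched as the norm of the part of an object that a class PC model fails to reconstruct. The following NumPy version is illustrative only: the synthetic data mimicking Figure 4.34 (one class near a line, the other near a plane) and the unknown point are assumptions, not the book's values.

```python
import numpy as np

def pc_model(X, n_components):
    # PC model of one class: mean vector plus loadings from SVD
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def class_distance(x, mean, P):
    # Geometric (orthogonal) distance of x from the class PC model:
    # the norm of whatever the model cannot reconstruct
    xc = x - mean
    residual = xc - (xc @ P.T) @ P
    return float(np.linalg.norm(residual))

# Synthetic data mimicking Figure 4.34: class A lies near a line
# (one PC), class B near a plane (two PCs), in three measurements
rng = np.random.default_rng(1)
A = rng.normal(size=(40, 1)) @ np.array([[1.0, 2.0, 0.5]]) \
    + 0.05 * rng.normal(size=(40, 3))
B = rng.normal(size=(40, 2)) @ np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]) \
    + np.array([5.0, 5.0, 5.0]) + 0.05 * rng.normal(size=(40, 3))

mA, PA = pc_model(A, 1)
mB, PB = pc_model(B, 2)
unknown = np.array([5.5, 4.5, 5.0])       # sits close to class B's plane
dA = class_distance(unknown, mA, PA)
dB = class_distance(unknown, mB, PB)
print(round(dA, 2), round(dB, 2))         # much smaller distance to class B
```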
Modelling Power
The modelling power of each variable for each separate class is defined by
M_j = 1 − s_j^resid / s_j^raw
Figure 4.35
Class distance of unknown object (represented by an asterisk) to two classes in Figure 4.34
where s_j^raw is the standard deviation of variable j in the raw data and s_j^resid the standard deviation of variable j in the residuals given by

E = X − T · P
which is the difference between the observed data and the PC model for the class. This is illustrated in Table 4.27 for a single class, with the following steps.
1. The raw data consist of seven objects and six measurements, forming a 7 × 6 matrix. The standard deviation of each of the six variables is calculated.
2. PCA is performed on these data and two PCs are retained. The scores (T) and loadings (P) matrices are computed.
3. The original data are estimated by PCA by multiplying T and P together.
4. The residuals, which are the difference between the datasets obtained in steps 1 and 3, are computed (note that there are several alternative ways of doing this). The standard deviation of the residuals for each of the six variables is computed.
5. Finally, using the standard deviations obtained in steps 1 and 4, the modelling power of the variables for this class is obtained.
The modelling power varies between 1 (excellent) and 0 (the variable contributes nothing to the class model). Variables with M below 0.5 are of little use. In this example, it can be seen that variables 4 and 6 are not very helpful, whereas variables 2 and 5 are extremely useful. This information can be used to reduce the number of measurements.
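The five steps can be reproduced numerically from the raw data in Table 4.27. The sketch below assumes an uncentred two-component PCA via SVD (which appears to match the tabulated scores and loadings) and population standard deviations; other software may make slightly different choices.

```python
import numpy as np

# Raw data from Table 4.27 (7 objects x 6 variables)
X = np.array([
    [24.990, 122.326,  68.195, 28.452, 105.156, 48.585],
    [14.823,  51.048,  57.559, 21.632,  62.764, 26.556],
    [28.612, 171.254, 114.479, 43.287, 159.077, 65.354],
    [41.462, 189.286, 113.007, 25.289, 173.303, 50.215],
    [ 5.168,  50.860,  37.123, 16.890,  57.282, 35.782],
    [20.291,  88.592,  63.509, 14.383,  85.250, 35.784],
    [31.760, 151.074,  78.699, 34.493, 141.554, 56.030]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # uncentred PCA
T = U[:, :2] * s[:2]                              # scores, 2 PCs retained
P = Vt[:2]                                        # loadings
E = X - T @ P                                     # residuals
M = 1 - E.std(axis=0) / X.std(axis=0)             # modelling power per variable
print(np.round(M, 3))  # close to [0.711 0.941 0.740 0.558 0.934 0.567]
```

Variables 2 and 5 come out with the highest modelling power, and variables 4 and 6 the lowest, in agreement with step 5 of the table.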
Table 4.27 Calculation of modelling power in SIMCA.

1. Raw data (7 objects × 6 variables)

   24.990  122.326   68.195   28.452  105.156   48.585
   14.823   51.048   57.559   21.632   62.764   26.556
   28.612  171.254  114.479   43.287  159.077   65.354
   41.462  189.286  113.007   25.289  173.303   50.215
    5.168   50.860   37.123   16.890   57.282   35.782
   20.291   88.592   63.509   14.383   85.250   35.784
   31.760  151.074   78.699   34.493  141.554   56.030

   s_j^raw: 10.954  51.941  26.527  9.363  43.142  12.447

2. PCA

   Scores (T, 7 × 2)        Loadings (P, shown as 6 × 2)
   185.319    0.168         0.128  −0.200
   103.375   19.730         0.636  −0.509
   273.044    9.967         0.396   0.453
   288.140  −19.083         0.134   0.453
    92.149   15.451         0.594  −0.023
   144.788    3.371         0.229   0.539
   232.738   −5.179

3. Data estimated by PCA (T · P)

   23.746  117.700   73.458   24.861  110.006   42.551
    9.325   55.652   49.869   22.754   60.906   34.313
   33.046  168.464  112.633   41.028  161.853   67.929
   40.785  192.858  105.454   29.902  171.492   55.739
    8.739   50.697   43.486   19.316   54.342   29.436
   17.906   90.307   58.859   20.890   85.871   34.990
   30.899  150.562   89.814   28.784  138.280   50.536

4. Residuals

    1.244    4.626   −5.263    3.591   −4.850    6.034
    5.498   −4.604    7.689   −1.122    1.858   −7.757
   −4.434    2.790    1.846    2.259   −2.776   −2.575
    0.677   −3.572    7.553   −4.612    1.811   −5.524
   −3.570    0.163   −6.363   −2.426    2.940    6.346
    2.385   −1.715    4.649   −6.508   −0.621    0.794
    0.861    0.512  −11.115    5.709    3.274    5.494

   s_j^resid: 3.164  3.068  6.895  4.140  2.862  5.394

5. Modelling power

   0.711  0.941  0.740  0.558  0.934  0.567
Discriminatory Power
Another measure is how well a variable discriminates between two classes. This is distinct from modelling power – being able to model one class well does not necessarily imply being able to discriminate between two groups effectively. In order to determine this, it is necessary to fit each sample to both class models; for example, sample 1 is fitted to the PC models of both class A and class B. The residual matrices are then calculated, just as for modelling power, but there are now four such matrices:
1.samples in class A fitted to the model of class A;
2.samples in class A fitted to the model of class B;
3.samples in class B fitted to the model of class B;
4.samples in class B fitted to the model of class A.
We would expect matrices 2 and 4 to be a worse fit than matrices 1 and 3. The standard deviations are then calculated for these matrices to give
D_j = √{ [ (s_j^resid class A, model B)² + (s_j^resid class B, model A)² ] / [ (s_j^resid class A, model A)² + (s_j^resid class B, model B)² ] }
The bigger the value, the higher is the discriminatory power. This could be useful information, for example if clinical or forensic measurements are expensive, so allowing the experimenter to choose only the most effective measurements. Discriminatory power can be calculated between any two classes.
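The calculation above can be sketched by fitting each class to both class models and comparing the residual standard deviations. The per-class mean-centring and the five-variable synthetic data below are assumptions made for illustration; implementations differ in such details.

```python
import numpy as np

def class_model(X, a):
    # Per-class PC model: the class mean plus 'a' loadings from SVD
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:a]

def resid_sd(X, model):
    # Per-variable standard deviation of the residuals after fitting
    # the samples in X to a (possibly foreign) class model
    mean, P = model
    Xc = X - mean
    E = Xc - (Xc @ P.T) @ P
    return E.std(axis=0)

def discriminatory_power(XA, XB, mA, mB):
    # D_j: cross-fitted residuals over same-class residuals
    num = resid_sd(XA, mB) ** 2 + resid_sd(XB, mA) ** 2
    den = resid_sd(XA, mA) ** 2 + resid_sd(XB, mB) ** 2
    return np.sqrt(num / den)

# Synthetic example: each class varies along different variables
# (variables 1-2 for A, 3-4 for B); variable 5 is noise in both
rng = np.random.default_rng(3)
XA = rng.normal(size=(30, 1)) @ np.array([[2.0, 2.0, 0.0, 0.0, 0.0]]) \
     + 0.3 * rng.normal(size=(30, 5))
XB = rng.normal(size=(30, 1)) @ np.array([[0.0, 0.0, 2.0, 2.0, 0.0]]) \
     + 0.3 * rng.normal(size=(30, 5))
D = discriminatory_power(XA, XB, class_model(XA, 1), class_model(XB, 1))
print(np.round(D, 2))  # variables 1-4 score well above variable 5
```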
4.5.3.3 Validation
Like all methods for supervised pattern recognition, testing and cross-validation are possible in SIMCA. There is a mystique in the chemometrics literature whereby general procedures are often mistakenly associated with a particular package or algorithm; this is largely because the advocates promote specific strategies that form part of widely used papers or software. It should also be recognised that there are two different needs for validation: the first is to determine the optimum number of PCs for a given class model, and the second is to determine whether unknowns are classified adequately.
4.5.4 Discriminant PLS
An important variant that is gaining popularity in chemometrics circles is called discriminant PLS (DPLS). We describe the PLS method in more detail in Chapter 5, Section 5.4 and also Appendix A.2, but it is essentially another approach to regression, complementary to MLR with certain advantages in particular situations. The principle is fairly simple. For each class, set up a model
ĉ = T · q
where T contains the PLS scores obtained from the original data, q is a vector whose length equals the number of significant PLS components, and ĉ is a class membership function; this is obtained by PLS regression from an original c vector, whose elements have values of 1 if an object is a member of the class and 0 otherwise, and an X matrix consisting of the original preprocessed data. If there are three classes, then a matrix Ĉ, of dimensions I × 3, can be obtained from a training set consisting of I objects; the closer each element is to 1, the more likely an object is to be a member of the particular class. All the normal procedures of training and test sets, and cross-validation, can be used with DPLS, and various extensions have been proposed in the literature.
MLR on to a class membership function would not work well, because the aim of conventional regression is to model the data exactly and it is unlikely that a good relationship could be obtained; PLS does not require an exact fit to the data, so DPLS is more effective. Because we describe the PLS method in detail in
Chapter 5, we will not give a numerical example in this chapter, but if the number of variables is fairly large, approaches such as Mahalanobis distance are not effective unless there is variable reduction either by first using PCA or simply selecting some of the measurements, so the approach discussed in this section is worth trying.
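The DPLS idea can be sketched with a minimal PLS1 routine (NIPALS-style deflation) and a synthetic two-class training set. The data and the choice of two components are assumptions made for illustration only; the full PLS algorithm is described in Chapter 5.

```python
import numpy as np

def pls1(X, c, a):
    # Minimal PLS1 with NIPALS-style deflation: returns scores T and
    # the vector q so that c is approximated by T @ q
    X, c = X.copy(), c.astype(float).copy()
    T, q = [], []
    for _ in range(a):
        w = X.T @ c
        w /= np.linalg.norm(w)      # weights
        t = X @ w                   # scores
        tt = t @ t
        p = X.T @ t / tt            # loadings
        q.append(c @ t / tt)
        T.append(t)
        X -= np.outer(t, p)         # deflate X
        c -= q[-1] * t              # deflate c
    return np.column_stack(T), np.array(q)

# Toy training set: ten objects in class A, ten in class B;
# c holds 1 for members of class A and 0 otherwise
rng = np.random.default_rng(4)
X = np.vstack([rng.normal([2, 2, 0], 0.5, (10, 3)),
               rng.normal([-2, -2, 0], 0.5, (10, 3))])
c = np.array([1] * 10 + [0] * 10)

T, q = pls1(X - X.mean(axis=0), c - c.mean(), 2)
c_hat = T @ q + c.mean()           # predicted class membership
print(np.round(c_hat, 2))          # near 1 for class A rows, near 0 for class B
```

In a multi-class problem the same regression is simply repeated with one membership vector per class, giving the Ĉ matrix described above.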
4.5.5 K Nearest Neighbours
The methods of SIMCA, discriminant analysis and DPLS involve producing statistical models, such as principal components and canonical variates. Nearest neighbour methods are conceptually much simpler, and do not require elaborate statistical computations.
The KNN (or K nearest neighbour) method has been with chemists for over 30 years. The algorithm starts with a number of objects assigned to each class. Figure 4.36 represents five objects belonging to each of two classes, A and B, recorded using two measurements, which may, for example, be chromatographic peak areas or absorption intensities at two wavelengths; the raw data are presented in Table 4.28.
4.5.5.1 Methodology
The method is implemented as follows.
1. Assign a training set to known classes.
2. Calculate the distance of an unknown to all members of the training set (see Table 4.28). Usually a simple ‘Euclidean’ distance is computed (see Section 4.4.1).
3. Rank these in order (1 = smallest distance, and so on).
Figure 4.36
Two groups, class A and class B, plotted against the two measurements x1 and x2
250 | CHEMOMETRICS
Table 4.28 Example for KNN classification: the three closest distances are marked with an asterisk.

Class   | x1    | x2   | Distance to unknown | Rank
A       | 5.77  | 8.86 | 3.86  | 6
A       | 10.54 | 5.21 | 5.76  | 10
A       | 7.16  | 4.89 | 2.39  | 4
A       | 10.53 | 5.05 | 5.75  | 9
A       | 8.96  | 3.23 | 4.60  | 8
B       | 3.11  | 6.04 | 1.91* | 3
B       | 4.22  | 6.89 | 1.84* | 2
B       | 6.33  | 8.99 | 4.16  | 7
B       | 4.36  | 3.88 | 1.32* | 1
B       | 3.54  | 8.28 | 3.39  | 5
Unknown | 4.78  | 5.13 |       |
Figure 4.37
Three nearest neighbours to unknown, plotted against x1 and x2
4. Pick the K smallest distances and see which classes the unknown is closest to; K is usually a small odd number. The case where K = 3 is illustrated in Figure 4.37 for our example: all three nearest objects belong to class B.
5. Take the ‘majority vote’ and use this for classification. Note that if K = 5, one of the five closest objects belongs to class A in this case.
6. Sometimes it is useful to perform KNN analysis for a number of values of K, e.g. 3, 5 and 7, and see whether the classification changes. This can be used to spot anomalies or artefacts.
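The steps above can be verified directly against the data in Table 4.28. This is a minimal sketch of the voting rule (ties, which can arise for even K, are not handled):

```python
import numpy as np

# Training set and unknown from Table 4.28
train = np.array([
    [5.77, 8.86], [10.54, 5.21], [7.16, 4.89], [10.53, 5.05], [8.96, 3.23],  # class A
    [3.11, 6.04], [4.22, 6.89], [6.33, 8.99], [4.36, 3.88], [3.54, 8.28]])   # class B
labels = np.array(list("AAAAABBBBB"))
unknown = np.array([4.78, 5.13])

def knn(x, X, y, k):
    # Majority vote among the k nearest (Euclidean) training objects
    d = np.linalg.norm(X - x, axis=1)
    nearest = y[np.argsort(d)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

print(knn(unknown, train, labels, 3))  # B: the three nearest objects are all class B
```

Running the same call with k = 5 or 7, as suggested in step 6, checks whether the classification is stable.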