Brereton Chemometrics


APPENDICES


Figure A.38

Obtaining the pseudoinverse in Matlab

the pseudoinverse can simply be obtained by the function pinv, without any further commands; see Figure A.38.

For a comprehensive list of facilities, see the manuals that come with Matlab, or the help files; however, a few that are useful to the reader of this book are as follows. The size function gives the dimensions of a matrix, so size(W) will return a 1 × 2 row vector with elements, in our example, of 2 and 3. It is possible to create a new vector, for example, s = size(W); in such a situation s(1) will equal 2, the number of rows. The element W(s(1), s(2)) represents the last element in the matrix W. In addition, it is possible to use the functions size(W,1) and size(W,2), which provide the number of rows and columns directly. These functions are very useful when writing simple programs as discussed below.
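The uses of size described above can be sketched as follows. The values in W are an assumption made for illustration, since the text here only states that W is a 2 × 3 matrix; the names nRows, nCols and lastElement are likewise hypothetical.

```matlab
% W is an assumed 2 x 3 example matrix (only its dimensions are given in the text)
W = [2 7 8; 0 1 6];

s = size(W);                   % row vector [rows columns]
nRows = size(W, 1);            % number of rows
nCols = size(W, 2);            % number of columns
lastElement = W(s(1), s(2));   % element in the last row and last column
```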

The mean function can be used in various ways. By default this function produces the mean of each column in a matrix, so that mean(W) results in a 1 × 3 row vector containing the means. It is possible to specify which dimension one wishes to take the mean over, the default being the first one. The overall mean of an entire matrix is obtained using the mean function twice, i.e. mean(mean(W)). Note that the mean of a vector is always a single number whether the vector is a column or row vector. This function is illustrated in Figure A.39. Similar syntax applies to functions such as min, max and std, but note that the last function calculates the sample rather than population standard deviation, so if it is employed for scaling in chemometrics you must convert to the population standard deviation, in the current case by typing std(W)/sqrt((s(1))/(s(1)-1)), where sqrt is a function that calculates the square root and s(1) contains the number of rows in the matrix. Similar remarks apply to the var function, but it is not necessary to use a square root in the calculation.
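A minimal sketch of these calculations is given below, again assuming illustrative values for W. The call std(W,1) is not mentioned in the text; it is included only as a cross-check, since a second argument of 1 asks Matlab to divide by the number of rows rather than by one less.

```matlab
W = [2 7 8; 0 1 6];        % assumed example matrix
s = size(W);

colMeans = mean(W);        % 1 x 3 row vector of column means
rowMeans = mean(W, 2);     % 2 x 1 column vector of row means
grandMean = mean(mean(W)); % mean of all the elements

sampleStd = std(W);                          % divides by (number of rows - 1)
popStd = std(W) / sqrt(s(1) / (s(1) - 1));   % conversion described in the text
popStdDirect = std(W, 1);                    % cross-check: divides by the number of rows
popVar = var(W) / (s(1) / (s(1) - 1));       % same conversion, no square root needed
```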

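The norm function is often useful. For a vector it returns the square root of the sum of squares, so if Y is a row vector, then sqrt(Y*Y') is the same as norm(Y); this can be useful when scaling data. For a matrix, norm returns the largest singular value rather than the square root of the sum of squares of all the elements, so in our example norm(W) equals 12.0419; the square root of the sum of squares of all the elements is obtained using norm(W,'fro').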
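The behaviour of norm can be checked with a short sketch. The matrix W below is an assumption: it is a 2 × 3 matrix chosen because it reproduces the value 12.0419 quoted above, and Y is a hypothetical row vector. For matrix input, norm returns the largest singular value, whereas the 'fro' option gives the square root of the sum of squares of all the elements.

```matlab
W = [2 7 8; 0 1 6];          % assumed example matrix, consistent with norm(W) = 12.0419
Y = [3 4];                   % hypothetical row vector

matrixNorm = norm(W);        % largest singular value of W
frobNorm = norm(W, 'fro');   % square root of the sum of squares of all elements
vectorNorm = norm(Y);        % square root of the sum of squares of the vector
sameThing = sqrt(Y * Y');    % equivalent expression for a row vector
```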

Figure A.39

Mean function in Matlab

It is useful to combine some of these functions, for example min(s) would be the minimum dimension of matrix W. Enthusiasts can increase the number of variables within a function, an example being min([s 2 4]), which finds the minimum of all the numbers in vector s together with 2 and 4. This facility can be useful if it is desired to limit the number of principal components or eigenvalues displayed. If Spec is a spectral matrix of variable dimensions, and we know that we will never have more than 10 significant components, then min([size(Spec) 10]) will choose the smaller of the two dimensions of Spec, or 10 if both dimensions are larger than this.
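As an illustration of limiting the number of components, the sketch below uses random numbers as a stand-in for a spectral matrix; the dimensions and the names nShow and nShowSmall are assumptions. Note that the extra value 10 must be placed inside the square brackets so that min receives a single vector.

```matlab
Spec = rand(25, 100);               % hypothetical spectral matrix, 25 x 100
s = size(Spec);

smallestDim = min(s);               % smaller of the two dimensions
nShow = min([size(Spec) 10]);       % capped at 10 components

Small = rand(6, 100);               % hypothetical matrix with only six rows
nShowSmall = min([size(Small) 10]); % limited by the six rows instead
```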

Some functions operate on individual elements rather than rows or columns. For example, sqrt(W) results in a new matrix of dimensions identical with W containing the square root of each element. In most cases whether a function returns a matrix, vector or scalar is common sense, but there are certain linguistic features, a few rather historical, so if in doubt test out the function first.
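Testing out a function first, as suggested, takes only a line or two. The sketch below assumes illustrative values for W and simply compares the size of the result with the size of the input.

```matlab
W = [2 7 8; 0 1 6];                       % assumed example matrix
R = sqrt(W);                              % square root of every element
sameSize = isequal(size(R), size(W));     % true if the result has the dimensions of W
```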

A.5.4.4 Preprocessing

Preprocessing is slightly awkward in Matlab. One way is to write a small program with loops as described in Section A.5.6. If you think in terms of vectors and matrices, however, it is fairly easy to come up with a simple approach. If W is our original 2 × 3 matrix and we want to mean centre the columns, we can easily obtain a 1 × 3 vector w which corresponds to the means of each column, multiply this by a 2 × 1 vector 1 giving a 2 × 3 matrix consisting of the means, and so our new mean centred matrix V can be calculated as V = W − 1.w as illustrated in Figure A.40. There is a special function in Matlab called ones that also creates vectors or matrices that just consist of the number 1, there being several ways of using this, but ones(5,3) would create a matrix of dimensions 5 × 3 consisting solely of 1s, so a 2 × 1 vector could be specified using the function ones(2,1) as an alternative to the approach illustrated in the figure.

Figure A.40

Mean centring a matrix in Matlab
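The calculation V = W − 1.w can be sketched using the ones function. This is not a reproduction of the figure; the values of W and the variable name one are assumptions made for illustration.

```matlab
W = [2 7 8; 0 1 6];          % assumed example matrix
w = mean(W);                 % 1 x 3 row vector of column means
one = ones(size(W, 1), 1);   % 2 x 1 column vector of 1s
V = W - one * w;             % subtract the 2 x 3 matrix of means
```

Each column of V should then have a mean of zero, which is a convenient check.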

The experienced user of Matlab can build on this to perform other common methods for preprocessing, such as standardisation.
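One possible extension to standardisation is sketched below. It assumes that the population standard deviation is wanted, obtained here with std(W,1), and it uses illustrative values for W; neither choice is prescribed by the text.

```matlab
W = [2 7 8; 0 1 6];              % assumed example matrix
one = ones(size(W, 1), 1);       % column vector of 1s
w = mean(W);                     % column means
sd = std(W, 1);                  % population standard deviation of each column
Z = (W - one * w) ./ (one * sd); % mean centre, then divide element by element
```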

A.5.4.5 Principal Components Analysis

PCA is simple in Matlab. The singular value decomposition (SVD) algorithm is employed, but this should normally give equivalent results to NIPALS except that all the PCs are calculated at once. One difference is that the scores and loadings are both normalised, so that for SVD

X = U.S.V

where, using the notation elsewhere in the text,

T = U.S

and

V = P


The matrix U is equivalent to T but the sum of squares of the elements of each column equals 1, and S is a matrix whose dimensions equal the number of PCs, whose diagonal elements equal the square roots of the eigenvalues and whose remaining elements equal 0. The command svd(X) will display the singular values, which are the square roots of the eigenvalues. To obtain the scores, loadings and eigenvalue matrices, use the command [U,S,V] = svd(X). Matlab returns these matrices such that X = U*S*V', so the loadings matrix P of this text corresponds to the transpose V'. Note that the dimensions of these three matrices differ slightly from those above in that S is not a square matrix, and U and V are square matrices with their respective dimensions equal to the number of rows and columns in the original data. If X is an I × J matrix then U will be a matrix of dimensions I × I, S of dimensions I × J (the same as the original matrix) and V of dimensions J × J. To obtain a scores matrix equivalent to that using the NIPALS algorithm, simply calculate T = U * S. The sum of squares of each column of T will equal the corresponding eigenvalue (as defined in this text). Note that if J > I, columns I + 1 to J of matrix V will have no meaning, and equivalently if I > J the last columns of matrix U will have no meaning. The U and V matrices returned by Matlab are square matrices.
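A short sketch of these steps follows, using an assumed 2 × 3 data matrix. The names P, Xhat, eigenvalues and singularValues are introduced here for illustration; P is taken as the transpose of the V returned by Matlab, since svd satisfies X = U*S*V'.

```matlab
X = [2 7 8; 0 1 6];             % assumed data matrix, I = 2 rows, J = 3 columns
[U, S, V] = svd(X);             % U is I x I, S is I x J, V is J x J

T = U * S;                      % scores
P = V';                         % loadings in the notation of the text
Xhat = T * P;                   % reconstructs X to within rounding error

eigenvalues = sum(T .^ 2, 1);   % sum of squares of each column of T
singularValues = svd(X);        % square roots of the eigenvalues
```

Because J > I in this example, the third column of T is zero and the third column of V carries no information.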

One problem about the default method for SVD is that the matrices can become rather large if there are many variables, as often happens in spectroscopy or chromatography. There are a number of ways of reducing the size of the matrices if we want to calculate only a few PCs, the simplest being to use the svds function; the second argument restricts the number of PCs. Thus [U,S,V] = svds(X,5) calculates the first five PCs, so if the original data matrix was of dimensions 25 × 100, U becomes 25 × 5, S becomes 5 × 5 (containing only five nonzero values down the diagonal) and V becomes 100 × 5, so that the loadings V' are of dimensions 5 × 100.
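The dimensions can be confirmed with a sketch on random numbers standing in for a 25 × 100 data matrix (an assumption made for illustration). Note that svds returns V with one column per PC, that is 100 × 5 here, so it is the transpose that has the 5 × 100 shape of a loadings matrix.

```matlab
X = rand(25, 100);          % hypothetical 25 x 100 data matrix
[U, S, V] = svds(X, 5);     % first five PCs only

T = U * S;                  % 25 x 5 scores
P = V';                     % 5 x 100 loadings
```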

A.5.5 Numerical Data

In chemometrics we want to perform operations on numerical data. There are many ways of getting information into Matlab, generally straight into matrix format. Some of the simplest are as follows.

1. Type the numerical information in as described above.

2. If the information is available in a space delimited form with each row on a separate line, for example as a text file, copy the data, type a command such as

X = [

but do NOT terminate this by the enter key, then paste the data into the Matlab window and finally terminate with

]

using a semicolon if you do not want to see the data displayed again (useful if the original dataset is large such as a series of spectra).

3. Information can be saved as mat files (see Section A.5.3.1) and these can be imported into Matlab. Many public domain chemometrics datasets are stored in this format.

In addition, there are a huge number of tools for translating from a variety of common formats, such as Excel, and the interested reader should refer to the relevant source manuals where appropriate.
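As a rough sketch of the third route, a matrix can be written to a mat file and read back. The file name is generated with tempname purely for illustration, and the matrix values are assumed; when load is given an output argument it returns a structure with one field per stored variable.

```matlab
X = [2 7 8; 0 1 6];          % assumed data matrix
fname = [tempname '.mat'];   % hypothetical temporary file name

save(fname, 'X');            % store the variable X in the file
data = load(fname);          % structure with one field per stored variable
Xloaded = data.X;            % recover the matrix

delete(fname);               % remove the temporary file
```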


A.5.6 Introduction to Programming and Structure

For the enthusiasts it is possible to write quite elaborate programs and develop very professional looking m files. The beginner is advised to have a basic idea of a few of the main features of Matlab as a programming environment.

First and foremost is the ability to make comments (statements that are not executed), by starting a line with the % sign. Anything after this is simply ignored by Matlab but helps make large m files comprehensible.

Loops commence with the for statement, which has a variety of different syntaxes, the simplest being for i = begin : end which increments the variable i from the number begin (which must be a scalar) to end. An increment (which can be negative and does not need to be an integer) can be specified using the syntax for i = begin : inc : end; notice how, unlike many programming languages, this is the middle value of the three variables. Loops finish with the end statement. As an example, the operation of mean centring (Section A.5.4.4) is written in the form of a loop; see Figure A.41. The interested reader should be able to interpret the commands using the information given above. Obviously for this small operation a loop is not strictly necessary, but for more elaborate programs it is important to be able to use loops, and there is a lot of flexibility about addressing matrices which make this facility very useful.
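The loop in the figure is not reproduced here, but a loop of this kind might look like the sketch below, which assumes illustrative values for W. The second loop shows a negative, non-integer increment placed as the middle value.

```matlab
% mean centre the columns of W using loops
W = [2 7 8; 0 1 6];              % assumed example matrix
[nRows, nCols] = size(W);
V = zeros(nRows, nCols);         % matrix to hold the result
for j = 1 : nCols
    colMean = mean(W(:, j));     % mean of column j
    for i = 1 : nRows
        V(i, j) = W(i, j) - colMean;
    end
end

% count down from 1 to 0 in steps of 0.25
steps = [];
for x = 1 : -0.25 : 0
    steps = [steps x];           % append the current value
end
```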

If and while facilities are also useful to the programmer.

Figure A.41

A simple loop used for mean centring


Many programmers like to organise their work into functions. In this introductory text we will not delve too far into this, but a library of m files that consist of different functions can be easily set up. In order to illustrate this, we demonstrate a simple function called twoav that takes a matrix, calculates the average of each column and produces a vector consisting of two times the column averages. The function is stored in an m file called twoav in the current working directory. This is illustrated in Figure A.42. Note that the m file must start with the function statement, and the name of the function should correspond to the name of the m file. The arguments (in this case a matrix which is called p within the function and can be called anything in the main program, W in this example) are placed in brackets after the function name. The array o contains the result of the expression that is passed back.

Figure A.42

A simple function and its result
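A function of this kind might be written as in the sketch below, which is not a copy of the figure. The names twoav, p and o follow the text; the use of mean(p, 1), which forces column averages even for a matrix with a single row, is an assumption.

```matlab
function o = twoav(p)
% TWOAV  Two times the column averages of the matrix p.
%   Save this text as twoav.m in the current working directory.
o = 2 * mean(p, 1);   % the dimension argument 1 selects column averages
end
```

Typing twoav(W) in the command window with, for example, W = [2 7 8; 0 1 6] should return the row vector 2 8 14.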

A.5.7 Graphics

There are a large number of different types of graph available in Matlab. Below we discuss a few methods that can be used to produce diagrams of the type employed in this text. The enthusiast will soon discover further approaches. Matlab is a very powerful tool for data visualisation.

A.5.7.1 Creating Figures

There are several ways to create new graphs. The simplest is by a plotting command as discussed in the next sections. A new window consisting of a figure is created. Unless indicated otherwise, each time a graphics command is executed, the graph in the figure window is overwritten.

In order to organise the figures better, it is preferable to use the figure command. Each time this is typed in the Matlab command window, a new blank figure as illustrated in Figure A.43 is produced, so typing this three times in succession results in three blank figures, each of which is able to contain a graph. The figures are automatically numbered from 1 onwards. In order to return to the second figure (number 2), simply type figure(2). All plotting commands apply to the currently open figure. If you wish to produce a graph in the most recently opened window, it is not necessary to specify a number. Therefore, if you were to type the figure command three times, unless specified otherwise the current graph will be displayed in Figure 3. The figures can be accessed either as small icons or through the Window menu item. It is possible to skip figure numbers, so the command figure(10) will create a figure number 10, even if no other figures have been created.

If you want to produce several small graphs on one figure, use the subplot command. This has the syntax subplot(n,m,i). It divides the figure into n × m small graphs and puts the current plot into the ith position, where the first row is numbered from 1 to m, the second from m + 1 to 2m, and so on. Figure A.44 illustrates the case where the commands subplot(2,2,1) and subplot(2,2,3) have been used to divide the window into a 2 × 2 grid, capable of holding up to four graphs, and figures have been inserted into positions 1 (top left) and 3 (bottom left). Further figures can be inserted into the grid in the vacant positions, or the current figures can be replaced and overwritten.
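The arrangement described above, with graphs in positions 1 and 3 of a 2 × 2 grid, can be sketched as follows. The data are random numbers used only as a placeholder, and the figure number 10 is chosen to show that numbers may be skipped.

```matlab
X = rand(20, 3);        % hypothetical data
fig = figure(10);       % figure number 10, even if no others exist

subplot(2, 2, 1);       % position 1: top left
plot(X(:, 1));

subplot(2, 2, 3);       % position 3: bottom left
plot(X(:, 2));
```

Positions 2 and 4 remain vacant and can be filled later with further subplot commands.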

New figures can also be created using the File menu, and the New option, but it is not so easy to control the names and so probably best to use the figure command.

Once the figure is complete you can copy it using the Copy Figure menu item and then place it in documents. In this section we will illustrate the figures by screen snapshots showing the grey background of the Matlab screen. Alternatively, the figures can be saved in Matlab format, using the menu item under the current directory, as a fig file, which can then be opened and edited in Matlab in the future.

A.5.7.2 Line Graphs

The simplest type of graph is a line graph. If Y is a vector then plot(Y) will simply produce a graph of each element against element number. Often we want to plot a row or column of a matrix against element number, for example if each successive point corresponds to a point in time or a spectral wavelength. This is easy to do: the command plot(X(:,2)) plots the second column of X. Plotting a subset is also possible, for example plot(X(11:20,2)) produces a graph of rows 11–20, in practice allowing an expansion of the region of interest.

Figure A.43

Blank figure window

Once you have produced a line graph it is possible to change its appearance. This is easily done by first clicking the arrow tool in the graphics window, which allows editing of the properties, and then clicking on either the line to change the appearance of the data, or the axes. One useful facility is to make the lines thicker: the default line width of 0.5 is often too thin when intended for publication (although it is a good size for displaying on a screen), and it is recommended to increase this to around 2. In addition, one sometimes wishes to mark the points, using the marker facility. The result is presented in Figure A.45. If you do not wish to join up the points with a line you can select a line style ‘none’. The appearance of the axes can also be altered. There are various commands to change the nature of these plots, and you are recommended to use the Matlab help facility for further information.


Figure A.44

Use of multiple plot facility

It is possible to superimpose several line graphs, for example if X is a matrix with five columns, then the command plot(X) will superimpose five graphs in one picture.

Note that you can further refine the appearance of the plot using the tools to create labels, extra lines and arrows.
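The same changes can also be made from the command line rather than with the arrow tool. The sketch below uses random numbers as placeholder data, and the axis labels are hypothetical; LineWidth and Marker are standard line properties.

```matlab
X = rand(30, 5);                         % hypothetical data with five columns

figure;
h = plot(X(11:20, 2));                   % rows 11-20 of the second column
set(h, 'LineWidth', 2, 'Marker', 'o');   % thicker line with the points marked

figure;
hAll = plot(X);                          % five superimposed lines, one per column
xlabel('Row number');                    % hypothetical axis labels
ylabel('Value');
```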

A.5.7.3 Two Variables Against Each Other

The plot command can also be used to plot two variables against each other. It is common to plot columns of matrices against each other, for example when producing a PC plot of the scores of one PC against another. The command plot(X(:,2), X(:,3)) produces a graph of the third column of X against the second column. If you do not want to join the points up with a line you can either use the graphics editor as in Section A.5.7.2, or else the scatter command, which has a similar syntax but by default simply presents each point as a symbol. This is illustrated in Figure A.46.
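Both commands can be compared in a short sketch. The matrix X holds random numbers standing in for a scores matrix, which is an assumption, as are the axis labels.

```matlab
X = rand(20, 3);                        % hypothetical scores matrix

figure;
hLine = plot(X(:, 2), X(:, 3));         % third column against second, joined by a line

figure;
hPoints = scatter(X(:, 2), X(:, 3));    % the same points drawn as symbols only
xlabel('Column 2');                     % hypothetical axis labels
ylabel('Column 3');
```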

A.5.7.4 Labelling Points

Points in a graph can be labelled using the text command. The basic syntax is text(A,B,name), where A and B are arrays with the same number of elements, and it is recommended that name is an array of names or characters likewise with the


Figure A.45

Changing the properties of a graph in Matlab
