MULTIVARIATE ANALYSIS OF ECOLOGICAL DATA

matches is a + d, and mismatches b + c. The two measures of distance/dissimilarity considered so far are thus defined as:

Matching coefficient dissimilarity:

\[ \frac{b + c}{a + b + c + d} = 1 - \frac{a + d}{a + b + c + d} \qquad (5.5) \]

Jaccard index dissimilarity:

\[ \frac{b + c}{a + b + c} = 1 - \frac{a}{a + b + c} \qquad (5.6) \]
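As a quick numerical check of (5.5) and (5.6), the following sketch counts a, b, c and d for two presence/absence vectors and evaluates both dissimilarities. The function names and the two example vectors are illustrative assumptions, not taken from the text.

```python
def match_counts(x, y):
    """Count co-presences (a), mismatches (b, c) and co-absences (d)."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d


def matching_dissimilarity(x, y):
    """Equation (5.5): mismatches over all variables."""
    a, b, c, d = match_counts(x, y)
    return (b + c) / (a + b + c + d)


def jaccard_dissimilarity(x, y):
    """Equation (5.6): mismatches over variables that are not co-absent.

    Undefined (division by zero) if both samples are empty everywhere.
    """
    a, b, c, _ = match_counts(x, y)
    return (b + c) / (a + b + c)


# Hypothetical presence/absence data for two samples over 10 species
sample_1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
sample_2 = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

print(match_counts(sample_1, sample_2))            # (a, b, c, d)
print(matching_dissimilarity(sample_1, sample_2))
print(jaccard_dissimilarity(sample_1, sample_2))
```

With these made-up vectors the four co-absences enlarge the denominator of the matching coefficient only, so the Jaccard dissimilarity comes out larger than the matching coefficient dissimilarity.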

 

 

To give one final example, the correlation coefficient can be used to measure the similarity between two vectors of dichotomous data, and can be shown to be equal to:

\[ r = \frac{ad - bc}{\sqrt{(a + b)(c + d)(a + c)(b + d)}} \qquad (5.7) \]

Hence, a dissimilarity can be defined as 1 − r. Since 1 − r ranges from 0 (when b = c = 0, no mismatches) to 2 (when a = d = 0, no matches), a convenient measure between 0 and 1 is ½(1 − r).

Distances for mixed-scale data

When a data set contains different types of variables and it is required to measure inter-sample distance, we are faced with another problem of standardization: how can we balance the contributions of these different types of variables in an equitable way? We will demonstrate two alternative ways of doing this. The following is an example of mixed data (shown here are the data for four stations out of a set of 33):

STATION   CONTINUOUS VARIABLES               DISCRETE VARIABLES
          Depth   Temperature   Salinity     Region   Substrate
s3        30      3.15          33.52        Ta       Si/St
s8        29      3.15          33.52        Ta       Cl/Gr
s25       30      3.00          33.45        Sk       Cl/Sa
·         ·       ·             ·            ·        ·
s84       66      3.22          33.48        St       Cl

 

 

 

 

 

 

 

Apart from the three continuous variables (depth, temperature and salinity), there are the categorical variables of region (Tarehola, Skognes, Njosken and Storura) and substrate character (which can be any selection of clay, silt, sand, gravel and stone). The fact that more than one substrate category can be selected implies that each category is a separate dichotomous variable, so that substrate consists of five different variables.

The first way of standardizing the continuous against the discrete variables is called Gower's generalized coefficient of dissimilarity. First we express the discrete variables as dummies and calculate the means and standard deviations of all variables in the usual way:

STATION   CONTINUOUS VARIABLES              SAMPLED REGION                            SUBSTRATE CHARACTER
          Depth   Temperature   Salinity    Tarehola  Skognes  Njosken  Storura       Clay    Silt    Sand    Gravel  Stone
s3        30      3.15          33.52       1         0        0        0             0       1       0       0       1
s8        29      3.15          33.52       1         0        0        0             1       0       0       1       0
s25       30      3.00          33.45       0         1        0        0             1       0       1       0       0
·         ·       ·             ·           ·         ·        ·        ·             ·       ·       ·       ·       ·
s84       66      3.22          33.48       0         0        0        1             1       0       0       0       0

mean      58.15   3.086         33.50       0.242     0.273    0.242    0.242         0.606   0.152   0.364   0.182   0.061
sd        32.45   0.100         0.076       0.435     0.452    0.435    0.435         0.496   0.364   0.489   0.392   0.242

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Notice that dichotomous variables (such as the substrate categories) are coded as a single dummy variable, not two, while polychotomous variables such as region are split into as many dummies as there are categories. The next step is to standardize each variable and multiply all the columns corresponding to dummy variables by 1/√2 = 0.7071, a factor which compensates for their higher variance due to the 0/1 coding:

STATION   CONTINUOUS VARIABLES              SAMPLED REGION                            SUBSTRATE CHARACTER
          Depth    Temperature  Salinity    Tarehola  Skognes  Njosken  Storura       Clay     Silt     Sand     Gravel   Stone
s3        −0.868   0.615        0.260       1.231     −0.426   −0.394   −0.394        −0.864   1.648    −0.526   −0.328   2.741
s8        −0.898   0.615        0.260       1.231     −0.426   −0.394   −0.394        0.561    −0.294   −0.526   1.477    −0.177
s25       −0.868   −0.854       −0.676      −0.394    1.137    −0.394   −0.394        0.561    −0.294   0.921    −0.328   −0.177
·         ·        ·            ·           ·         ·        ·        ·             ·        ·        ·        ·        ·
s84       0.242    1.294        −0.294      −0.394    −0.426   −0.394   1.231         0.561    −0.294   −0.526   −0.328   −0.177
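The standardization and rescaling step can be sketched in a few lines of Python. This is a minimal illustration, assuming the rounded means and standard deviations printed in the table of dummy variables; the function name `standardize_mixed` and the choice of four columns for station s3 are hypothetical.

```python
import math

DUMMY_FACTOR = 1 / math.sqrt(2)  # 0.7071, applied to 0/1 columns only


def standardize_mixed(values, means, sds, is_dummy):
    """Standardize each value; rescale dummy-variable columns by 1/sqrt(2)."""
    result = []
    for value, mean, sd, dummy in zip(values, means, sds, is_dummy):
        z = (value - mean) / sd
        result.append(z * DUMMY_FACTOR if dummy else z)
    return result


# Station s3: depth, Tarehola dummy, clay dummy, stone dummy,
# with the (rounded) means and standard deviations from the text.
values = [30, 1, 0, 1]
means = [58.15, 0.242, 0.606, 0.061]
sds = [32.45, 0.435, 0.496, 0.242]
is_dummy = [False, True, True, True]

for z in standardize_mixed(values, means, sds, is_dummy):
    print(round(z, 2))
```

The results should agree with the corresponding s3 entries (−0.868, 1.231, −0.864 and 2.741) up to small differences caused by working from rounded means and standard deviations.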

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Now distances are calculated between the stations using either the L1 (city-block) or L2 (Euclidean) metric. For example, using the L1 metric and dividing the sum of absolute differences by the total number of variables (12 in this example), the distances between the above four stations are given in the left-hand table of Exhibit 5.7. Because the L1 distance decomposes into parts for each variable, we can show the part of the distance due to the categorical variables, and the part due to the continuous variables. In this example the categorical variables are contributing more to the differences between the stations. The differences in the continuous variables are actually small if one looks at the original data, except for the distance between s84 and s25, where there is a bigger difference in the continuous variables, which then contribute almost the same (0.303) as the categorical ones (0.386).

Exhibit 5.7: Distances between four stations based on the L1 distance between their standardized and rescaled values, as described above. The distances are shown equal to the part due to the categorical (CAT.) variables plus the part due to the continuous (CONT.) variables

Total distance
        s3      s8      · · ·   s25
s8      0.677
s25     1.110   0.740
·       ·       ·       · · ·
s84     0.990   0.619   · · ·   0.689

=

Distance CAT. variables
        s3      s8      · · ·   s25
s8      0.674
s25     0.910   0.537
·       ·       ·       · · ·
s84     0.795   0.421   · · ·   0.386

+

Distance CONT. variables
        s3      s8      · · ·   s25
s8      0.003
s25     0.200   0.203
·       ·       ·       · · ·
s84     0.195   0.198   · · ·   0.303
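The decomposition in Exhibit 5.7 can be reproduced for one pair of stations with a short sketch. It assumes the standardized and rescaled rows for s3 and s8 as printed in the text; the function name `l1_parts` is an illustrative choice.

```python
def l1_parts(x, y, is_categorical):
    """Average absolute difference, split into categorical and continuous parts."""
    n = len(x)  # total number of variables (12 in the example)
    cat = sum(abs(a - b) for a, b, c in zip(x, y, is_categorical) if c) / n
    cont = sum(abs(a - b) for a, b, c in zip(x, y, is_categorical) if not c) / n
    return cat + cont, cat, cont


# Standardized and rescaled rows for s3 and s8 (12 variables each):
# depth, temperature, salinity, 4 region dummies, 5 substrate dummies.
s3 = [-0.868, 0.615, 0.260, 1.231, -0.426, -0.394, -0.394,
      -0.864, 1.648, -0.526, -0.328, 2.741]
s8 = [-0.898, 0.615, 0.260, 1.231, -0.426, -0.394, -0.394,
      0.561, -0.294, -0.526, 1.477, -0.177]
is_categorical = [False] * 3 + [True] * 9

total, cat, cont = l1_parts(s3, s8, is_categorical)
print(total, cat, cont)
```

The three values should be close to the s3/s8 entries of Exhibit 5.7 (0.677, 0.674 and 0.003). The continuous part is tiny because the two stations differ only by one metre of depth.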

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Exhibit 5.7 suggests the alternative way of combining different types of variables: first compute the distances which are the most appropriate for each set and then add them to one another. For example, suppose there are three types of data, a set of continuous variables, a set of categorical variables and a set of percentages or counts. Then compute the distance or dissimilarity matrices D1, D2 and D3 appropriate to each set of same-scale variables, and then combine these in a weighted average:

\[ D = \frac{w_1 D_1 + w_2 D_2 + w_3 D_3}{w_1 + w_2 + w_3} \qquad (5.8) \]

 

 

Weights are a subjective but convenient inclusion, not only to account for the different scales in the distance matrices but also because there might be substantive reasons for down-weighting the distances for one set of variables, which might not be so important, or might suffer from high measurement error, for example. A default weighting system could be to make the variance of the distances the same in each matrix: w_k = 1/s_k, where s_k is the standard deviation of the distances in matrix D_k.
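A minimal sketch of the weighted average (5.8) with the default weights w_k = 1/s_k follows. It makes two assumptions that the text does not specify: the standard deviation is taken over the distances above the diagonal, and the sample (n − 1) form is used. The function names and the two small matrices are hypothetical.

```python
import statistics


def upper_triangle(matrix):
    """Distances above the diagonal of a square symmetric matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def combine_distances(matrices, weights=None):
    """Weighted average of distance matrices, as in equation (5.8).

    If no weights are given, use w_k = 1/s_k, where s_k is the standard
    deviation of the distances in matrix k (an assumed default).
    """
    if weights is None:
        weights = [1 / statistics.stdev(upper_triangle(m)) for m in matrices]
    total_weight = sum(weights)
    n = len(matrices[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, matrices)) / total_weight
             for j in range(n)] for i in range(n)]


# Hypothetical example: D2 holds the same pattern as D1 on a 10-fold scale
D1 = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
D2 = [[0, 10, 20], [10, 0, 30], [20, 30, 0]]

D_default = combine_distances([D1, D2])        # weights 1/s_k
D_equal = combine_distances([D1, D2], [1, 1])  # plain average
print(D_default[0][1], D_equal[0][1])
```

With equal weights the larger-scale matrix D2 dominates the result, whereas the 1/s_k weights let both matrices contribute equally, which is the purpose of the default described above.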

A third possible way to cope with mixed-scale data such as these would be to fuzzy-code the continuous variables, as described in Chapter 3, and then apply a measure of dissimilarity appropriate to categorical data, with possible standardization as also discussed in Chapter 3. We shall make full use of this option in subsequent chapters and the two final case studies.

SUMMARY: Measures of distance between samples: non-Euclidean

1. The sum of absolute differences, or L1 distance (or city-block distance), is an alternative to the Euclidean distance: an advantage of this distance is that it decomposes into contributions made by each variable (for the L2 Euclidean distance, we would need to decompose the squared distance).

2. A well-defined distance function obeys the triangle inequality, but there are several justifiable measures of difference between samples that do not have this property: to distinguish these from true distances we often refer to them as dissimilarities.

3. The Bray-Curtis dissimilarity is frequently used by ecologists to quantify differences between samples based on abundance or count data. This measure is usually applied to raw abundance data, but can be applied to relative abundances just like the chi-square distance, in which case it is equivalent to the L1, or city-block, distance. The chi-square distance can also be applied to the original abundances to include overall size differences in the distance measure.

4. A dissimilarity measure for presence/absence data is based on the Jaccard index, where co-absences are eliminated from the calculation; otherwise the measure resembles the matching coefficient.

5. Distances based on mixed-scale data can be computed after a process of standardization of all variables, using the L1 or L2 distances. Alternatively, distance matrices can be calculated for each set of same-scale variables and then these matrices can be linearly combined, optionally with user-defined weights.

CHAPTER 6

Measures of Distance and Correlation between Variables

In Chapters 4 and 5 we concentrated on distances between samples of a data matrix, which are usually the rows. We now turn our attention to the variables, usually the columns, and we can consider measures of distance and dissimilarity between these column vectors. More often, however, we measure the similarity between variables: this can be in the form of correlation coefficients or other measures of association. In this chapter we shall look at the geometric properties of variables, and various measures of correlation between them. In particular, we shall look at the geometric concept called a scalar product, which is highly related to the concept of Euclidean distance. The decision about which type of correlation function to use depends on the measurement scales of the variables, as we already saw briefly in Chapter 1. Finally, we also consider statistical tests of correlation, introducing the idea of permutation testing.

Contents

The geometry of variables
Correlation coefficient as an angle cosine
Correlation coefficient as a scalar product
Distances based on correlation coefficients
Distances between count variables
Distances between categorical variables and between categories
Distances between categories
Testing correlations: an introduction to permutation testing
SUMMARY: Measures of distance and correlation between variables

The geometry of variables

In Exhibits 4.3 and 5.5 in the previous chapters we have been encouraging the notion of samples being points in a multidimensional space. Even though we cannot draw points in more than three dimensions, we can easily extend the mathematical definitions of distance to samples for which we have J measurements, for any J. Now, rather than considering the samples, the rows of the data matrix, we turn our attention to the variables (the columns of the data matrix) and their sets of observed values across the I samples. To be able to visualize two variables in I-dimensional space, we choose I = 3, since more than 3 is impossible to display or imagine. Exhibit 6.1(a) shows the variables depth and pollution according to the first three samples in Exhibit 1.1, with depth having values (i.e. coordinates) [72 75 59] and pollution [4.8 2.8 5.4]. Notice that it is the samples that now form the axes of the space. The much lower values for pollution compared to those for depth cause the distance between the variables to be dominated by this scale effect. Standardizing overcomes this effect: Exhibit 6.1(b) shows standardized values with respect to the mean and standard deviation of this sample of size 3 (hence the values here do not coincide with the standardized values in the complete data set, given in Exhibit 4.4). Exhibit 6.1(c) shows the two variables plotted according to these standardized values.

Exhibit 6.1: (a) Two variables measured in three samples (sites in this case), viewed in three dimensions, using original scales; (b) Standardized values; (c) Same variables plotted in three dimensions using standardized values. Projections of some points onto the "floor" of the s2–s3 plane are shown, to assist in understanding the three-dimensional positions of the points

(a) [Figure: three-dimensional plot with axes s1, s2 and s3 on the original scales (0 to 100), showing the vectors Depth [72 75 59] and Pollution [4.8 2.8 5.4] from the origin O.]

(b)
SITE   Depth    Pollution
s1     0.392    0.343
s2     0.745    −1.126
s3     −1.137   0.784

(c) [Figure: three-dimensional plot with axes s1, s2 and s3 on the standardized scales, showing the vectors Depth [0.392 0.745 −1.137] and Pollution [0.343 −1.126 0.784] from the origin O.]

Correlation coefficient as an angle cosine

Exhibit 6.2 now shows the triangle formed by the two vectors in Exhibit 6.1(c) and the origin O, taken out of the three-dimensional space, and laid flat. From the coordinates of the points we can easily calculate the lengths of the three sides a, b and c of the triangle (where the sides a and b subtend the angle shown), so by using the cosine rule (c² = a² + b² − 2ab cos(θ), which we all learned at school) we can calculate the cosine of the angle between the vectors, which turns out to be −0.798, exactly the correlation between pollution and depth (the angle is 142.9º). Notice that this is the correlation calculated in this illustrative sample of size 3, not in the original sample of size 30, where the estimated correlation is −0.396.

Exhibit 6.2: Triangle of pollution and depth vectors with respect to origin (O) taken out of Exhibit 6.1(c) and laid flat

[Figure: triangle with vertices O, Depth and Pollution; the sides a and b meet at O, where they subtend the angle θ, and the side c joins Depth and Pollution.]

Hence we have illustrated the result that the cosine of the angle between two standardized variables, plotted as vectors in the space of dimensionality I, the number of samples, is their correlation coefficient.
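The cosine-rule calculation can be checked numerically with the following sketch, which uses only the standardized values from Exhibit 6.1(b). The variable and function names are illustrative assumptions.

```python
import math

# Standardized values from Exhibit 6.1(b); the samples s1, s2, s3 are the axes
depth = [0.392, 0.745, -1.137]
pollution = [0.343, -1.126, 0.784]


def length(v):
    """Euclidean length of a vector from the origin."""
    return math.sqrt(sum(x * x for x in v))


a = length(depth)
b = length(pollution)
c = length([x - y for x, y in zip(depth, pollution)])  # side opposite the angle

# Cosine rule: c^2 = a^2 + b^2 - 2ab cos(theta), solved for cos(theta)
cos_theta = (a**2 + b**2 - c**2) / (2 * a * b)
theta_degrees = math.degrees(math.acos(cos_theta))

print(cos_theta, theta_degrees)
```

Up to rounding of the three-decimal inputs, the cosine and the angle should match the values of −0.798 and 142.9º quoted for this sample of size 3.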

Correlation coefficient as a scalar product

But there is yet another way of interpreting the correlation coefficient geometrically. First we have to convert the standardized pollution and depth values to so-called unit variables. At present they are standardized to have variance 1, but a unit variable has sum of squares equal to 1; in other words, its length is 1. Since the variance of I centred values is defined as 1/(I − 1) times their sum of squares, it follows that the sum of squares equals (I − 1) times the variance. By dividing the standardized values of pollution and depth in Exhibit 6.1(b) by √(I − 1), equal to √2 in this example, the standardized variables are converted to unit variables:

SITE   Depth    Pollution
s1     0.277    0.242
s2     0.527    −0.796
s3     −0.804   0.554

It can be checked that 0.277² + 0.527² + (−0.804)² = 0.242² + (−0.796)² + 0.554² = 1.

The correlation coefficient then has the alternative definition as the sum of the products of the elements of the unit variables:

(0.242 × 0.277) + (−0.796 × 0.527) + (0.554 × (−0.804)) = −0.798

i.e., the scalar product:

\[ r_{jj'} = \sum_{i=1}^{I} x_{ij} \, x_{ij'} \qquad (6.1) \]

where x_{ij} are the values of the unit variables.
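The conversion to unit variables and the scalar product (6.1) can be sketched as follows, starting from the standardized values in Exhibit 6.1(b). The helper `to_unit` and the variable names are hypothetical.

```python
import math

depth_std = [0.392, 0.745, -1.137]     # standardized, Exhibit 6.1(b)
pollution_std = [0.343, -1.126, 0.784]
n_samples = len(depth_std)             # I = 3 in this example


def to_unit(v, n):
    """Rescale standardized values (variance 1) to unit length."""
    return [x / math.sqrt(n - 1) for x in v]


depth_unit = to_unit(depth_std, n_samples)
pollution_unit = to_unit(pollution_std, n_samples)

# Sum of squares of a unit variable should be 1 (up to input rounding)
sum_sq_depth = sum(x * x for x in depth_unit)

# Equation (6.1): correlation as the scalar product of the unit variables
r = sum(x * y for x, y in zip(depth_unit, pollution_unit))

print([round(x, 3) for x in depth_unit], sum_sq_depth, r)
```

The rounded unit values reproduce the Depth column of the table of unit variables, and the scalar product agrees with the correlation of −0.798 to three decimals.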

The concept of a scalar product underlies many multivariate techniques which we shall introduce later. It is closely related to the operation of projection, which is crucial later when we project points in high-dimensional spaces onto lower-dimensional ones. As an illustration of this, consider Exhibit 6.3, which is the same as Exhibit 6.2 except that the sides a and b of the triangle are now shortened to length 1 as unit variables (the subtended angle is still the same). The projection of either variable onto the axis defined by the other one gives the exact value of the correlation coefficient.

Exhibit 6.3: Same triangle as in Exhibit 6.2, but with variables having unit length (i.e., unit variables). The projection of either variable onto the direction defined by the other variable vector will give the value of the correlation, cos(θ). (The origin O is the zero point, see Exhibit 6.1(c), and the scale is given by the unit length of the variables.)

[Figure: triangle with vertices O, Depth and Pollution; the two sides from O have length 1 and subtend the angle θ, and the projection of each variable onto the other is marked with the value of the correlation.]

When variables are plotted in their unit form as in Exhibit 6.3, the squared distance between the variable points is computed (again using the cosine rule) as 1 + 1 − 2 cos(θ) = 2 − 2r, where r is the correlation. In general, therefore, a distance d_{jj'} between variables j and j' can be defined in terms of their correlation coefficient r_{jj'} as follows:

\[ d_{jj'} = \sqrt{2 - 2 r_{jj'}} = \sqrt{2} \, \sqrt{1 - r_{jj'}} \qquad (6.2) \]

where d_{jj'} has a minimum of 0 when r = 1 (i.e., the two variables coincide), a maximum of 2 when r = −1 (i.e., the variables go in exactly opposite directions), and d_{jj'} = √2 when r = 0 (i.e., the two variables are uncorrelated and are at right angles to each other). For example, the distance between pollution and depth in Exhibit 6.3 is √2 √(1 − (−0.798)) = 1.896.
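Equation (6.2) is easy to evaluate directly. The sketch below checks the three special cases and the pollution/depth example; the function name is an assumption, and the clamp at zero is an added safeguard against floating-point rounding rather than part of the formula.

```python
import math


def correlation_distance(r):
    """Equation (6.2): distance between two unit variables with correlation r."""
    # max() guards against tiny negative values if r is marginally above 1
    return math.sqrt(max(0.0, 2 - 2 * r))


for r in (1.0, 0.0, -0.798, -1.0):
    print(r, round(correlation_distance(r), 3))
```

The four cases correspond to coincident variables, uncorrelated variables, the pollution/depth pair in Exhibit 6.3, and variables pointing in exactly opposite directions.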

Distances based on correlation coefficients

An inter-variable distance can also be defined in the same way for other types of correlation coefficients and measures of association that lie between −1 and 1, for example the (Spearman) rank correlation. This so-called nonparametric measure of correlation is the regular correlation coefficient applied to the ranks of the data. In the sample of size 3 in Exhibit 6.1(a) pollution and depth have the following ranks:

SITE   Depth   Pollution
s1     2       2
s2     3       1
s3     1       3

 

 

 

where, for example in the pollution column, the value 2.8 for site 2 is the lowest value, hence rank 1, then 4.8 is the next lowest value, hence rank 2, and 5.4 is the highest value, hence rank 3. The correlation between these two vectors is −1, since the ranks are indeed direct opposites; therefore, the distance between them based on the rank correlation is equal to 2, the maximum distance possible. Exhibit 6.4 shows the usual linear correlation coefficient, the Spearman rank correlation, and their associated distances, for the three variables based on their complete set of 30 sample values. This example confirms empirically that the results are more or less the same using ranks instead of the original values: that is, most of the correlation is in the ordering of the