
[Figure 8.16 Influence plot (Studentized Residuals vs. Hat-Values; circle size is proportional to Cook's distance). States above +2 or below –2 on the vertical axis are considered outliers. States above 0.2 or 0.3 on the horizontal axis have high leverage (unusual combinations of predictor values). Circle size is proportional to influence. Observations depicted by large circles may have disproportionate influence on the parameter estimates of the model. Labeled points include Nevada, Alaska, Washington, California, New York, Hawaii, and Rhode Island.]

The straight line in each plot is the actual regression coefficient for that predictor variable. You can see the impact of influential observations by imagining how the line would change if the point representing that observation was deleted. For example, look at the graph of Murder | Others versus Income | Others in the lower-left corner. You can see that eliminating the point labeled Alaska would move the line in a negative direction. In fact, deleting Alaska changes the regression coefficient for Income from positive (.00006) to negative (–.00085).
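As a quick check of that claim, you can refit the model without Alaska and compare the Income coefficients directly. A minimal sketch, assuming the states data frame and the four-predictor model used throughout this chapter:

fit.all  <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
fit.noAK <- lm(Murder ~ Population + Illiteracy + Income + Frost,
               data=states[rownames(states) != "Alaska", ])
coef(fit.all)["Income"]     # positive, roughly .00006
coef(fit.noAK)["Income"]    # negative, roughly -.00085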

You can combine the information from outlier, leverage, and influence plots into one highly informative plot using the influencePlot() function from the car package:

library(car)

influencePlot(fit, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's distance")

The resulting plot (figure 8.16) shows that Nevada and Rhode Island are outliers; New York, California, Hawaii, and Washington have high leverage; and Nevada, Alaska, and Hawaii are influential observations.
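If you'd rather have the same information without an interactive plot, the three diagnostics can also be pulled out numerically. A sketch using the chapter's rules of thumb (|studentized residual| > 2, hat values above two or three times the average, Cook's distance above 4/(n-k-1)) and the fit object from above:

n <- nrow(states)
k <- length(coef(fit)) - 1                           # number of predictors
outliers  <- abs(rstudent(fit)) > 2                  # unusual response values
leverage  <- hatvalues(fit) > 2 * (k + 1) / n        # unusual predictor combinations
influence <- cooks.distance(fit) > 4 / (n - k - 1)   # disproportionate impact on the fit
rownames(states)[outliers | leverage | influence]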

8.5 Corrective measures

Having spent the last 20 pages learning about regression diagnostics, you may ask, “What do you do if you identify problems?” There are four approaches to dealing with violations of regression assumptions:

Deleting observations

Transforming variables

Adding or deleting variables

Using another regression approach

Let’s look at each in turn.

8.5.1 Deleting observations

Deleting outliers can often improve a dataset’s fit to the normality assumption. Influential observations are often deleted as well, because they have an inordinate impact on the results. The largest outlier or influential observation is deleted, and the model is refit. If there are still outliers or influential observations, the process is repeated until an acceptable fit is obtained.
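A sketch of that delete-and-refit cycle uses outlierTest() from the car package (Nevada, identified as an outlier in figure 8.16, is used here for illustration; whether to keep deleting is a judgment call, not an automatic rule):

library(car)
outlierTest(fit)                                    # Bonferroni test for the single largest outlier
states2 <- states[rownames(states) != "Nevada", ]   # drop the flagged observation
fit2 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states2)
outlierTest(fit2)                                   # re-check before removing anything else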

Again, I urge caution when considering the deletion of observations. Sometimes you can determine that the observation is an outlier because of data errors in recording, or because a protocol wasn’t followed, or because a test subject misunderstood instructions. In these cases, deleting the offending observation seems perfectly reasonable.

In other cases, the unusual observation may be the most interesting thing about the data you’ve collected. Uncovering why an observation differs from the rest can contribute great insight to the topic at hand and to other topics you might not have thought of. Some of our greatest advances have come from the serendipity of noticing that something doesn’t fit our preconceptions (pardon the hyperbole).

8.5.2 Transforming variables

When models don’t meet the normality, linearity, or homoscedasticity assumptions, transforming one or more variables can often improve or correct the situation. Transformations typically involve replacing a variable Y with Y^λ. Common values of λ and their interpretations are given in table 8.5. If Y is a proportion, a logit transformation [ln(Y/(1−Y))] is often used.

Table 8.5 Common transformations

λ                 -2       -1      -0.5     0         0.5     1       2
Transformation    1/Y^2    1/Y     1/√Y     log(Y)    √Y      None    Y^2
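These transformations are applied directly in the model formula or as new variables. A brief sketch using the states data (illustrative only, since the text goes on to show that no transformation is needed here); the logit line refers to a hypothetical proportion p not present in the states data:

fit.log <- lm(log(Murder) ~ Population + Illiteracy + Income + Frost, data=states)    # λ = 0
fit.inv <- lm(I(1/Murder) ~ Population + Illiteracy + Income + Frost, data=states)    # λ = -1
# For a proportion p strictly between 0 and 1 (hypothetical variable):
# logit.p <- log(p / (1 - p))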

When the model violates the normality assumption, you typically attempt a transformation of the response variable. You can use the powerTransform() function in the car package to generate a maximum-likelihood estimation of the power λ most likely to normalize the variable X^λ. In the next listing, this is applied to the states data.

Listing 8.10 Box–Cox transformation to normality

> library(car)
> summary(powerTransform(states$Murder))
bcPower Transformation to Normality

              Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
states$Murder       0.6     0.26            0.088              1.1

Likelihood ratio tests about transformation parameters
                      LRT df  pval
LR test, lambda=(0)   5.7  1 0.017
LR test, lambda=(1)   2.1  1 0.145

The results suggest that you can normalize the variable Murder by replacing it with Murder^0.6. Because 0.6 is close to 0.5, you could try a square-root transformation to improve the model’s fit to normality. In this case, though, the hypothesis that λ=1 can’t be rejected (p = 0.145), so there’s no strong evidence that a transformation is needed. This is consistent with the results of the Q-Q plot in figure 8.9.
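If you did want to pursue the square-root option, the refit and recheck would look something like this (a sketch only, given that no transformation is warranted):

library(car)
fit.sqrt <- lm(sqrt(Murder) ~ Population + Illiteracy + Income + Frost, data=states)
qqPlot(fit.sqrt, labels=row.names(states), simulate=TRUE,
       main="Q-Q Plot, sqrt(Murder)")               # compare with figure 8.9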

When the assumption of linearity is violated, a transformation of the predictor variables can often help. The boxTidwell() function in the car package can be used to generate maximum-likelihood estimates of predictor powers that can improve linearity. An example of applying the Box–Tidwell transformations to a model that predicts state murder rates from their population and illiteracy rates follows:

> library(car)
> boxTidwell(Murder ~ Population + Illiteracy, data=states)

           Score Statistic p-value MLE of lambda
Population           -0.32    0.75          0.87
Illiteracy            0.62    0.54          1.36

The results suggest trying the transformations Population^0.87 and Illiteracy^1.36 to achieve greater linearity. But the score tests for Population (p = .75) and Illiteracy (p = .54) suggest that neither variable needs to be transformed. Again, these results are consistent with the component plus residual plots in figure 8.11.
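Had the score tests been significant, the suggested powers could be folded back into the model with I(); a minimal sketch:

fit.bt <- lm(Murder ~ I(Population^0.87) + I(Illiteracy^1.36), data=states)
summary(fit.bt)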

Finally, transformations of the response variable can help in situations of heteroscedasticity (nonconstant error variance). You saw in listing 8.7 that the spreadLevelPlot() function in the car package offers a power transformation for improving homoscedasticity. Again, in the case of the states example, the constant error-variance assumption is met, and no transformation is necessary.
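For reference, that check amounts to calling spreadLevelPlot(fit) (along with the related ncvTest() score test in the car package) and looking for a suggested power transformation near 1:

library(car)
ncvTest(fit)             # score test of the constant error-variance hypothesis
spreadLevelPlot(fit)     # prints a suggested power transformation for the response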

A caution concerning transformations

There’s an old joke in statistics: if you can’t prove A, prove B and pretend it was A. (For statisticians, that’s pretty funny.) The relevance here is that if you transform your variables, your interpretations must be based on the transformed variables, not the original variables. If the transformation makes sense, such as the log of income or the inverse of distance, the interpretation is easier. But how do you interpret the relationship between the frequency of suicidal ideation and the cube root of depression? If a transformation doesn’t make sense, you should avoid it.
