
[Figure 8.16 Influence plot (Studentized Residuals vs. Hat-Values; circle size is proportional to Cook's distance). States above +2 or below –2 on the vertical axis are considered outliers. States above 0.2 or 0.3 on the horizontal axis have high leverage (unusual combinations of predictor values). Circle size is proportional to influence. Observations depicted by large circles may have disproportionate influence on the parameter estimates of the model. Labeled points include Nevada, Alaska, Washington, California, New York, Hawaii, and Rhode Island.]

The straight line in each plot is the actual regression coefficient for that predictor variable. You can see the impact of influential observations by imagining how the line would change if the point representing that observation was deleted. For example, look at the graph of Murder | Others versus Income | Others in the lower-left corner. You can see that eliminating the point labeled Alaska would move the line in a negative direction. In fact, deleting Alaska changes the regression coefficient for Income from positive (.00006) to negative (–.00085).
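As a quick check of that claim, you can refit the model without Alaska and compare the Income coefficients directly. A minimal sketch, assuming the states data frame and the four-predictor model used throughout this chapter:

fit.all  <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
fit.noAK <- lm(Murder ~ Population + Illiteracy + Income + Frost,
               data=states[rownames(states) != "Alaska", ])
coef(fit.all)["Income"]     # positive, roughly .00006
coef(fit.noAK)["Income"]    # negative, roughly -.00085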

You can combine the information from outlier, leverage, and influence plots into one highly informative plot using the influencePlot() function from the car package:

library(car)

influencePlot(fit, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's distance")

The resulting plot (figure 8.16) shows that Nevada and Rhode Island are outliers; New York, California, Hawaii, and Washington have high leverage; and Nevada, Alaska, and Hawaii are influential observations.
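If you'd rather have the same information without an interactive plot, the three diagnostics can also be pulled out numerically. A sketch using the chapter's rules of thumb (|studentized residual| > 2, hat values above two or three times the average, Cook's distance above 4/(n-k-1)) and the fit object from above:

n <- nrow(states)
k <- length(coef(fit)) - 1                           # number of predictors
outliers  <- abs(rstudent(fit)) > 2                  # unusual response values
leverage  <- hatvalues(fit) > 2 * (k + 1) / n        # unusual predictor combinations
influence <- cooks.distance(fit) > 4 / (n - k - 1)   # disproportionate impact on the fit
rownames(states)[outliers | leverage | influence]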

8.5 Corrective measures

Having spent the last 20 pages learning about regression diagnostics, you may ask, “What do you do if you identify problems?” There are four approaches to dealing with violations of regression assumptions:

Deleting observations

Transforming variables

Adding or deleting variables

Using another regression approach

Let’s look at each in turn.

8.5.1 Deleting observations

Deleting outliers can often improve a dataset’s fit to the normality assumption. Influential observations are often deleted as well, because they have an inordinate impact on the results. The largest outlier or influential observation is deleted, and the model is refit. If there are still outliers or influential observations, the process is repeated until an acceptable fit is obtained.
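A sketch of that delete-and-refit cycle uses outlierTest() from the car package (Nevada, identified as an outlier in figure 8.16, is used here for illustration; whether to keep deleting is a judgment call, not an automatic rule):

library(car)
outlierTest(fit)                                    # Bonferroni test for the single largest outlier
states2 <- states[rownames(states) != "Nevada", ]   # drop the flagged observation
fit2 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states2)
outlierTest(fit2)                                   # re-check before removing anything else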

Again, I urge caution when considering the deletion of observations. Sometimes you can determine that the observation is an outlier because of data errors in recording, or because a protocol wasn’t followed, or because a test subject misunderstood instructions. In these cases, deleting the offending observation seems perfectly reasonable.

In other cases, the unusual observation may be the most interesting thing about the data you’ve collected. Uncovering why an observation differs from the rest can contribute great insight to the topic at hand and to other topics you might not have thought of. Some of our greatest advances have come from the serendipity of noticing that something doesn’t fit our preconceptions (pardon the hyperbole).

8.5.2 Transforming variables

When models don’t meet the normality, linearity, or homoscedasticity assumptions, transforming one or more variables can often improve or correct the situation. Transformations typically involve replacing a variable Y with Y^λ. Common values of λ and their interpretations are given in table 8.5. If Y is a proportion, a logit transformation [ln(Y/(1−Y))] is often used.

Table 8.5 Common transformations

λ                 -2       -1      -0.5     0         0.5     1       2
Transformation    1/Y^2    1/Y     1/√Y     log(Y)    √Y      None    Y^2
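These transformations are applied directly in the model formula or as new variables. A brief sketch using the states data (illustrative only, since the text goes on to show that no transformation is needed here); the logit line refers to a hypothetical proportion p not present in the states data:

fit.log <- lm(log(Murder) ~ Population + Illiteracy + Income + Frost, data=states)    # λ = 0
fit.inv <- lm(I(1/Murder) ~ Population + Illiteracy + Income + Frost, data=states)    # λ = -1
# For a proportion p strictly between 0 and 1 (hypothetical variable):
# logit.p <- log(p / (1 - p))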

When the model violates the normality assumption, you typically attempt a transformation of the response variable. You can use the powerTransform() function in the car package to generate a maximum-likelihood estimation of the power λ most likely to normalize the variable X^λ. In the next listing, this is applied to the states data.

Listing 8.10 Box–Cox transformation to normality

> library(car)
> summary(powerTransform(states$Murder))
bcPower Transformation to Normality

              Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
states$Murder       0.6     0.26            0.088              1.1

Likelihood ratio tests about transformation parameters
                      LRT df  pval
LR test, lambda=(0)   5.7  1 0.017
LR test, lambda=(1)   2.1  1 0.145

The results suggest that you can normalize the variable Murder by replacing it with Murder^0.6. Because 0.6 is close to 0.5, you could try a square-root transformation to improve the model’s fit to normality. In this case, though, the hypothesis that λ=1 can’t be rejected (p = 0.145), so there’s no strong evidence that a transformation is needed. This is consistent with the results of the Q-Q plot in figure 8.9.
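If you did want to pursue the square-root option, the refit and recheck would look something like this (a sketch only, given that no transformation is warranted):

library(car)
fit.sqrt <- lm(sqrt(Murder) ~ Population + Illiteracy + Income + Frost, data=states)
qqPlot(fit.sqrt, labels=row.names(states), simulate=TRUE,
       main="Q-Q Plot, sqrt(Murder)")               # compare with figure 8.9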

When the assumption of linearity is violated, a transformation of the predictor variables can often help. The boxTidwell() function in the car package can be used to generate maximum-likelihood estimates of predictor powers that can improve linearity. An example of applying the Box–Tidwell transformations to a model that predicts state murder rates from their population and illiteracy rates follows:

> library(car)
> boxTidwell(Murder ~ Population + Illiteracy, data=states)

           Score Statistic p-value MLE of lambda
Population           -0.32    0.75          0.87
Illiteracy            0.62    0.54          1.36

The results suggest trying the transformations Population^0.87 and Illiteracy^1.36 to achieve greater linearity. But the score tests for Population (p = .75) and Illiteracy (p = .54) suggest that neither variable needs to be transformed. Again, these results are consistent with the component plus residual plots in figure 8.11.
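Had the score tests been significant, the suggested powers could be folded back into the model with I(); a minimal sketch:

fit.bt <- lm(Murder ~ I(Population^0.87) + I(Illiteracy^1.36), data=states)
summary(fit.bt)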

Finally, transformations of the response variable can help in situations of heteroscedasticity (nonconstant error variance). You saw in listing 8.7 that the spreadLevelPlot() function in the car package offers a power transformation for improving homoscedasticity. Again, in the case of the states example, the constant error-variance assumption is met, and no transformation is necessary.
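For reference, that check amounts to calling spreadLevelPlot(fit) (along with the related ncvTest() score test in the car package) and looking for a suggested power transformation near 1:

library(car)
ncvTest(fit)             # score test of the constant error-variance hypothesis
spreadLevelPlot(fit)     # prints a suggested power transformation for the response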

A caution concerning transformations

There’s an old joke in statistics: if you can’t prove A, prove B and pretend it was A. (For statisticians, that’s pretty funny.) The relevance here is that if you transform your variables, your interpretations must be based on the transformed variables, not the original variables. If the transformation makes sense, such as the log of income or the inverse of distance, the interpretation is easier. But how do you interpret the relationship between the frequency of suicidal ideation and the cube root of depression? If a transformation doesn’t make sense, you should avoid it.
