Добавил:
Upload Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:
R in Action, Second Edition.pdf
Скачиваний:
540
Добавлен:
26.03.2016
Размер:
20.33 Mб
Скачать

492

CHAPTER 21 Creating a package

To create a set of examples and datasets that can be distributed to students in a classroom

To create a program (a set of interrelated functions) that can be used to solve a significant analytic problem (such as imputing missing values)

Creating a useful package is also a great way of introducing yourself to others and giving back to the R community. Packages can be shared directly or through online repositories such as CRAN and GitHub.

In this chapter, you’ll have an opportunity to develop a package from start to finish. By the end of the chapter, you’ll be able to build your own R packages (and enjoy the smug self-satisfaction and bragging rights that attend such a feat).

The package you’ll develop is called npar. It provides functions for nonparametric group comparisons. This is a set of analytic techniques that can be used to compare two or more groups on an outcome variable that’s not normally distributed, or in situations where the variance of the outcome variable differs markedly across groups. This is a common problem facing analysts.

Before continuing, install the package using the following code:

pkg <- "npar_1.0.tar.gz"

loc <- "http://www.statmethods.net/RiA" url <- paste(loc, pkg, sep="/") download.file(url, pkg)

install.packages(pkg, repos=NULL, type="source")

This downloads the package from the statmethods.net website and saves it in your current working directory. It then installs the package in your default R library.

In section 21.1, you’ll take the npar package for a test drive. Its features and functions are described and demonstrated. Then in section 22.2, you’ll build the package from scratch.

21.1 Nonparametric analysis and the npar package

Nonparametric methods are a data-analytic approach that is particularly useful when the assumptions of traditional parametric methods (such as normality and constant variance) can’t be met. Here, we’ll focus on methods for comparing two or more independent groups on a numeric outcome variable.

Consider the life dataset that comes with the npar package. It contains the healthy life expectancy (HLE), or the estimated number of years of healthy living remaining, at age 65, for each American state from 2007 to 2009. Estimates are reported separately for men (hlem) and women (hlef). The HLE data were obtained from a Centers for Disease Control and Prevention publication (http://mng.bz/HTGD).

The dataset also contains a variable named region, dividing the states into Northeast, North Central, South, and West. I added this variable from the state.region data frame included in the base R installation.

Suppose you wanted to know whether HLE estimates for women vary significantly by region. One approach would be to use a one-way analysis of variance (ANOVA) as

Nonparametric analysis and the npar package

493

 

 

Distribution of Healthy Life Expectancy for Women

 

12

 

 

 

 

 

 

 

10

 

 

 

 

 

 

 

8

 

 

 

 

 

 

Frequency

6

 

 

 

 

 

 

 

4

 

 

 

 

 

 

 

2

 

 

 

 

 

 

 

0

 

 

 

 

 

 

 

11

12

13

14

15

16

17

 

 

 

Healthy Life Expectancy (years) at Age 65

 

Figure 21.1 Distribution of healthy life expectancies at age 65 for women in the United States (2007– 2009). The scores are negatively skewed (fewer scores at the low end).

described in chapter 9. But ANOVA assumes that the outcome variable is normally distributed and has a constant variance across each of the four country regions. Let’s examine both assumptions.

The distribution of HLE scores for women can be visualized using a histogram:

library(npar)

hist(life$hlef, xlab="Healthy Life Expectancy (years) at Age 65", main="Distribution of Healthy Life Expectancy for Women", col="grey", breaks=10)

The plot is displayed in figure 21.1. Clearly the outcome variable is negatively skewed, with fewer scores at the low end.

The variance of HLE scores across regions can be visualized using a side-by-side dot chart (see chapter 19 for details):

library(ggplot2)

ggplot(data=life, aes(x=region, y=hlef)) + geom_point(size=3, color="darkgrey") + labs(title="Distribution of HLE Estimates by Region",

x="US Region", y="Healthy Life Expectancy at Age 65") + theme_bw()

The results are displayed in figure 21.2, where each dot represents a state. Variances differ by region, with the greatest differences occurring between the Northeast and South.

494

CHAPTER 21 Creating a package

Distribution of HLE Estimates by Region

Healthy Life Expectancy at Age 65

16

14

12

North Central

Northeast

South

West

US Region

Figure 21.2 Dot chart of healthy life expectancies by region. The variability of HLE estimates differs across the four regions (compare the Northeast with the South).

Because the data violates two important ANOVA assumptions (normality and homogeneity of variance), you need a different approach. Unlike ANOVA, nonparametric methods don’t assume normality or equal variances. In the current case, you would only need to assume that the data are ordinal—that higher scores indicate greater healthy life expectancy. This makes a nonparametric approach a reasonable alternative for the current problem.

21.1.1Comparing groups with the npar package

You can use the npar package to compare independent groups on a numeric outcome variable that is at least ordinal. Given a numerical dependent variable and a categorical grouping variable, it provides

An omnibus Kruskal–Wallis test that the groups don’t differ.

Descriptive statistics for each group.

Post-hoc comparisons (Wilcoxon rank-sum tests) comparing groups two at a time. The test p-values can be adjusted to take multiple testing into account.

Annotated side-by-side box plots for visualizing group differences.

The following listing demonstrates use of the npar package with the HLE estimates by region for women.

Listing 21.1 Comparison of HLE estimates with the npar package

>library(npar)

>results <- oneway(hlef ~ region, life)

>summary(results)

Nonparametric analysis and the npar package

 

 

495

data: hlef on region

Overall test of

b

 

group differences

 

 

Omnibus Test

 

 

 

 

Kruskal-Wallis chi-squared = 17.8749, df = 3, p-value = 0.0004668

 

 

 

 

 

 

Descriptive Statistics

 

 

 

 

 

South North Central

West Northeast

c

n

16.000

12.00

13.0000

9.000

median 13.000

15.40

15.6000

15.700

 

mad

1.483

1.26

0.7413

0.593

 

Multiple Comparisons (Wilcoxon Rank Sum Tests)

Probability Adjustment = holm

 

 

 

 

Group.1

Group.2

W

p

 

1

South

North Central 28.0

0.008583

**

2

South

West 27.0 0.004738 **

3

South

Northeast 17.0 0.008583 **

4

North Central

West 63.5

1.000000

 

5

North Central

Northeast 42.0

1.000000

 

6

West

Northeast 54.5 1.000000

 

---

 

 

 

 

Signif. codes:

0 '***' 0.001

'**'

0.01 '*'

0.05 '.' 0.1

Summary statistic

dPairwise group comparisons

' ' 1

> plot(results,

col="lightblue", main="Multiple

Comparisons",

e Annotated

xlab="US

Region",

 

ylab="Healthy Life Expectancy (years) at

Age 65")

box plots

 

First, a Kruskal–Wallis test is performed b. This is an overall test of whether there are HLE differences between the regions. The small p-value (.00005) suggests that there are.

Next, descriptive statistics (sample size, median, and median absolute deviation) are provided for each region c. The HLE estimates are highest for the Northeast (median = 15.7 years) and lowest for the South (median = 13.0 years). The smallest variability among the states occurs in the Northeast (mad = 0.59), and the largest occurs in the South (mad = 1.48).

Although the Kruskal–Wallis test indicates that there are HLE differences among the regions, it doesn’t indicate where the differences lie. To determine this, you compare the groups two at a time using a Wilcoxon rank-sum test d. With four groups, there are 4 × (4 – 1) / 2 or 6 pairwise comparisons.

The difference between the South and the North Central regions is statistically significant (p = 0.009), whereas the difference between the Northeast and North Central regions isn’t (p = 1.0). In fact, the South differs from each of the other regions, but the other regions don’t differ from each other.

When computing multiple comparisons, you have to be concerned with alpha inflation: an increase in the probability of declaring groups to be significantly different when in fact they aren’t. For six independent comparisons, the chances of finding at least one erroneous difference by chance is 1 – (1 – .05)6 or 0.26.

With a chance of finding at least one false pairwise difference hovering around one in four, you’ll want to adjust the p-value for each comparison upward (make each test more stringent and less likely to declare a difference). Doing so keeps the overall

Соседние файлы в предмете [НЕСОРТИРОВАННОЕ]