Добавил:

Upload Опубликованный материал нарушает ваши авторские права? Сообщите нам.

Вуз:

Национальный исследовательский университет «Высшая школа экономики»

Предмет:

[НЕСОРТИРОВАННОЕ]

Файл:

R in Action, Second Edition.pdf

Скачиваний:

540

Добавлен:

26.03.2016

Размер:

20.33 Mб

Скачать

☆

<<< < Предыдущая 138 139 140 141 142 143 144 145 146 147 148 149150 / 173150 151 152 153 154 155 156 157 158 159 160 161 162 > Следующая >>>

492	CHAPTER 21 Creating a package

■To create a set of examples and datasets that can be distributed to students in a classroom

■To create a program (a set of interrelated functions) that can be used to solve a significant analytic problem (such as imputing missing values)

Creating a useful package is also a great way of introducing yourself to others and giving back to the R community. Packages can be shared directly or through online repositories such as CRAN and GitHub.

In this chapter, you’ll have an opportunity to develop a package from start to finish. By the end of the chapter, you’ll be able to build your own R packages (and enjoy the smug self-satisfaction and bragging rights that attend such a feat).

The package you’ll develop is called npar. It provides functions for nonparametric group comparisons. This is a set of analytic techniques that can be used to compare two or more groups on an outcome variable that’s not normally distributed, or in situations where the variance of the outcome variable differs markedly across groups. This is a common problem facing analysts.

Before continuing, install the package using the following code:

pkg <- "npar_1.0.tar.gz"

loc <- "http://www.statmethods.net/RiA" url <- paste(loc, pkg, sep="/") download.file(url, pkg)

install.packages(pkg, repos=NULL, type="source")

This downloads the package from the statmethods.net website and saves it in your current working directory. It then installs the package in your default R library.

In section 21.1, you’ll take the npar package for a test drive. Its features and functions are described and demonstrated. Then in section 22.2, you’ll build the package from scratch.

21.1 Nonparametric analysis and the npar package

Nonparametric methods are a data-analytic approach that is particularly useful when the assumptions of traditional parametric methods (such as normality and constant variance) can’t be met. Here, we’ll focus on methods for comparing two or more independent groups on a numeric outcome variable.

Consider the life dataset that comes with the npar package. It contains the healthy life expectancy (HLE), or the estimated number of years of healthy living remaining, at age 65, for each American state from 2007 to 2009. Estimates are reported separately for men (hlem) and women (hlef). The HLE data were obtained from a Centers for Disease Control and Prevention publication (http://mng.bz/HTGD).

The dataset also contains a variable named region, dividing the states into Northeast, North Central, South, and West. I added this variable from the state.region data frame included in the base R installation.

Suppose you wanted to know whether HLE estimates for women vary significantly by region. One approach would be to use a one-way analysis of variance (ANOVA) as

Nonparametric analysis and the npar package

493

		Distribution of Healthy Life Expectancy for Women
	12
	10
	8
Frequency	6
	4
	2
	0
	11	12	13	14	15	16	17
			Healthy Life Expectancy (years) at Age 65

Figure 21.1 Distribution of healthy life expectancies at age 65 for women in the United States (2007– 2009). The scores are negatively skewed (fewer scores at the low end).

described in chapter 9. But ANOVA assumes that the outcome variable is normally distributed and has a constant variance across each of the four country regions. Let’s examine both assumptions.

The distribution of HLE scores for women can be visualized using a histogram:

library(npar)

hist(life$hlef, xlab="Healthy Life Expectancy (years) at Age 65", main="Distribution of Healthy Life Expectancy for Women", col="grey", breaks=10)

The plot is displayed in figure 21.1. Clearly the outcome variable is negatively skewed, with fewer scores at the low end.

The variance of HLE scores across regions can be visualized using a side-by-side dot chart (see chapter 19 for details):

library(ggplot2)

ggplot(data=life, aes(x=region, y=hlef)) + geom_point(size=3, color="darkgrey") + labs(title="Distribution of HLE Estimates by Region",

x="US Region", y="Healthy Life Expectancy at Age 65") + theme_bw()

The results are displayed in figure 21.2, where each dot represents a state. Variances differ by region, with the greatest differences occurring between the Northeast and South.

494	CHAPTER 21 Creating a package

Distribution of HLE Estimates by Region

Healthy Life Expectancy at Age 65

North Central

Northeast

South

West

US Region

Figure 21.2 Dot chart of healthy life expectancies by region. The variability of HLE estimates differs across the four regions (compare the Northeast with the South).

Because the data violates two important ANOVA assumptions (normality and homogeneity of variance), you need a different approach. Unlike ANOVA, nonparametric methods don’t assume normality or equal variances. In the current case, you would only need to assume that the data are ordinal—that higher scores indicate greater healthy life expectancy. This makes a nonparametric approach a reasonable alternative for the current problem.

21.1.1Comparing groups with the npar package

You can use the npar package to compare independent groups on a numeric outcome variable that is at least ordinal. Given a numerical dependent variable and a categorical grouping variable, it provides

■An omnibus Kruskal–Wallis test that the groups don’t differ.

■Descriptive statistics for each group.

■Post-hoc comparisons (Wilcoxon rank-sum tests) comparing groups two at a time. The test p-values can be adjusted to take multiple testing into account.

■Annotated side-by-side box plots for visualizing group differences.

The following listing demonstrates use of the npar package with the HLE estimates by region for women.

Listing 21.1 Comparison of HLE estimates with the npar package

>library(npar)

>results <- oneway(hlef ~ region, life)

>summary(results)

Nonparametric analysis and the npar package		495
data: hlef on region	Overall test of	b
	group differences
Omnibus Test
Kruskal-Wallis chi-squared = 17.8749, df = 3, p-value = 0.0004668

Descriptive Statistics
	South North Central		West Northeast		c
n	16.000	12.00	13.0000	9.000	c
median 13.000		15.40	15.6000	15.700
mad	1.483	1.26	0.7413	0.593

Multiple Comparisons (Wilcoxon Rank Sum Tests)

Probability Adjustment = holm
	Group.1	Group.2	W	p
1	South	North Central 28.0		0.008583	**
2	South	West 27.0 0.004738 **
3	South	Northeast 17.0 0.008583 **
4	North Central	West 63.5		1.000000
5	North Central	Northeast 42.0		1.000000
6	West	Northeast 54.5 1.000000
---
Signif. codes:		0 '***' 0.001	'**'	0.01 '*'	0.05 '.' 0.1

Summary statistic

dPairwise group comparisons

' ' 1

> plot(results,	col="lightblue", main="Multiple	Comparisons",	e Annotated
xlab="US	Region",		e Annotated
ylab="Healthy Life Expectancy (years) at		Age 65")	box plots
ylab="Healthy Life Expectancy (years) at		Age 65")

First, a Kruskal–Wallis test is performed b. This is an overall test of whether there are HLE differences between the regions. The small p-value (.00005) suggests that there are.

Next, descriptive statistics (sample size, median, and median absolute deviation) are provided for each region c. The HLE estimates are highest for the Northeast (median = 15.7 years) and lowest for the South (median = 13.0 years). The smallest variability among the states occurs in the Northeast (mad = 0.59), and the largest occurs in the South (mad = 1.48).

Although the Kruskal–Wallis test indicates that there are HLE differences among the regions, it doesn’t indicate where the differences lie. To determine this, you compare the groups two at a time using a Wilcoxon rank-sum test d. With four groups, there are 4 × (4 – 1) / 2 or 6 pairwise comparisons.

The difference between the South and the North Central regions is statistically significant (p = 0.009), whereas the difference between the Northeast and North Central regions isn’t (p = 1.0). In fact, the South differs from each of the other regions, but the other regions don’t differ from each other.

When computing multiple comparisons, you have to be concerned with alpha inflation: an increase in the probability of declaring groups to be significantly different when in fact they aren’t. For six independent comparisons, the chances of finding at least one erroneous difference by chance is 1 – (1 – .05)6 or 0.26.

With a chance of finding at least one false pairwise difference hovering around one in four, you’ll want to adjust the p-value for each comparison upward (make each test more stringent and less likely to declare a difference). Doing so keeps the overall

<<< < Предыдущая 138 139 140 141 142 143 144 145 146 147 148 149150 / 173150 151 152 153 154 155 156 157 158 159 160 161 162 > Следующая >>>

Соседние файлы в предмете [НЕСОРТИРОВАННОЕ]

#
05.08.2019741.83 Кб1psihologia.rtf
#
02.06.2015162.69 Кб77Psyh_final_ver.docx
#
02.06.2015141.74 Кб44Psyh_final_ver.docx
#
26.03.2016226.3 Кб23public_corporation.doc
#
26.03.2016451.53 Кб7pud_finansovyy-menedjment_318476.pdf
#
26.03.201620.33 Mб540R in Action, Second Edition.pdf
#
26.03.2016296.21 Кб17Radaev_Kak_napisat_akademicheskiy_text.pdf
#
26.03.20163.76 Mб4Raeff_Modernity.pdf
#
26.03.20162.12 Mб19raigorodskii_d_ya_hrestomatiya_psihologiya_lich.pdf
#
02.06.2015494.59 Кб6raschet_SRK_smorodin.doc
#
02.06.201563.98 Кб4referat_IOGP_3.docx