
Advanced graphics with ggplot2

This chapter covers

An introduction to the ggplot2 package

Using shape, color, and size to visualize multivariate data

Comparing groups with faceted graphs

Customizing ggplot2 plots

In previous chapters, you created a wide variety of general and specialized graphs (and had lots of fun in the process). Most were produced using R’s base graphics system. Given the diversity of methods available in R, it may not surprise you to learn that four separate and complete graphics systems are currently available.

In addition to base graphics, graphics systems are provided by the grid, lattice, and ggplot2 packages. Each is designed to expand on the capabilities of, and correct for deficiencies in, R’s base graphics system.

The grid graphics system provides low-level access to graphic primitives, giving programmers a great deal of flexibility in the creation of graphic output. The lattice package provides an intuitive approach for examining multivariate relationships through conditional one-, two-, or three-dimensional graphs called trellis graphs. The ggplot2 package provides a method of creating innovative graphs based on a comprehensive graphical “grammar.”

In this chapter, we’ll start with a brief overview of the four graphic systems. Then we’ll focus on graphs that can be generated with the ggplot2 package. ggplot2 greatly expands the range and quality of the graphs you can produce in R. It allows you to visualize datasets with many variables using a comprehensive and consistent syntax, and easily generate graphs that would be difficult to create using base R graphics.

19.1 The four graphics systems in R

As stated earlier, there are currently four graphical systems available in R. The base graphics system, written by Ross Ihaka, is included in every R installation. Most of the graphs produced in previous chapters rely on base graphics functions.

The grid graphics system, written by Paul Murrell (2011), is implemented through the grid package. grid graphics offer a lower-level alternative to the standard graphics system. The user can create arbitrary rectangular regions on graphics devices, define coordinate systems for each region, and use a rich set of drawing primitives to control the arrangement and appearance of graphic elements.

This flexibility makes grid graphics a valuable tool for software developers. But the grid package doesn’t provide functions for producing statistical graphics or complete plots. Because of this, the package is rarely used directly by data analysts and won’t be discussed further. If you’re interested in learning more about grid, visit Dr. Murrell’s Grid website (http://mng.bz/C86p) for details.

The lattice package, written by Deepayan Sarkar (2008), implements trellis graphs, as outlined by Cleveland (1985, 1993). Basically, trellis graphs display the distribution of a variable or the relationship between variables, separately for each level of one or more other variables. Built using the grid package, the lattice package has grown beyond Cleveland’s original approach to visualizing multivariate data and now provides a comprehensive alternative system for creating statistical graphics in R. A number of packages described in this book (effects, flexclust, Hmisc, mice, and odfWeave) use functions in the lattice package to produce graphs.
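As a minimal sketch of the trellis idea, the following code uses the singer data frame that ships with lattice to draw one histogram of height for each voice part. The choice of histogram() and of the conditioning variable is only an illustration; the object name trellis_plot is arbitrary.

```r
library(lattice)   # installed with R, but must be loaded explicitly

# One histogram of singer heights for each level of voice.part.
# The formula reads: distribution of height, conditioned on (|) voice.part.
trellis_plot <- histogram(~ height | voice.part, data = singer,
                          main = "Heights of singers by voice part",
                          xlab = "Height (inches)")

print(trellis_plot)   # lattice plots are objects; printing draws them
```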

Finally, the ggplot2 package, written by Hadley Wickham (2009a), provides a system for creating graphs based on the grammar of graphics described by Wilkinson (2005) and expanded by Wickham (2009b). The intention of the ggplot2 package is to provide a comprehensive, grammar-based system for generating graphs in a unified and coherent manner, allowing users to create new and innovative data visualizations. The power of this approach has led ggplot2 to become an important tool for visualizing data using R.

Access to the four systems differs, as outlined in table 19.1. Base graphic functions are automatically available. To access grid and lattice functions, you must load the appropriate package explicitly (for example, library(lattice)). To access ggplot2 functions, you have to download and install the package (install.packages("ggplot2")) before first use and then load it (library(ggplot2)).

 

Table 19.1 Access to graphic systems

System     Included in base installation?   Must be explicitly loaded?
base       Yes                              No
grid       Yes                              Yes
lattice    Yes                              Yes
ggplot2    No                               Yes


The lattice and ggplot2 packages overlap in functionality but approach the creation of graphs differently. Analysts tend to rely on one package or the other when plotting multivariate data. Given the power and popularity of ggplot2, the remainder of this chapter focuses on it. If you would like to learn more about the lattice package, I’ve created a supplementary chapter that you can download from www.statmethods.net/RiA/lattice.pdf or from the publisher’s website at www.manning.com/RinActionSecondEdition.

This chapter uses three datasets to illustrate the use of ggplot2. The first is a data frame called singer that comes from the lattice package; it contains the heights and voice parts of singers in the New York Choral Society. The second is the mtcars data frame that you’ve used throughout this book; it contains automotive details on 32 automobiles. The final data frame is called Salaries and is included with the car package described in chapter 8. Salaries contains salary information for university professors and was collected to explore gender discrepancies in income. Together, these datasets offer a variety of visualization challenges.

Before continuing, be sure the ggplot2 and car packages are installed. You’ll also want to install the gridExtra package. It allows you to place multiple ggplot2 graphs into a single plot (see section 19.7.4).
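One way to check this setup is sketched below. The helper name missing_packages is made up for illustration; it only reports which packages are absent and leaves the install.packages() call to you. Note that recent releases of car load their datasets (including Salaries) from the companion carData package, which is installed along with car.

```r
# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("ggplot2", "car", "gridExtra")
to_install <- missing_packages(needed)

if (length(to_install) > 0) {
  message("Run install.packages(c(",
          paste0('"', to_install, '"', collapse = ", "),
          ")) before continuing")
}

# Two of the three datasets are available without any extra installs.
data(singer, package = "lattice")   # heights and voice parts of singers
dim(singer)                         # rows and columns
dim(mtcars)
```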

19.2 An introduction to the ggplot2 package

The ggplot2 package implements a system for creating graphics in R based on a comprehensive and coherent grammar. This provides a consistency to graph creation that’s often lacking in R and allows you to create graph types that are innovative and novel. In this section, we’ll start with an overview of ggplot2 grammar; subsequent sections dive into the details.

In ggplot2, plots are created by chaining together functions using the plus (+) sign. Each function modifies the plot created up to that point. It’s easiest to see with an example (the graph is given in figure 19.1):

library(ggplot2)
ggplot(data=mtcars, aes(x=wt, y=mpg)) +
  geom_point() +
  labs(title="Automobile Data", x="Weight", y="Miles Per Gallon")

Figure 19.1 Scatterplot of automobile weight by mileage

Let’s break down how the plot was produced. The ggplot() function initializes the plot and specifies the data source (mtcars) and variables (wt, mpg) to be used. The options in the aes() function specify what role each variable will play. (aes stands for aesthetics, or how information is represented visually.) Here, the wt values are mapped to distances along the x-axis, and mpg values are mapped to distances along the y-axis.

The ggplot() function sets up the graph but draws no data on its own. Geometric objects (called geoms for short), which include points, lines, bars, box plots, and shaded regions, are added to the graph using one or more geom functions. In this example, the geom_point() function draws points on the graph, creating a scatterplot. The labs() function is optional and adds annotations (axis labels and a title).
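Because each function returns a plot object, the same graph can also be built step by step. The sketch below stores the intermediate objects (the names p_base, p_points, and p_final are arbitrary) and counts their layers. Reading the layers element is an assumption about how ggplot2 stores plots internally; it is used here only to show that ggplot() by itself adds no geoms.

```r
library(ggplot2)

# Step 1: data and aesthetic mappings only; no geoms yet.
p_base <- ggplot(data = mtcars, aes(x = wt, y = mpg))
length(p_base$layers)      # number of layers so far

# Step 2: adding a geom creates the first layer (the points).
p_points <- p_base + geom_point()
length(p_points$layers)

# Step 3: labels are annotations, not layers.
p_final <- p_points +
  labs(title = "Automobile Data", x = "Weight", y = "Miles Per Gallon")

print(p_final)             # drawing happens when the object is printed
```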

There are many functions in ggplot2, and most can include optional parameters. Expanding on the previous example, the code

library(ggplot2)
ggplot(data=mtcars, aes(x=wt, y=mpg)) +
  geom_point(pch=17, color="blue", size=2) +
  geom_smooth(method="lm", color="red", linetype=2) +
  labs(title="Automobile Data", x="Weight", y="Miles Per Gallon")

produces the graph in figure 19.2.

Options to geom_point() set the point shape to triangles (pch=17), set the point size to 2 (size=2), and render the points in blue (color="blue"). The geom_smooth() function adds a “smoothed” line. Here a linear fit is requested (method="lm"), and a red (color="red"), dashed (linetype=2) line is produced. By default, the line includes a 95% confidence band (the shaded region around it). We’ll go into more detail about modeling relationships with linear and nonlinear fits in section 19.6.

Figure 19.2 Scatterplot of automobile weight by gas mileage, with a superimposed line of best fit and 95% confidence region
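To see what geom_smooth(method="lm") is drawing, you can fit the same straight line yourself with lm(). This sketch assumes the default smoothing formula y ~ x, which corresponds to mpg ~ wt here; the se and level arguments are written out only to make the defaults visible, and the names fit and p_fit are arbitrary.

```r
library(ggplot2)

# The line drawn by geom_smooth(method = "lm") is an ordinary
# least-squares fit of y on x, here mpg on wt.
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)                        # intercept and slope of the fitted line

# Same plot as above, with the smoothing defaults spelled out.
# shape is the ggplot2 name for the base graphics pch parameter.
p_fit <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point(shape = 17, color = "blue", size = 2) +
  geom_smooth(method = "lm", formula = y ~ x,
              se = TRUE, level = 0.95,     # 95% confidence band
              color = "red", linetype = 2) +
  labs(title = "Automobile Data", x = "Weight", y = "Miles Per Gallon")

print(p_fit)
```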

The ggplot2 package provides methods for grouping and faceting. Grouping displays two or more groups of observations in a single plot. Groups are usually differentiated by color, shape, or shading. Faceting displays groups of observations in separate, side-by-side plots. The ggplot2 package uses factors when defining groups or facets.

You can see both grouping and faceting with the mtcars data frame. First, transform the am, vs, and cyl variables into factors:

mtcars$am <- factor(mtcars$am, levels=c(0,1), labels=c("Automatic", "Manual"))
mtcars$vs <- factor(mtcars$vs, levels=c(0,1), labels=c("V-Engine", "Straight Engine"))
mtcars$cyl <- factor(mtcars$cyl)

Next, generate a plot using the following code:

library(ggplot2)
ggplot(data=mtcars, aes(x=hp, y=mpg, shape=cyl, color=cyl)) +
  geom_point(size=3) +
  facet_grid(am~vs) +
  labs(title="Automobile Data by Engine Type",
       x="Horsepower", y="Miles Per Gallon")

Figure 19.3 A scatterplot showing the relationship between horsepower and gas mileage separately for transmission and engine type. The number of cylinders in each automobile engine is represented by both shape and color.

The resulting graph (see figure 19.3) contains separate scatterplots for each combination of transmission type (automatic vs. manual) and engine arrangement (V-engine vs. straight engine). The color and shape of each point indicate the number of engine cylinders in that car. In this case, am and vs are the faceting variables, and cyl is the grouping variable.
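Grouping and faceting can also be used on their own. The sketch below works on a copy of the data (the name car_data is arbitrary) so that it runs whether or not mtcars has already been converted to factors. It builds one plot that only groups by cyl and one that only facets by am.

```r
library(ggplot2)

# Work on a fresh copy so the factor conversion is applied exactly once.
car_data <- datasets::mtcars
car_data$am  <- factor(car_data$am, levels = c(0, 1),
                       labels = c("Automatic", "Manual"))
car_data$cyl <- factor(car_data$cyl)

# Grouping only: all cars in one panel; cylinders are told apart
# by color and shape.
p_grouped <- ggplot(car_data, aes(x = hp, y = mpg, color = cyl, shape = cyl)) +
  geom_point(size = 3) +
  labs(x = "Horsepower", y = "Miles Per Gallon")

# Faceting only: one panel per transmission type, placed side by side
# (. ~ am puts the levels of am in columns).
p_faceted <- ggplot(car_data, aes(x = hp, y = mpg)) +
  geom_point(size = 3) +
  facet_grid(. ~ am) +
  labs(x = "Horsepower", y = "Miles Per Gallon")

print(p_grouped)
print(p_faceted)
```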

The ggplot2 package is powerful and can be used to create a wide array of informative graphs. It’s popular among seasoned R analysts and programmers; and, based on postings in R blogs and discussion groups, that popularity is growing.

Unfortunately, with power comes complexity. Unlike other R packages, ggplot2 can be thought of as a comprehensive graphical programming language in its own right. It has its own learning curve, and at times that curve can be steep. Hang in there—the effort is worth it. Luckily, there are function defaults and language simplifications designed to make your introduction to this package easier. With practice,
