Choosing a best predictive solution | 405
As stated previously, SVMs are popular because they work well in many situations. They can also handle situations in which the number of variables is much larger than the number of observations. This has made them popular in the field of biomedicine, where the number of variables collected in a typical DNA microarray study of gene expressions may be one or two orders of magnitude larger than the number of cases available.
One drawback of SVMs is that, like random forests, the resulting classification rules are difficult to understand and communicate. They’re essentially a black box. Additionally, SVMs don’t scale as well as random forests when building models from large training samples. But once a successful model is built, classifying new observations does scale well.
17.6 Choosing a best predictive solution
In sections 17.1 through 17.5, fine-needle aspiration samples were classified as malignant or benign using several supervised machine-learning techniques. Which approach was most accurate? To answer this question, we need to define the term accurate in a binary classification context.
The most commonly reported statistic is the accuracy, or how often the classifier is correct. Although informative, accuracy is insufficient by itself; additional information is needed to evaluate the utility of a classification scheme.
Consider a set of rules for classifying individuals as schizophrenic or non-schizophrenic. Schizophrenia is a rare disorder, with a prevalence of roughly 1% in the general population. If you classify everyone as non-schizophrenic, you'll be right 99% of the time. But this isn't a good classifier, because it also misclassifies every schizophrenic as non-schizophrenic. In addition to the accuracy, you should ask these questions:
■What percentage of schizophrenics are correctly identified?
■What percentage of non-schizophrenics are correctly identified?
■If a person is classified as schizophrenic, how likely is it that this classification will be correct?
■If a person is classified as non-schizophrenic, how likely is it that this classification is correct?
These are questions pertaining to a classifier’s sensitivity, specificity, positive predictive power, and negative predictive power. Each is defined in table 17.1.
Table 17.1 Measures of predictive accuracy

Statistic                 | Interpretation
--------------------------|------------------------------------------------------------------
Sensitivity               | Probability of getting a positive classification when the true outcome is positive (also called the true positive rate or recall)
Specificity               | Probability of getting a negative classification when the true outcome is negative (also called the true negative rate)
Positive predictive value | Probability that an observation with a positive classification is correctly identified as positive (also called precision)
Negative predictive value | Probability that an observation with a negative classification is correctly identified as negative
Accuracy                  | Proportion of observations correctly identified (also called ACC)
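To make these definitions concrete, here is a small worked example (hypothetical counts, not from the chapter's data) applying them to the schizophrenia scenario above: a screen that labels everyone in a population of 10,000 (1% prevalence) as non-schizophrenic.

```r
# Hypothetical counts: 10,000 people, 1% prevalence, everyone classified negative
tn <- 9900   # true negatives: non-schizophrenics correctly called negative
fp <- 0      # false positives: none, since no one is classified positive
fn <- 100    # false negatives: every schizophrenic is missed
tp <- 0      # true positives: none

sensitivity <- tp / (tp + fn)                   # 0: no cases detected
specificity <- tn / (tn + fp)                   # 1: every non-case correct
accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # 0.99, despite being useless
c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy)
```

The 99% accuracy hides a sensitivity of zero, which is exactly why the additional measures in table 17.1 matter.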
A function for calculating these statistics is provided next.
Listing 17.8 Function for assessing binary classification accuracy
performance <- function(table, n=2){
    if(!all(dim(table) == c(2,2)))
        stop("Must be a 2 x 2 table")
    tn = table[1,1]                          # b Extracts frequencies
    fp = table[1,2]
    fn = table[2,1]
    tp = table[2,2]
    sensitivity = tp/(tp+fn)                 # c Calculates statistics
    specificity = tn/(tn+fp)
    ppp = tp/(tp+fp)
    npp = tn/(tn+fn)
    hitrate = (tp+tn)/(tp+tn+fp+fn)
    result <- paste("Sensitivity = ", round(sensitivity, n),
        "\nSpecificity = ", round(specificity, n),
        "\nPositive Predictive Value = ", round(ppp, n),
        "\nNegative Predictive Value = ", round(npp, n),
        "\nAccuracy = ", round(hitrate, n), "\n", sep="")
    cat(result)                              # d Prints results
}
The performance() function takes a table containing the true outcome (rows) and predicted outcome (columns) and returns the five accuracy measures. First, the numbers of true negatives (benign tissue identified as benign), false positives (benign tissue identified as malignant), false negatives (malignant tissue identified as benign), and true positives (malignant tissue identified as malignant) are extracted b. Next, these counts are used to calculate the sensitivity, specificity, positive and negative predictive values, and accuracy c. Finally, the results are formatted and printed d.
In the following listing, the performance() function is applied to each of the five classifiers developed in this chapter.
Listing 17.9 Performance of breast cancer data classifiers
> performance(logit.perf)
Sensitivity = 0.95
Specificity = 0.98
Positive Predictive Value = 0.97
Negative Predictive Value = 0.97
Accuracy = 0.97

> performance(dtree.perf)
Sensitivity = 0.98
Specificity = 0.95
Positive Predictive Value = 0.92
Negative Predictive Value = 0.98
Accuracy = 0.96

> performance(ctree.perf)
Sensitivity = 0.96
Specificity = 0.95
Positive Predictive Value = 0.92
Negative Predictive Value = 0.98
Accuracy = 0.95

> performance(forest.perf)
Sensitivity = 0.99
Specificity = 0.98
Positive Predictive Value = 0.96
Negative Predictive Value = 0.99
Accuracy = 0.98

> performance(svm.perf)
Sensitivity = 0.96
Specificity = 0.98
Positive Predictive Value = 0.96
Negative Predictive Value = 0.98
Accuracy = 0.97
Each of these classifiers (logistic regression, traditional decision tree, conditional inference tree, random forest, and support vector machine) performed exceedingly well on each of the accuracy measures. This won’t always be the case!
In this particular instance, the award appears to go to the random forest model (although the differences are so small, they may be due to chance). For the random forest model, 99% of malignancies were correctly identified, 98% of benign samples were correctly identified, and the overall percentage of correct classifications was 98%. A diagnosis of malignancy was correct 96% of the time, and a benign diagnosis was correct 99% of the time. For diagnoses of cancer, the sensitivity (the proportion of malignant samples correctly identified as malignant) is particularly important.
Although it’s beyond the scope of this chapter, you can often improve a classification system by trading specificity for sensitivity and vice versa. In the logistic regression model, predict() was used to estimate the probability that a case belonged in the malignant group. If the probability was greater than 0.5, the case was assigned to that group. The 0.5 value is called the threshold or cutoff value. If you vary this threshold, you can increase the sensitivity of the classification model at the expense of its specificity. predict() can generate probabilities for decision trees, random forests, and SVMs as well (although the syntax varies by method).
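As a sketch of this idea (using simulated data rather than the breast cancer dataset, so the variable names and numbers are illustrative only), lowering the cutoff in a logistic regression model trades specificity for sensitivity:

```r
set.seed(1234)
x <- rnorm(200)
y <- factor(rbinom(200, 1, plogis(2 * x)), levels = c(0, 1),
            labels = c("benign", "malignant"))

fit  <- glm(y ~ x, family = binomial())
prob <- predict(fit, type = "response")     # estimated P(malignant)

# Classify with an arbitrary cutoff (0.5 is the conventional default)
classify <- function(prob, cutoff = 0.5)
  factor(prob > cutoff, levels = c(FALSE, TRUE),
         labels = c("benign", "malignant"))

table(actual = y, predicted = classify(prob, 0.5))
table(actual = y, predicted = classify(prob, 0.2))  # more cases flagged malignant
```

Because a lower cutoff can only add positive predictions, sensitivity can only rise (and specificity can only fall) as the threshold decreases.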
The impact of varying the threshold value is typically assessed using a receiver operating characteristic (ROC) curve. A ROC curve plots sensitivity versus specificity for a range of threshold values. You can then select a threshold with the best balance of sensitivity and specificity for a given problem. Many R packages generate ROC curves, including ROCR and pROC. Analytic functions in these packages can help you to select the best threshold values for a given scenario or to compare the ROC curves produced by different classification algorithms in order to choose the most useful approach. To learn more, see Kuhn & Johnson (2013). A more advanced discussion is offered by Fawcett (2005).
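For instance, here is a minimal ROC sketch with the pROC package (assuming it is installed; the scores below are simulated stand-ins for model probabilities):

```r
library(pROC)

set.seed(1234)
truth  <- rbinom(200, 1, 0.4)   # simulated true outcomes (0/1)
scores <- truth + rnorm(200)    # noisy scores, higher for true positives

roc_obj <- roc(truth, scores)   # build the ROC curve
auc(roc_obj)                    # area under the curve

# Threshold giving the best sensitivity/specificity balance (Youden's index)
coords(roc_obj, "best",
       ret = c("threshold", "sensitivity", "specificity"))

plot(roc_obj)                   # sensitivity vs. specificity curve
```

With real models, you would replace scores with the probabilities returned by predict() for the validation sample.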
Until now, each classification technique has been applied to data by writing and executing code. In the next section, we’ll look at a graphical user interface that lets you develop and deploy predictive models using a visual interface.
17.7 Using the rattle package for data mining
Rattle (R Analytic Tool to Learn Easily) offers a graphical user interface (GUI) for data mining in R. It gives the user point-and-click access to many of the R functions you've been using in this chapter, as well as other unsupervised and supervised methods not covered here. Rattle can also transform and score data, and it offers a number of data-visualization tools for evaluating models.
You can install the rattle package from CRAN using
install.packages("rattle")
This installs the rattle package, along with several additional packages. A full installation of Rattle and all the packages it can access would require downloading and installing hundreds of packages. To save time and space, a basic set of packages is installed by default. Other packages are installed when you first request an analysis that requires them. In this case, you’ll be prompted to install the missing package(s), and if you reply Yes, the required package will be downloaded and installed from CRAN.
Depending on your operating system and current software, you may have to install additional software. In particular, Rattle requires access to the GTK+ Toolkit. If you have difficulty, follow the OS-specific installation directions and troubleshooting suggestions offered at http://rattle.togaware.com.
Once rattle is installed, launch the interface using
library(rattle)
rattle()
The GUI (see figure 17.6) should open on top of the R console.
In this section, you’ll use Rattle to develop a conditional inference tree for predicting diabetes. The data also comes from the UCI Machine Learning Repository. The Pima Indians Diabetes dataset contains 768 cases originally collected by the National Institute of Diabetes and Digestive and Kidney Disease. The variables are as follows:
■Number of times pregnant
■Plasma glucose concentration at 2 hours in an oral glucose tolerance test
■Diastolic blood pressure (mm Hg)
■Triceps skin fold thickness (mm)
■2-hour serum insulin (mu U/ml)
■Body mass index (weight in kg/(height in m)^2)
■Diabetes pedigree function
■Age (years)
■Class variable (0 = non-diabetic or 1 = diabetic)
Thirty-four percent of the sample was diagnosed with diabetes.
To access this data in Rattle, use the following code:
loc <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
ds  <- "pima-indians-diabetes/pima-indians-diabetes.data"
url <- paste(loc, ds, sep="")

diabetes <- read.table(url, sep=",", header=FALSE)
names(diabetes) <- c("npregant", "plasma", "bp", "triceps",
                     "insulin", "bmi", "pedigree", "age", "class")
diabetes$class <- factor(diabetes$class, levels=c(0,1),
                         labels=c("normal", "diabetic"))

library(rattle)
rattle()
This downloads the data from the UCI repository, names the variables, adds labels to the outcome variable, and opens Rattle. You should be presented with the tabbed dialog box in figure 17.6.
Figure 17.6 Opening Rattle screen
Figure 17.7 Data tab with options to specify the role of each variable
To access the diabetes dataset, click the R Dataset radio button, and select Diabetes from the drop-down box that appears. Then click the Execute button in the upper-left corner. This opens the window shown in figure 17.7.
This window provides a description of each variable and allows you to specify the role each will play in the analyses. Here, variables 1–8 are input (predictor) variables and class is the target (or predicted) outcome, so no changes are necessary.
You can also specify the percentage of cases to be used as a training sample, validation sample, and testing sample. Analysts frequently build models with a training sample, fine-tune parameters with a validation sample, and evaluate the results with a testing sample. By default, Rattle uses a 70/15/15 split and a seed value of 42.
You’ll divide the data into training and validation samples, skipping the test sample. Therefore, enter 70/30/0 in the Partition text box and 1234 in the Seed text box, and click Execute again.
Now let’s fit a prediction model. To generate a conditional inference tree, select the Model tab. Be sure the Tree radio button is selected (the default); and for Algorithm, choose the Conditional radio button. Clicking Execute builds the model using the ctree() function in the party package and displays the results in the bottom of the window (see figure 17.8).
Clicking the Draw button produces an attractive graph (see figure 17.9). (Hint: specifying Use Cairo Graphics in the Settings menu before clicking Draw often produces a more attractive plot.)
Figure 17.8 Model tab with options to build decision trees, random forests, support vector machines, and more. Here, a conditional inference tree has been fitted to the training data.
Figure 17.9 Tree diagram for the conditional inference tree using the diabetes training sample
Figure 17.10 Evaluation tab with the error matrix for the conditional inference tree calculated on the validation sample
To evaluate the fitted model, select the Evaluate tab. Here you can specify a number of evaluative criteria and the sample (training, validation) to use. By default, the error matrix (also called a confusion matrix in this chapter) is selected. Clicking Execute produces the results shown in figure 17.10.
You can import the error matrix into the performance() function to obtain the accuracy statistics:
> cv <- matrix(c(145, 50, 8, 27), nrow=2)
> performance(as.table(cv))
Sensitivity = 0.35
Specificity = 0.95
Positive Predictive Value = 0.77
Negative Predictive Value = 0.74
Accuracy = 0.75
Although the overall accuracy (75%) isn’t terrible, only 35% of diabetics were correctly identified. Try to develop a better classification scheme using random forests or support vector machines—it can be done.
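One hedged starting point, outside Rattle: the sketch below assumes the diabetes data frame loaded earlier in this section, the performance() function from listing 17.8, and an installed randomForest package; the seed and the 70/30 split are arbitrary choices, and your accuracy numbers will vary.

```r
library(randomForest)

set.seed(1234)
train_idx <- sample(nrow(diabetes), 0.7 * nrow(diabetes))
df.train  <- diabetes[train_idx, ]
df.valid  <- diabetes[-train_idx, ]

# Grow a forest of 500 trees to predict diabetes status
fit.forest  <- randomForest(class ~ ., data = df.train,
                            ntree = 500, importance = TRUE)
forest.pred <- predict(fit.forest, df.valid)

performance(table(df.valid$class, forest.pred))  # accuracy measures on validation data
importance(fit.forest)                           # which predictors matter most?
```

Comparing these results with the conditional inference tree's 35% sensitivity shows whether the forest does a better job of identifying diabetics.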
A significant advantage of using Rattle is the ability to fit multiple models to the same dataset and compare each model directly on the Evaluate tab. Check each method on this tab that you want to compare, and click Execute. Additionally, all the