

As stated previously, SVMs are popular because they work well in many situations. They can also handle situations in which the number of variables is much larger than the number of observations. This has made them popular in the field of biomedicine, where the number of variables collected in a typical DNA microarray study of gene expressions may be one or two orders of magnitude larger than the number of cases available.

One drawback of SVMs is that, like random forests, the resulting classification rules are difficult to understand and communicate. They’re essentially a black box. Additionally, SVMs don’t scale as well as random forests when building models from large training samples. But once a successful model is built, classifying new observations does scale well.

17.6 Choosing a best predictive solution

In sections 17.1 through 17.5, fine-needle aspiration samples were classified as malignant or benign using several supervised machine-learning techniques. Which approach was most accurate? To answer this question, we need to define the term accurate in a binary classification context.

The most commonly reported statistic is the accuracy, or how often the classifier is correct. Although informative, the accuracy is insufficient by itself. Additional information is also needed to evaluate the utility of a classification scheme.

Consider a set of rules for classifying individuals as schizophrenic or non-schizophrenic. Schizophrenia is a rare disorder, with a prevalence of roughly 1% in the general population. If you classify everyone as non-schizophrenic, you'll be right 99% of the time. But this isn't a good classifier, because it will also misclassify every schizophrenic as non-schizophrenic. In addition to the accuracy, you should ask these questions:

What percentage of schizophrenics are correctly identified?

What percentage of non-schizophrenics are correctly identified?

If a person is classified as schizophrenic, how likely is it that this classification will be correct?

If a person is classified as non-schizophrenic, how likely is it that this classification is correct?

These are questions pertaining to a classifier’s sensitivity, specificity, positive predictive power, and negative predictive power. Each is defined in table 17.1.

Table 17.1 Measures of predictive accuracy

Statistic                    Interpretation
Sensitivity                  Probability of getting a positive classification when the true outcome is positive (also called the true positive rate or recall)
Specificity                  Probability of getting a negative classification when the true outcome is negative (also called the true negative rate)
Positive predictive value    Probability that an observation with a positive classification is correctly identified as positive (also called precision)
Negative predictive value    Probability that an observation with a negative classification is correctly identified as negative
Accuracy                     Proportion of observations correctly identified (also called ACC)

A function for calculating these statistics is provided next.

Listing 17.8 Function for assessing binary classification accuracy

performance <- function(table, n=2){
    if(!all(dim(table) == c(2,2)))
        stop("Must be a 2 x 2 table")
    tn = table[1,1]                                  # b Extracts frequencies
    fp = table[1,2]
    fn = table[2,1]
    tp = table[2,2]
    sensitivity = tp/(tp+fn)                         # c Calculates statistics
    specificity = tn/(tn+fp)
    ppp = tp/(tp+fp)
    npp = tn/(tn+fn)
    hitrate = (tp+tn)/(tp+tn+fp+fn)
    result <- paste("Sensitivity = ", round(sensitivity, n),
                    "\nSpecificity = ", round(specificity, n),
                    "\nPositive Predictive Value = ", round(ppp, n),
                    "\nNegative Predictive Value = ", round(npp, n),
                    "\nAccuracy = ", round(hitrate, n), "\n", sep="")
    cat(result)                                      # d Prints results
}

The performance() function takes a table containing the true outcome (rows) and predicted outcome (columns) and returns the five accuracy measures. First, the number of true negatives (benign tissue identified as benign), false positives (benign tissue identified as malignant), false negatives (malignant tissue identified as benign), and true positives (malignant tissue identified as malignant) are extracted b. Next, these counts are used to calculate the sensitivity, specificity, positive and negative predictive values, and accuracy c. Finally, the results are formatted and printed d.
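For example (the counts below are made up purely for illustration), you can build a 2 x 2 matrix by hand, convert it to a table, and pass it to performance(); the rows give the true outcome and the columns the predicted outcome:

# Hypothetical counts: 118 true negatives, 2 false negatives,
# 10 false positives, 76 true positives
cm <- matrix(c(118, 2, 10, 76), nrow=2,
             dimnames=list(Actual=c("benign", "malignant"),
                           Predicted=c("benign", "malignant")))
performance(as.table(cm))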

In the following listing, the performance() function is applied to each of the five classifiers developed in this chapter.

Listing 17.9 Performance of breast cancer data classifiers

> performance(logit.perf)
Sensitivity = 0.95
Specificity = 0.98
Positive Predictive Value = 0.97
Negative Predictive Value = 0.97
Accuracy = 0.97

> performance(dtree.perf)
Sensitivity = 0.98
Specificity = 0.95
Positive Predictive Value = 0.92
Negative Predictive Value = 0.98
Accuracy = 0.96

> performance(ctree.perf)
Sensitivity = 0.96
Specificity = 0.95
Positive Predictive Value = 0.92
Negative Predictive Value = 0.98
Accuracy = 0.95

> performance(forest.perf)
Sensitivity = 0.99
Specificity = 0.98
Positive Predictive Value = 0.96
Negative Predictive Value = 0.99
Accuracy = 0.98

> performance(svm.perf)
Sensitivity = 0.96
Specificity = 0.98
Positive Predictive Value = 0.96
Negative Predictive Value = 0.98
Accuracy = 0.97

Each of these classifiers (logistic regression, traditional decision tree, conditional inference tree, random forest, and support vector machine) performed exceedingly well on each of the accuracy measures. This won’t always be the case!

In this particular instance, the award appears to go to the random forest model (although the differences are so small, they may be due to chance). For the random forest model, 99% of malignancies were correctly identified, 98% of benign samples were correctly identified, and the overall percentage of correct classifications was 98%. A diagnosis of malignancy was correct 96% of the time (4% of malignant diagnoses were wrong), and a benign diagnosis was correct 99% of the time (1% of benign diagnoses were wrong). For diagnoses of cancer, the sensitivity (the proportion of malignant samples correctly identified as malignant) is particularly important.

Although it’s beyond the scope of this chapter, you can often improve a classification system by trading specificity for sensitivity and vice versa. In the logistic regression model, predict() was used to estimate the probability that a case belonged in the malignant group. If the probability was greater than 0.5, the case was assigned to that group. The 0.5 value is called the threshold or cutoff value. If you vary this threshold, you can increase the sensitivity of the classification model at the expense of its specificity. predict() can generate probabilities for decision trees, random forests, and SVMs as well (although the syntax varies by method).
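A minimal sketch of what this might look like for the logistic regression model follows. It assumes the fitted model and validation data frame are named fit.logit and df.validate, as elsewhere in the chapter (substitute your own object names), and lowers the cutoff from 0.5 to 0.3:

# Sketch: lower the classification threshold from 0.5 to 0.3
# (assumes fit.logit and df.validate from the logistic regression section)
prob <- predict(fit.logit, df.validate, type="response")
logit.pred.30 <- factor(prob > 0.3, levels=c(FALSE, TRUE),
                        labels=c("benign", "malignant"))
logit.perf.30 <- table(df.validate$class, logit.pred.30,
                       dnn=c("Actual", "Predicted"))
performance(logit.perf.30)

With the lower cutoff, more cases are called malignant, so sensitivity should rise while specificity falls.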


The impact of varying the threshold value is typically assessed using a receiver operating characteristic (ROC) curve. A ROC curve plots sensitivity versus specificity for a range of threshold values. You can then select a threshold with the best balance of sensitivity and specificity for a given problem. Many R packages generate ROC curves, including ROCR and pROC. Analytic functions in these packages can help you to select the best threshold values for a given scenario or to compare the ROC curves produced by different classification algorithms in order to choose the most useful approach. To learn more, see Kuhn & Johnson (2013). A more advanced discussion is offered by Fawcett (2005).
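As a brief sketch using pROC (reusing the predicted probabilities prob and the validation data frame df.validate assumed in the previous sketch):

# Sketch: ROC curve for the logistic model's predicted probabilities
library(pROC)
roc.obj <- roc(df.validate$class, prob)
plot(roc.obj)                                    # sensitivity plotted against specificity
auc(roc.obj)                                     # area under the ROC curve
coords(roc.obj, "best", best.method="youden")    # threshold balancing sensitivity and specificity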

Until now, each classification technique has been applied to data by writing and executing code. In the next section, we’ll look at a graphical user interface that lets you develop and deploy predictive models using a visual interface.

17.7 Using the rattle package for data mining

Rattle (R Analytic Tool to Learn Easily) offers a graphical user interface (GUI) for data mining in R. It gives the user point-and-click access to many of the R functions you’ve been using in this chapter, as well as other unsupervised and supervised data models not covered here. Rattle also supports the ability to transform and score data, and it offers a number of data-visualization tools for evaluating models.

You can install the rattle package from CRAN using

install.packages("rattle")

This installs the rattle package, along with several additional packages. A full installation of Rattle and all the packages it can access would require downloading and installing hundreds of packages. To save time and space, a basic set of packages is installed by default. Other packages are installed when you first request an analysis that requires them. In this case, you’ll be prompted to install the missing package(s), and if you reply Yes, the required package will be downloaded and installed from CRAN.

Depending on your operating system and current software, you may have to install additional software. In particular, Rattle requires access to the GTK+ Toolkit. If you have difficulty, follow the OS-specific installation directions and troubleshooting suggestions offered at http://rattle.togaware.com.

Once rattle is installed, launch the interface using

library(rattle)

rattle()

The GUI (see figure 17.6) should open on top of the R console.

In this section, you’ll use Rattle to develop a conditional inference tree for predicting diabetes. The data also comes from the UCI Machine Learning Repository. The Pima Indians Diabetes dataset contains 768 cases originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases. The variables are as follows:

Number of times pregnant

Plasma glucose concentration at 2 hours in an oral glucose tolerance test


Diastolic blood pressure (mm Hg)

Triceps skin fold thickness (mm)

2-hour serum insulin (mu U/ml)

Body mass index (weight in kg/(height in m)^2)

Diabetes pedigree function

Age (years)

Class variable (0 = non-diabetic or 1 = diabetic)

Thirty-four percent of the sample was diagnosed with diabetes.

To access this data in Rattle, use the following code:

loc <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
ds  <- "pima-indians-diabetes/pima-indians-diabetes.data"
url <- paste(loc, ds, sep="")

diabetes <- read.table(url, sep=",", header=FALSE)
names(diabetes) <- c("npregant", "plasma", "bp", "triceps",
                     "insulin", "bmi", "pedigree", "age", "class")
diabetes$class <- factor(diabetes$class, levels=c(0,1),
                         labels=c("normal", "diabetic"))

library(rattle)
rattle()

This downloads the data from the UCI repository, names the variables, adds labels to the outcome variable, and opens Rattle. You should be presented with the tabbed dialog box in figure 17.6.

Figure 17.6 Opening Rattle screen


Figure 17.7 Data tab with options to specify the role of each variable

To access the diabetes dataset, click the R Dataset radio button, and select Diabetes from the drop-down box that appears. Then click the Execute button in the upper-left corner. This opens the window shown in figure 17.7.

This window provides a description of each variable and allows you to specify the role each will play in the analyses. Here, variables 1–9 are input (predictor) variables, and class is the target (or predicted) outcome, so no changes are necessary.

You can also specify the percentage of cases to be used as a training sample, validation sample, and testing sample. Analysts frequently build models with a training sample, fine-tune parameters with a validation sample, and evaluate the results with a testing sample. By default, Rattle uses a 70/15/15 split and a seed value of 42.

You’ll divide the data into training and validation samples, skipping the test sample. Therefore, enter 70/30/0 in the Partition text box and 1234 in the Seed text box, and click Execute again.
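If you want to reproduce a comparable split in code (this is only an approximation; Rattle's internal sampling won't necessarily select the same rows), something like the following divides the diabetes data 70/30:

# Sketch: a 70/30 training/validation split similar to the Rattle partition
set.seed(1234)
train <- sample(nrow(diabetes), round(0.7 * nrow(diabetes)))
diabetes.train    <- diabetes[train, ]
diabetes.validate <- diabetes[-train, ]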

Now let’s fit a prediction model. To generate a conditional inference tree, select the Model tab. Be sure the Tree radio button is selected (the default); and for Algorithm, choose the Conditional radio button. Clicking Execute builds the model using the ctree() function in the party package and displays the results in the bottom of the window (see figure 17.8).
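Behind the scenes, this is roughly what Rattle is doing. A sketch using the diabetes.train and diabetes.validate objects created above (rather than Rattle's own internal partition) would be:

# Sketch: fit and evaluate the conditional inference tree directly with party
library(party)
fit.diab.ctree <- ctree(class ~ ., data=diabetes.train)
diab.pred <- predict(fit.diab.ctree, diabetes.validate, type="response")
diab.perf <- table(diabetes.validate$class, diab.pred,
                   dnn=c("Actual", "Predicted"))
performance(diab.perf)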

Clicking the Draw button produces an attractive graph (see figure 17.9). (Hint: specifying Use Cairo Graphics in the Settings menu before clicking Draw often produces a more attractive plot.)


Figure 17.8 Model tab with options to build decision trees, random forests, support vector machines, and more. Here, a conditional inference tree has been fitted to the training data.

 

 

[Tree diagram: the root node splits on plasma (p < 0.001) at 123; subsequent splits involve npregant (p < 0.001, cutoff 6), plasma (p < 0.001, cutoff 157), and age (p = 0.001, cutoff 34; p = 0.012, cutoff 59). The six terminal nodes (n = 216, 46, 50, 148, 68, and 9) each show the proportion of normal versus diabetic cases.]

Figure 17.9 Tree diagram for the conditional inference tree using the diabetes training sample


Figure 17.10 Evaluation tab with the error matrix for the conditional inference tree calculated on the validation sample

To evaluate the fitted model, select the Evaluate tab. Here you can specify a number of evaluative criteria and the sample (training, validation) to use. By default, the error matrix (also called a confusion matrix in this chapter) is selected. Clicking Execute produces the results shown in figure 17.10.

You can import the error matrix into the performance() function to obtain the accuracy statistics:

> cv <- matrix(c(145, 50, 8, 27), nrow=2)
> performance(as.table(cv))
Sensitivity = 0.35
Specificity = 0.95
Positive Predictive Value = 0.77
Negative Predictive Value = 0.74
Accuracy = 0.75

Although the overall accuracy (75%) isn’t terrible, only 35% of diabetics were correctly identified. Try to develop a better classification scheme using random forests or support vector machines—it can be done.
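As a starting point, here's a sketch using the randomForest package (one of several random forest implementations you could choose) and the diabetes.train/diabetes.validate split created earlier; your results will depend on the seed and partition you use:

# Sketch: random forest for the diabetes data, evaluated on the validation sample
library(randomForest)
set.seed(1234)
fit.diab.forest <- randomForest(class ~ ., data=diabetes.train,
                                importance=TRUE)
diab.forest.pred <- predict(fit.diab.forest, diabetes.validate)
diab.forest.perf <- table(diabetes.validate$class, diab.forest.pred,
                          dnn=c("Actual", "Predicted"))
performance(diab.forest.perf)

Comparing its performance() output with the conditional inference tree's 35% sensitivity will tell you whether you've improved the identification of diabetics.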

A significant advantage of using Rattle is the ability to fit multiple models to the same dataset and compare each model directly on the Evaluate tab. Check each method on this tab that you want to compare, and click Execute. Additionally, all the
