

As stated previously, SVMs are popular because they work well in many situations. They can also handle situations in which the number of variables is much larger than the number of observations. This has made them popular in the field of biomedicine, where the number of variables collected in a typical DNA microarray study of gene expressions may be one or two orders of magnitude larger than the number of cases available.

One drawback of SVMs is that, like random forests, the resulting classification rules are difficult to understand and communicate. They’re essentially a black box. Additionally, SVMs don’t scale as well as random forests when building models from large training samples. But once a successful model is built, classifying new observations does scale well.

17.6 Choosing a best predictive solution

In sections 17.1 through 17.5, fine-needle aspiration samples were classified as malignant or benign using several supervised machine-learning techniques. Which approach was most accurate? To answer this question, we need to define the term accurate in a binary classification context.

The most commonly reported statistic is the accuracy, or how often the classifier is correct. Although informative, the accuracy is insufficient by itself. Additional information is also needed to evaluate the utility of a classification scheme.

Consider a set of rules for classifying individuals as schizophrenic or non-schizophrenic. Schizophrenia is a rare disorder, with a prevalence of roughly 1% in the general population. If you classify everyone as non-schizophrenic, you'll be right 99% of the time. But this isn't a good classifier, because it will also misclassify every schizophrenic as non-schizophrenic. In addition to the accuracy, you should ask these questions:

What percentage of schizophrenics are correctly identified?

What percentage of non-schizophrenics are correctly identified?

If a person is classified as schizophrenic, how likely is it that this classification will be correct?

If a person is classified as non-schizophrenic, how likely is it that this classification is correct?

These are questions pertaining to a classifier’s sensitivity, specificity, positive predictive power, and negative predictive power. Each is defined in table 17.1.

Table 17.1 Measures of predictive accuracy

Statistic                    Interpretation
Sensitivity                  Probability of getting a positive classification when the true outcome is positive (also called the true positive rate or recall)
Specificity                  Probability of getting a negative classification when the true outcome is negative (also called the true negative rate)
Positive predictive value    Probability that an observation with a positive classification is correctly identified as positive (also called precision)
Negative predictive value    Probability that an observation with a negative classification is correctly identified as negative
Accuracy                     Proportion of observations correctly identified (also called ACC)

A function for calculating these statistics is provided next.

Listing 17.8 Function for assessing binary classification accuracy

performance <- function(table, n=2){
    if(!all(dim(table) == c(2,2)))
        stop("Must be a 2 x 2 table")
    tn = table[1,1]                                  # b Extracts frequencies
    fp = table[1,2]
    fn = table[2,1]
    tp = table[2,2]
    sensitivity = tp/(tp+fn)                         # c Calculates statistics
    specificity = tn/(tn+fp)
    ppp = tp/(tp+fp)
    npp = tn/(tn+fn)
    hitrate = (tp+tn)/(tp+tn+fp+fn)
    result <- paste("Sensitivity = ", round(sensitivity, n),
                    "\nSpecificity = ", round(specificity, n),
                    "\nPositive Predictive Value = ", round(ppp, n),
                    "\nNegative Predictive Value = ", round(npp, n),
                    "\nAccuracy = ", round(hitrate, n), "\n", sep="")
    cat(result)                                      # d Prints results
}

The performance() function takes a table containing the true outcome (rows) and predicted outcome (columns) and returns the five accuracy measures. First, the number of true negatives (benign tissue identified as benign), false positives (benign tissue identified as malignant), false negatives (malignant tissue identified as benign), and true positives (malignant tissue identified as malignant) are extracted b. Next, these counts are used to calculate the sensitivity, specificity, positive and negative predictive values, and accuracy c. Finally, the results are formatted and printed d.
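For example (the counts below are made up purely for illustration), you can build a 2 x 2 matrix by hand, convert it to a table, and pass it to performance(); the rows give the true outcome and the columns the predicted outcome:

# Hypothetical counts: 118 true negatives, 2 false negatives,
# 10 false positives, 76 true positives
cm <- matrix(c(118, 2, 10, 76), nrow=2,
             dimnames=list(Actual=c("benign", "malignant"),
                           Predicted=c("benign", "malignant")))
performance(as.table(cm))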

In the following listing, the performance() function is applied to each of the five classifiers developed in this chapter.

Listing 17.9 Performance of breast cancer data classifiers

> performance(logit.perf)
Sensitivity = 0.95
Specificity = 0.98
Positive Predictive Value = 0.97
Negative Predictive Value = 0.97
Accuracy = 0.97

> performance(dtree.perf)
Sensitivity = 0.98
Specificity = 0.95
Positive Predictive Value = 0.92
Negative Predictive Value = 0.98
Accuracy = 0.96

> performance(ctree.perf)
Sensitivity = 0.96
Specificity = 0.95
Positive Predictive Value = 0.92
Negative Predictive Value = 0.98
Accuracy = 0.95

> performance(forest.perf)
Sensitivity = 0.99
Specificity = 0.98
Positive Predictive Value = 0.96
Negative Predictive Value = 0.99
Accuracy = 0.98

> performance(svm.perf)
Sensitivity = 0.96
Specificity = 0.98
Positive Predictive Value = 0.96
Negative Predictive Value = 0.98
Accuracy = 0.97

Each of these classifiers (logistic regression, traditional decision tree, conditional inference tree, random forest, and support vector machine) performed exceedingly well on each of the accuracy measures. This won’t always be the case!

In this particular instance, the award appears to go to the random forest model (although the differences are so small, they may be due to chance). For the random forest model, 99% of malignancies were correctly identified, 98% of benign samples were correctly identified, and the overall percentage of correct classifications was 98%. A diagnosis of malignancy was correct 96% of the time (4% of malignant diagnoses were wrong), and a benign diagnosis was correct 99% of the time (1% of benign diagnoses were wrong). For diagnoses of cancer, the sensitivity (the proportion of malignant samples correctly identified as malignant) is particularly important.

Although it’s beyond the scope of this chapter, you can often improve a classification system by trading specificity for sensitivity and vice versa. In the logistic regression model, predict() was used to estimate the probability that a case belonged in the malignant group. If the probability was greater than 0.5, the case was assigned to that group. The 0.5 value is called the threshold or cutoff value. If you vary this threshold, you can increase the sensitivity of the classification model at the expense of its specificity. predict() can generate probabilities for decision trees, random forests, and SVMs as well (although the syntax varies by method).
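A minimal sketch of what this might look like for the logistic regression model follows. It assumes the fitted model and validation data frame are named fit.logit and df.validate, as elsewhere in the chapter (substitute your own object names), and lowers the cutoff from 0.5 to 0.3:

# Sketch: lower the classification threshold from 0.5 to 0.3
# (assumes fit.logit and df.validate from the logistic regression section)
prob <- predict(fit.logit, df.validate, type="response")
logit.pred.30 <- factor(prob > 0.3, levels=c(FALSE, TRUE),
                        labels=c("benign", "malignant"))
logit.perf.30 <- table(df.validate$class, logit.pred.30,
                       dnn=c("Actual", "Predicted"))
performance(logit.perf.30)

With the lower cutoff, more cases are called malignant, so sensitivity should rise while specificity falls.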


The impact of varying the threshold value is typically assessed using a receiver operating characteristic (ROC) curve. A ROC curve plots sensitivity versus specificity for a range of threshold values. You can then select a threshold with the best balance of sensitivity and specificity for a given problem. Many R packages generate ROC curves, including ROCR and pROC. Analytic functions in these packages can help you to select the best threshold values for a given scenario or to compare the ROC curves produced by different classification algorithms in order to choose the most useful approach. To learn more, see Kuhn & Johnson (2013). A more advanced discussion is offered by Fawcett (2005).
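As a brief sketch using pROC (reusing the predicted probabilities prob and the validation data frame df.validate assumed in the previous sketch):

# Sketch: ROC curve for the logistic model's predicted probabilities
library(pROC)
roc.obj <- roc(df.validate$class, prob)
plot(roc.obj)                                    # sensitivity plotted against specificity
auc(roc.obj)                                     # area under the ROC curve
coords(roc.obj, "best", best.method="youden")    # threshold balancing sensitivity and specificity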

Until now, each classification technique has been applied to data by writing and executing code. In the next section, we’ll look at a graphical user interface that lets you develop and deploy predictive models using a visual interface.

17.7 Using the rattle package for data mining

Rattle (R Analytic Tool to Learn Easily) offers a graphical user interface (GUI) for data mining in R. It gives the user point-and-click access to many of the R functions you’ve been using in this chapter, as well as other unsupervised and supervised data models not covered here. Rattle also supports the ability to transform and score data, and it offers a number of data-visualization tools for evaluating models.

You can install the rattle package from CRAN using

install.packages("rattle")

This installs the rattle package, along with several additional packages. A full installation of Rattle and all the packages it can access would require downloading and installing hundreds of packages. To save time and space, a basic set of packages is installed by default. Other packages are installed when you first request an analysis that requires them. In this case, you’ll be prompted to install the missing package(s), and if you reply Yes, the required package will be downloaded and installed from CRAN.

Depending on your operating system and current software, you may have to install additional software. In particular, Rattle requires access to the GTK+ Toolkit. If you have difficulty, follow the OS-specific installation directions and troubleshooting suggestions offered at http://rattle.togaware.com.

Once rattle is installed, launch the interface using

library(rattle)

rattle()

The GUI (see figure 17.6) should open on top of the R console.

In this section, you’ll use Rattle to develop a conditional inference tree for predicting diabetes. The data also comes from the UCI Machine Learning Repository. The Pima Indians Diabetes dataset contains 768 cases originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases. The variables are as follows:

Number of times pregnant

Plasma glucose concentration at 2 hours in an oral glucose tolerance test


Diastolic blood pressure (mm Hg)

Triceps skin fold thickness (mm)

2-hour serum insulin (mu U/ml)

Body mass index (weight in kg/(height in m)^2)

Diabetes pedigree function

Age (years)

Class variable (0 = non-diabetic or 1 = diabetic)

Thirty-four percent of the sample was diagnosed with diabetes.

To access this data in Rattle, use the following code:

loc <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
ds  <- "pima-indians-diabetes/pima-indians-diabetes.data"
url <- paste(loc, ds, sep="")

diabetes <- read.table(url, sep=",", header=FALSE)
names(diabetes) <- c("npregant", "plasma", "bp", "triceps",
                     "insulin", "bmi", "pedigree", "age", "class")
diabetes$class <- factor(diabetes$class, levels=c(0,1),
                         labels=c("normal", "diabetic"))

library(rattle)
rattle()

This downloads the data from the UCI repository, names the variables, adds labels to the outcome variable, and opens Rattle. You should be presented with the tabbed dialog box in figure 17.6.

Figure 17.6 Opening Rattle screen


Figure 17.7 Data tab with options to specify the role of each variable

To access the diabetes dataset, click the R Dataset radio button, and select Diabetes from the drop-down box that appears. Then click the Execute button in the upper-left corner. This opens the window shown in figure 17.7.

This window provides a description of each variable and allows you to specify the role each will play in the analyses. Here, variables 1–9 are input (predictor) variables, and class is the target (or predicted) outcome, so no changes are necessary.

You can also specify the percentage of cases to be used as a training sample, validation sample, and testing sample. Analysts frequently build models with a training sample, fine-tune parameters with a validation sample, and evaluate the results with a testing sample. By default, Rattle uses a 70/15/15 split and a seed value of 42.

You’ll divide the data into training and validation samples, skipping the test sample. Therefore, enter 70/30/0 in the Partition text box and 1234 in the Seed text box, and click Execute again.
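If you want to reproduce a comparable split in code (this is only an approximation; Rattle's internal sampling won't necessarily select the same rows), something like the following divides the diabetes data 70/30:

# Sketch: a 70/30 training/validation split similar to the Rattle partition
set.seed(1234)
train <- sample(nrow(diabetes), round(0.7 * nrow(diabetes)))
diabetes.train    <- diabetes[train, ]
diabetes.validate <- diabetes[-train, ]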

Now let’s fit a prediction model. To generate a conditional inference tree, select the Model tab. Be sure the Tree radio button is selected (the default); and for Algorithm, choose the Conditional radio button. Clicking Execute builds the model using the ctree() function in the party package and displays the results in the bottom of the window (see figure 17.8).
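Behind the scenes, this is roughly what Rattle is doing. A sketch using the diabetes.train and diabetes.validate objects created above (rather than Rattle's own internal partition) would be:

# Sketch: fit and evaluate the conditional inference tree directly with party
library(party)
fit.diab.ctree <- ctree(class ~ ., data=diabetes.train)
diab.pred <- predict(fit.diab.ctree, diabetes.validate, type="response")
diab.perf <- table(diabetes.validate$class, diab.pred,
                   dnn=c("Actual", "Predicted"))
performance(diab.perf)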

Clicking the Draw button produces an attractive graph (see figure 17.9). (Hint: specifying Use Cairo Graphics in the Settings menu before clicking Draw often produces a more attractive plot.)


Figure 17.8 Model tab with options to build decision trees, random forests, support vector machines, and more. Here, a conditional inference tree has been fitted to the training data.

 

 

[Tree diagram: the root node splits on plasma (p < 0.001) at 123; subsequent splits involve npregant (p < 0.001, cutoff 6), plasma (p < 0.001, cutoff 157), and age (p = 0.001, cutoff 34; p = 0.012, cutoff 59). The six terminal nodes (n = 216, 46, 50, 148, 68, and 9) each show the proportion of normal versus diabetic cases.]

Figure 17.9 Tree diagram for the conditional inference tree using the diabetes training sample


Figure 17.10 Evaluation tab with the error matrix for the conditional inference tree calculated on the validation sample

To evaluate the fitted model, select the Evaluate tab. Here you can specify a number of evaluative criteria and the sample (training, validation) to use. By default, the error matrix (also called a confusion matrix in this chapter) is selected. Clicking Execute produces the results shown in figure 17.10.

You can import the error matrix into the performance() function to obtain the accuracy statistics:

> cv <- matrix(c(145, 50, 8, 27), nrow=2)
> performance(as.table(cv))
Sensitivity = 0.35
Specificity = 0.95
Positive Predictive Value = 0.77
Negative Predictive Value = 0.74
Accuracy = 0.75

Although the overall accuracy (75%) isn’t terrible, only 35% of diabetics were correctly identified. Try to develop a better classification scheme using random forests or support vector machines—it can be done.
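As a starting point, here's a sketch using the randomForest package (one of several random forest implementations you could choose) and the diabetes.train/diabetes.validate split created earlier; your results will depend on the seed and partition you use:

# Sketch: random forest for the diabetes data, evaluated on the validation sample
library(randomForest)
set.seed(1234)
fit.diab.forest <- randomForest(class ~ ., data=diabetes.train,
                                importance=TRUE)
diab.forest.pred <- predict(fit.diab.forest, diabetes.validate)
diab.forest.perf <- table(diabetes.validate$class, diab.forest.pred,
                          dnn=c("Actual", "Predicted"))
performance(diab.forest.perf)

Comparing its performance() output with the conditional inference tree's 35% sensitivity will tell you whether you've improved the identification of diabetics.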

A significant advantage of using Rattle is the ability to fit multiple models to the same dataset and compare each model directly on the Evaluate tab. Check each method on this tab that you want to compare, and click Execute. Additionally, all the
