
Smoothing Splines

Avoid the knot-selection problem by regularization.

Among all functions f with two continuous derivatives, minimize

RSS(f; λ) = Σ_{i=1}^{N} { y_i − f(x_i) }² + λ ∫ { f''(t) }² dt

The first term measures closeness to the data; the second penalizes curvature in f. λ controls the trade-off:

λ = 0 : f can be any interpolating function (very rough);

λ = ∞ : f is the simple least-squares straight-line fit (very smooth).

Solution: a natural cubic spline with knots at the unique values of x_i.

Linear smoother: f̂ = ( f̂(x_i) ) = S_λ y. Calibrate the smoothing parameter λ via df_λ = trace(S_λ).

Pick λ by cross-validation or generalized cross-validation (GCV).
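As a rough illustration of the linear-smoother view, the sketch below builds a penalized B-spline (P-spline) smoother in Python. The B-spline basis, knot grid, and second-difference penalty are stand-ins I am assuming for the natural cubic spline basis and the curvature penalty, and `smoother_matrix` is a hypothetical helper, not code from the notes. It forms S_λ explicitly, reads off df_λ = trace(S_λ), and picks λ by GCV.

```python
import numpy as np
from scipy.interpolate import BSpline

# Simulated data (illustrative only).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Penalized B-spline basis: a stand-in for the natural cubic spline basis,
# with a second-difference penalty approximating the curvature penalty.
k = 3                                          # cubic pieces
t = np.r_[[0.0] * k, np.linspace(0, 1, 20), [1.0] * k]   # clamped knot vector
B = BSpline.design_matrix(x, t, k).toarray()   # N x p basis matrix
D = np.diff(np.eye(B.shape[1]), n=2, axis=0)   # second differences
Omega = D.T @ D

def smoother_matrix(lam):
    """S_lambda such that f_hat = S_lambda @ y (hypothetical helper)."""
    return B @ np.linalg.solve(B.T @ B + lam * Omega, B.T)

# Calibrate lambda: df_lambda = trace(S_lambda), model chosen by GCV.
best = None
for lam in 10.0 ** np.arange(-6.0, 3.0):
    S = smoother_matrix(lam)
    df = np.trace(S)
    gcv = np.mean(((y - S @ y) / (1 - df / len(y))) ** 2)
    if best is None or gcv < best[0]:
        best = (gcv, lam, df)
print("GCV-chosen lambda = %g with df = %.1f" % (best[1], best[2]))
```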

Additive Models

Multiple linear regression:

E(Y | X1, ..., Xp) = β0 + β1 X1 + ... + βp Xp

Additive model extension:

E(Y | X1, ..., Xp) = β0 + s1(X1) + ... + sp(Xp)

Estimation of the s_j via the backfitting algorithm (a numpy sketch follows the steps):

1. Initialize: β̂0 = (1/N) Σ_{i=1}^{N} y_i;  ŝ_j ≡ 0 for all j.

2. Cycle over j = 1, 2, ..., p, 1, 2, ..., p, ...:

   ŝ_j ← Smooth_j [ { y_i − β̂0 − Σ_{k ≠ j} ŝ_k(x_ik) }_{i=1}^{N} ]

until the ŝ_j converge.
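A minimal numpy sketch of this backfitting loop is given below. The Gaussian-kernel smoother standing in for Smooth_j, the fixed number of cycles in place of a convergence test, and the names `kernel_smooth` and `backfit` are illustrative assumptions, not the notes' own implementation; each fitted ŝ_j is centered to mean zero for identifiability.

```python
import numpy as np

def kernel_smooth(x, r, bandwidth=0.5):
    """Gaussian-kernel (Nadaraya-Watson) smoother of residuals r against x."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w @ r) / w.sum(axis=1)

def backfit(X, y, n_cycles=20):
    """Backfitting for y ~ beta0 + sum_j s_j(X[:, j])."""
    n, p = X.shape
    beta0 = y.mean()                 # step 1: initialize beta0_hat
    s = np.zeros((n, p))             # s[:, j] holds s_j_hat evaluated at x_ij
    for _ in range(n_cycles):        # step 2: cycle (fixed count stands in for a convergence test)
        for j in range(p):
            partial = y - beta0 - s.sum(axis=1) + s[:, j]   # partial residuals leaving out s_j
            s[:, j] = kernel_smooth(X[:, j], partial)
            s[:, j] -= s[:, j].mean()                       # center each s_j for identifiability
    return beta0, s

# Illustrative use on synthetic data.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 3))
y = 1.0 + np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.2, size=200)
beta0, s = backfit(X, y)
```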

 

 

The same generalization – replacing the linear predictor with a sum of smooth functions – and the same backfitting method apply to binary and count outcomes.
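For instance, with a binary outcome one standard approach (local scoring) works with the additive logit η_i = β0 + Σ_j s_j(x_ij) and μ_i = 1 / (1 + exp(−η_i)); it forms the working response z_i = η_i + (y_i − μ_i) / (μ_i (1 − μ_i)) with weights w_i = μ_i (1 − μ_i), backfits a weighted additive model to the z_i, and repeats until convergence.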

Prostate Cancer: Additive Model Fits

[Figure: estimated smooth functions s(lcavol), s(lweight), s(age), s(lbph), s(lcp), and s(pgg45), each plotted against its covariate (lcavol, lweight, age, lbph, lcp, pgg45).]

Prostate Cancer: Additive Model

             Df   Npar Df   Npar F   Pr(F)
s(lcavol)     1         3     1.15    0.33
s(lweight)    1         3     1.65    0.18
s(lcp)        1         3     2.11    0.10
s(pgg45)      1         3     1.15    0.33

Initial model:  lpsa ~ s(lcavol) + s(lweight) + s(lcp) + s(pgg45)
Final model:    lpsa ~ lcavol + lweight + s(lcp) + s(pgg45)

 

     From            To               Df   Resid Df    AIC
1                                                 80   57.5
2    s(lweight)      s(lweight, 2)     2          82   56.4
3    s(lcavol)       s(lcavol, 2)      2          84   55.6
4    s(lcavol, 2)    lcavol            1          85   55.3
5    s(lweight, 2)   lweight           1          86   55.3
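For reference, AIC balances fit against effective degrees of freedom (generically, AIC = −2 log-likelihood + 2 df), and each accepted step above keeps or lowers the AIC. Since the nonparametric F-tests show little evidence of nonlinearity for lcavol and lweight, the final model retains them only as linear terms.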

Tree-Structured Regression Paradigm

Tree-based methods involve four components:

1. A set of questions, or splits, phrased in terms of covariates that serve to partition the covariate space. A tree structure derives from recursive splitting, and a binary tree results if the questions are yes/no. The subgroups created by assigning cases according to the splits are termed nodes (a minimal node structure is sketched after this list).

2. A split function φ(s; g) that can be evaluated for any split s of any node g. The split function is used to assess the worth of competing splits.

3. A means for determining appropriate tree size.

4. Statistical summaries for the nodes of the tree.
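Purely as an illustration of this bookkeeping (not code from the notes), a node can be represented as below; the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """One node of a binary regression tree (hypothetical field names)."""
    mean: float                       # statistical summary for the node
    n: int                            # number of cases in the node
    covariate: Optional[str] = None   # split question: "Is x[covariate] <= threshold?"
    threshold: Optional[float] = None
    left: Optional["Node"] = None     # daughter node for "yes"
    right: Optional["Node"] = None    # daughter node for "no"

    def predict(self, x: dict) -> float:
        """Drop a case down the tree until a terminal node is reached."""
        if self.left is None:
            return self.mean
        child = self.left if x[self.covariate] <= self.threshold else self.right
        return child.predict(x)
```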

Allowable Splits

An interpretable, flexible, feasible set of splits is obtained by constraining that

1. each split depends upon the value of only a single covariate;

2. for continuous or ordered categorical covariates X_j, only splits resulting from questions of the form "Is X_j ≤ c?" for c ∈ domain(X_j) are considered, so the ordering is preserved;

3. for categorical covariates, all possible splits into disjoint subsets of the categories are allowed (see the enumeration sketch after this list).
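A categorical covariate with q categories admits 2^(q−1) − 1 distinct binary splits. The short sketch below (illustrative, with a hypothetical `categorical_splits` helper) enumerates them by fixing one category on the left-hand side so each unordered partition is listed once.

```python
from itertools import combinations

def categorical_splits(categories):
    """Yield the 2**(q-1) - 1 distinct splits of q categories into two
    disjoint, non-empty subsets (each unordered split listed once)."""
    cats = sorted(categories)
    anchor, rest = cats[0], cats[1:]
    for r in range(len(rest) + 1):
        for subset in combinations(rest, r):
            left = {anchor, *subset}          # anchor fixed on the left avoids duplicates
            right = set(cats) - left
            if right:                         # skip the split with an empty complement
                yield left, right

# q = 4 categories -> 2**3 - 1 = 7 candidate splits.
print(list(categorical_splits(["a", "b", "c", "d"])))
```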

Growing a Tree

1. Initialize: the root node comprises the entire sample.

2. Recurse: for every terminal node g,

   (a) examine all splits s on each covariate;

   (b) select and execute the best of these splits, creating left (g_L) and right (g_R) daughter nodes (a recursive sketch follows this list).

3. Stopping: grow a large tree; prune back.

4. Selection: cross-validation or a test sample.
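The following Python sketch is one minimal, assumed implementation of this grow-and-split recursion; the exhaustive `naive_best_split` search, the minimum-node-size stopping rule, and the nested-dict representation are illustrative choices (the updating-formula version of the split search is sketched after the split-function discussion below).

```python
import numpy as np

def sse(y):
    """Within-node sum of squares about the node mean."""
    return float(((y - y.mean()) ** 2).sum())

def naive_best_split(X, y):
    """Exhaustively score every split by SS(g) - SS(gL) - SS(gR)."""
    best = None
    for j in range(X.shape[1]):
        for c in np.unique(X[:, j])[:-1]:          # cutpoints; largest value excluded
            left = X[:, j] <= c
            gain = sse(y) - sse(y[left]) - sse(y[~left])
            if best is None or gain > best[2]:
                best = (j, c, gain)
    return best

def grow(X, y, min_size=10):
    """Recursively grow a regression tree, returned as a nested dict."""
    node = {"mean": float(y.mean()), "n": len(y)}
    if len(y) < min_size:
        return node                                # stopping rule (illustrative)
    split = naive_best_split(X, y)
    if split is None:
        return node
    j, c, _ = split
    left = X[:, j] <= c
    node.update(covariate=j, threshold=c,
                left=grow(X[left], y[left], min_size),
                right=grow(X[~left], y[~left], min_size))
    return node

# Illustrative use on synthetic data.
rng = np.random.default_rng(3)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + rng.normal(scale=0.2, size=200)
tree = grow(X, y)
```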

Best split determined by split function φ(s; g).

Let ȳ_g = (1/N_g) Σ_{i ∈ g} y_i be the outcome average for node g.

Within-node sum of squares: SS(g) = Σ_{i ∈ g} (y_i − ȳ_g)².

Define φ(s; g) = SS(g) − SS(g_L) − SS(g_R).

The best split s* is such that φ(s*; g) = max_s φ(s; g).

Easily computed via updating formulae.
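One way these updating formulae can look for a single continuous covariate is sketched below (an assumed implementation, not the notes' own code): after sorting, running sums of y and y² give each SS via Σy² − (Σy)²/n, so every cutpoint's φ(s; g) is evaluated in constant time.

```python
import numpy as np

def best_split_one_covariate(x, y):
    """Score every split "Is x <= c?" by phi(s; g) = SS(g) - SS(gL) - SS(gR),
    using SS = sum(y**2) - sum(y)**2 / n with running (prefix) sums."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(ys)
    csum, csum2 = np.cumsum(ys), np.cumsum(ys ** 2)   # left-side running sums
    tot, tot2 = csum[-1], csum2[-1]
    ss_parent = tot2 - tot ** 2 / n

    best_c, best_phi = None, -np.inf
    for i in range(n - 1):
        if xs[i] == xs[i + 1]:
            continue                                   # tied values: not a valid cutpoint
        nl, nr = i + 1, n - i - 1
        ss_left = csum2[i] - csum[i] ** 2 / nl
        ss_right = (tot2 - csum2[i]) - (tot - csum[i]) ** 2 / nr
        phi = ss_parent - ss_left - ss_right
        if phi > best_phi:
            best_phi, best_c = phi, (xs[i] + xs[i + 1]) / 2
    return best_c, best_phi

# Illustrative use.
rng = np.random.default_rng(2)
x = rng.uniform(size=50)
y = np.where(x < 0.4, 1.0, 3.0) + rng.normal(scale=0.1, size=50)
print(best_split_one_covariate(x, y))
```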

Prostate Cancer: Regression Tree

[Figure: full regression tree for lpsa (n = 97); each node shows its mean lpsa and size.]

2.4780, n = 97
├─ lcavol < 2.46165: 2.1230, n = 76
│  ├─ lcavol < -0.478556: 0.6017, n = 9
│  └─ lcavol > -0.478556: 2.3270, n = 67
│     ├─ lweight < 3.68886: 2.0330, n = 38
│     │  ├─ pgg45 < 7.5: 1.7250, n = 21
│     │  │  ├─ lcavol < 0.774462: 1.2630, n = 8
│     │  │  └─ lcavol > 0.774462: 2.0100, n = 13
│     │  └─ pgg45 > 7.5: 2.4130, n = 17
│     └─ lweight > 3.68886: 2.7120, n = 29
│        ├─ lcavol < 0.821736: 2.2880, n = 10
│        └─ lcavol > 0.821736: 2.9360, n = 19
└─ lcavol > 2.46165: 3.7650, n = 21
   ├─ lcavol < 2.79352: 3.2840, n = 10
   └─ lcavol > 2.79352: 4.2030, n = 11
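A comparable tree can be grown with standard software. The sketch below uses scikit-learn's CART-style regression trees and assumes the prostate data are available locally as `prostate.csv` with the column names used above; the file name and the `min_samples_leaf` choice are illustrative, and the fitted splits need not match the figure exactly.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Assumes a local copy of the prostate data; the file name is hypothetical.
prostate = pd.read_csv("prostate.csv")
features = ["lcavol", "lweight", "age", "lbph", "lcp", "pgg45"]
X, y = prostate[features], prostate["lpsa"]

tree = DecisionTreeRegressor(min_samples_leaf=8, random_state=0).fit(X, y)
print(export_text(tree, feature_names=features))   # text rendering of the fitted tree
```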

 

 

 

 

Prostate Cancer: Regression Tree

[Figure: relative squared error versus number of splits (0 to 6), showing the training error and the cross-validation estimate.]
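The grow-large-then-prune strategy, with tree size chosen by cross-validation, can be sketched with scikit-learn's cost-complexity pruning; as before, the `prostate.csv` file name is an assumption, and this is an illustration rather than the analysis behind the figure.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

prostate = pd.read_csv("prostate.csv")          # hypothetical local copy, as above
features = ["lcavol", "lweight", "age", "lbph", "lcp", "pgg45"]
X, y = prostate[features], prostate["lpsa"]

# Grow a large tree, then examine its cost-complexity pruning path ("prune back").
path = DecisionTreeRegressor(random_state=0).fit(X, y).cost_complexity_pruning_path(X, y)

# Choose the amount of pruning by 10-fold cross-validation.
scores = []
for alpha in path.ccp_alphas:
    tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0)
    mse = -cross_val_score(tree, X, y, cv=10, scoring="neg_mean_squared_error").mean()
    scores.append((mse, alpha))
best_mse, best_alpha = min(scores)
print("chosen ccp_alpha = %g (CV MSE %.3f)" % (best_alpha, best_mse))
```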

Prostate Cancer: Pruned Regression Tree

[Figure: pruned regression tree for lpsa (n = 97); each node shows its mean lpsa and size.]

2.4780, n = 97
├─ lcavol < 2.46165: 2.1230, n = 76
│  ├─ lcavol < -0.478556: 0.6017, n = 9
│  └─ lcavol > -0.478556: 2.3270, n = 67
└─ lcavol > 2.46165: 3.7650, n = 21