Foundation of Mathematical Biology / The Elements of Statistical Learning
Smoothing Splines
Avoid the knot-selection problem by regularization: among all functions f with two continuous derivatives, minimize
RSS(f; λ) = ∑_{i=1}^N {y_i − f(x_i)}² + λ ∫ {f″(t)}² dt
The first term measures closeness to the data; the second penalizes curvature in f. λ controls the trade-off:
λ = 0: f can be any interpolating function (very rough);
λ = ∞: f is the simple least-squares line fit (very smooth).
Solution: a natural cubic spline with knots at the unique x_i.
Linear smoother: f̂ = (f̂(x_i)) = S_λ y. Calibrate the smoothing parameter λ via the effective degrees of freedom df_λ = trace(S_λ).
Pick λ by cross-validation or generalized cross-validation (GCV).
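As an illustration of a linear smoother with df_λ = trace(S_λ), here is a minimal sketch using a discrete analogue of the penalized criterion above (a Whittaker smoother with a second-difference penalty, not the exact natural-cubic-spline solution; the function name and data are my own):

```python
import numpy as np

# Discrete analogue of the smoothing-spline criterion (illustrative):
# minimize ||y - f||^2 + lam * ||D f||^2, D = second-difference operator.
# Solution is a linear smoother: f = S_lam y, S_lam = (I + lam D'D)^{-1}.
def whittaker_smoother(y, lam):
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)            # (n-2) x n second differences
    S = np.linalg.inv(np.eye(n) + lam * D.T @ D)   # smoother matrix S_lam
    return S @ y, np.trace(S)                      # fit and df_lam = trace(S_lam)

rng = np.random.default_rng(0)
y = np.sin(np.linspace(0, 3 * np.pi, 100)) + rng.normal(0, 0.2, 100)
fit, df = whittaker_smoother(y, lam=10.0)   # larger lam -> smoother fit, smaller df
```

As λ → 0, df → n (interpolation); as λ → ∞, df → 2 (a straight line), matching the two limits above.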
Additive Models
Multiple linear regression:
E(Y | X1, …, Xp) = β0 + β1 X1 + … + βp Xp
Additive model extension:
E(Y | X1, …, Xp) = β0 + s1(X1) + … + sp(Xp)
Estimation of s j via backfitting algorithm:
1. Initialize: β̂0 = (1/N) ∑_{i=1}^N y_i; ŝ_j ≡ 0 ∀ j.
2. Cycle: j = 1, 2, …, p, 1, 2, …, p, …
   ŝ_j ← Smooth_j [ {y_i − β̂0 − ∑_{k≠j} ŝ_k(x_ik)}_{i=1}^N ]
until the ŝ_j converge.
The same generalization – replacing the linear predictor with a sum of smooth functions – and the same backfitting method apply to binary and count outcomes (generalized additive models).
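The backfitting cycle above can be sketched as follows, using a crude nearest-neighbour running-mean in place of each Smooth_j (all function names and the toy data are illustrative, not from the text):

```python
import numpy as np

def smooth(x, z, frac=0.3):
    """Crude nearest-neighbour running-mean smoother (stand-in for Smooth_j)."""
    k = max(2, int(frac * len(x)))
    out = np.empty(len(z))
    for i, xi in enumerate(x):
        nearest = np.argsort(np.abs(x - xi))[:k]
        out[i] = z[nearest].mean()
    return out

def backfit(X, y, n_iter=10):
    """Backfitting for an additive model E(Y|X) = beta0 + sum_j s_j(X_j)."""
    n, p = X.shape
    beta0 = y.mean()                      # step 1: initialize beta0
    s = np.zeros((n, p))                  # step 1: s_j = 0 for all j
    for _ in range(n_iter):               # step 2: cycle over j
        for j in range(p):
            # partial residuals: remove the fit of every other s_k
            partial = y - beta0 - s.sum(axis=1) + s[:, j]
            s[:, j] = smooth(X[:, j], partial)
            s[:, j] -= s[:, j].mean()     # center each s_j for identifiability
    return beta0, s

# Toy additive data: y = x1^2 + sin(2 x2) + noise
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 2))
y = X[:, 0] ** 2 + np.sin(2 * X[:, 1]) + rng.normal(0, 0.3, 200)
beta0, s = backfit(X, y)
fit = beta0 + s.sum(axis=1)
```

Centering each ŝ_j after smoothing keeps the intercept identified; without it, constants could shift freely between β̂0 and the ŝ_j.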
Prostate Cancer: Additive Model Fits
[Figure: fitted smooth functions s(lcavol), s(lweight), s(age), s(lbph), s(lcp), and s(pgg45), each plotted against its covariate.]
Prostate Cancer: Additive Model
             Df   Npar Df   Npar F   Pr(F)
s(lcavol)     1         3     1.15    0.33
s(lweight)    1         3     1.65    0.18
s(lcp)        1         3     2.11    0.10
s(pgg45)      1         3     1.15    0.33
Initial Model: lpsa ~ s(lcavol) + s(lweight) + s(lcp) + s(pgg45)
Final Model: lpsa ~ lcavol + lweight + s(lcp) + s(pgg45)
Step   From            To              Df   Resid Df   AIC
1                                           80         57.5
2      s(lweight)      s(lweight, 2)    2   82         56.4
3      s(lcavol)       s(lcavol, 2)     2   84         55.6
4      s(lcavol, 2)    lcavol           1   85         55.3
5      s(lweight, 2)   lweight          1   86         55.3
Tree-Structured Regression Paradigm
Tree-based methods involve four components:
1. A set of questions – splits – phrased in terms of covariates that serve to partition the covariate space. A tree structure derives from recursive splitting, and a binary tree results if the questions are yes/no. The subgroups created by assigning cases according to splits are termed nodes.
2. A split function φ(s; g) that can be evaluated for any split s of any node g. The split function is used to assess the worth of the competing splits.
3. A means for determining appropriate tree size.
4. Statistical summaries for the nodes of the tree.
Allowable Splits
An interpretable, flexible, feasible set of splits is obtained by requiring that
1. each split depends upon the value of only a single covariate,
2. for continuous or ordered categorical covariates, Xj, only splits resulting from questions of the form "Is Xj ≤ c?" for c ∈ domain(Xj) are considered; thus ordering is preserved,
3. for categorical covariates all possible splits into disjoint subsets of the categories are allowed.
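For a categorical covariate with q levels, rule 3 gives 2^(q−1) − 1 distinct splits into two non-empty subsets. A small enumeration sketch (function name and example levels are my own):

```python
from itertools import combinations

# Enumerate all distinct splits of a set of category levels into two
# non-empty disjoint subsets; fixing one level on the left side avoids
# counting each split twice (left/right mirrors).
def categorical_splits(levels):
    levels = list(levels)
    anchor = levels[0]
    splits = []
    for r in range(0, len(levels) - 1):        # r extra levels join the anchor
        for extra in combinations(levels[1:], r):
            left = {anchor, *extra}
            splits.append((left, set(levels) - left))
    return splits

splits = categorical_splits(["a", "b", "c", "d"])   # 2**3 - 1 = 7 splits
```

This exponential count is why trees with high-cardinality categorical covariates can be expensive to grow.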
Growing a Tree
1. Initialize: root node comprises the entire sample.
2. Recurse: for every terminal node, g,
   (a) examine all splits, s, on each covariate,
   (b) select and execute (create left, gL, and right, gR, daughter nodes) the best of these splits.
3. Stopping: grow large; prune back.
4. Selection: cross-validation, test sample.
Best split determined by split function φ(s; g).
Let ȳ_g = (1/N_g) ∑_{i∈g} y_i be the outcome average for node g.
Within-node sum of squares: SS(g) = ∑_{i∈g} (y_i − ȳ_g)².
Define φ(s; g) = SS(g) − SS(g_L) − SS(g_R).
The best split s* is such that φ(s*; g) = max_s φ(s; g).
Easily computed via updating formulae.
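The updating-formula computation can be sketched as follows (illustrative code, names my own): sorting by x and keeping cumulative sums of y and y² gives each candidate split's sum of squares in O(1), via SS = ∑y² − (∑y)²/n.

```python
import numpy as np

# Best split of one node on a single continuous covariate:
# maximize phi(s, g) = SS(g) - SS(gL) - SS(gR), using cumulative sums
# so each candidate cutpoint costs O(1) after one sort.
def best_split(x, y):
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(ys)
    csum, csq = np.cumsum(ys), np.cumsum(ys ** 2)
    ss_total = csq[-1] - csum[-1] ** 2 / n
    best_phi, best_c = -np.inf, None
    for i in range(1, n):                     # left node = first i sorted points
        if xs[i] == xs[i - 1]:
            continue                          # split only between distinct x values
        ss_left = csq[i - 1] - csum[i - 1] ** 2 / i
        ss_right = (csq[-1] - csq[i - 1]) - (csum[-1] - csum[i - 1]) ** 2 / (n - i)
        phi = ss_total - ss_left - ss_right
        if phi > best_phi:
            best_phi, best_c = phi, (xs[i - 1] + xs[i]) / 2
    return best_c, best_phi

x = np.arange(10.0)
y = np.where(x < 5, 0.0, 1.0)     # step function: jump between x = 4 and x = 5
c, phi = best_split(x, y)         # best cut lands midway, c = 4.5
```

On this toy step function the split at c = 4.5 makes both daughter nodes pure, so φ equals the full SS(g) = 2.5.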
Prostate Cancer: Regression Tree
[Regression tree, reconstructed from the figure: each node shows the mean lpsa and node size n.]
2.4780 (n=97): lcavol < 2.46165?
  yes: 2.1230 (n=76): lcavol < -0.478556?
    yes: 0.6017 (n=9)
    no: 2.3270 (n=67): lweight < 3.68886?
      yes: 2.0330 (n=38): pgg45 < 7.5?
        yes: 1.7250 (n=21): lcavol < 0.774462?
          yes: 1.2630 (n=8)
          no: 2.0100 (n=13)
        no: 2.4130 (n=17)
      no: 2.7120 (n=29): lcavol < 0.821736?
        yes: 2.2880 (n=10)
        no: 2.9360 (n=19)
  no: 3.7650 (n=21): lcavol < 2.79352?
    yes: 3.2840 (n=10)
    no: 4.2030 (n=11)
Prostate Cancer: Regression Tree
[Figure: relative squared error versus number of splits, showing cross-validation and training error curves.]
Prostate Cancer: Pruned Regression Tree
[Pruned regression tree, reconstructed from the figure: each node shows the mean lpsa and node size n.]
2.4780 (n=97): lcavol < 2.46165?
  yes: 2.1230 (n=76): lcavol < -0.478556?
    yes: 0.6017 (n=9)
    no: 2.3270 (n=67)
  no: 3.7650 (n=21)