Foundation of Mathematical Biology / The Elements of Statistical Learning

Bias, Variance, Complexity ctd

Expected prediction error of the fit $\hat f(X)$ at input point $X = x_0$ under squared error loss:

$$
\begin{aligned}
\mathrm{Err}(x_0) &= E[(Y - \hat f(x_0))^2 \mid X = x_0] \\
&= \sigma^2_\varepsilon + [E\hat f(x_0) - f(x_0)]^2 + E[\hat f(x_0) - E\hat f(x_0)]^2 \\
&= \sigma^2_\varepsilon + \mathrm{Bias}^2(\hat f(x_0)) + \mathrm{Var}(\hat f(x_0)) \\
&= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}.
\end{aligned}
$$

First term: variance of the outcome around its true mean $f(x_0)$; unavoidable.

Second term: squared bias, the amount by which the average of the estimate $\hat f(x_0)$ differs from the true mean.

Third term: variance, the expected squared deviation of the estimate around its mean.
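To make the decomposition concrete, here is a small Monte Carlo sketch (not from the text; the true function, noise level, query point, and polynomial degree are arbitrary illustrative choices) that estimates the three terms for a polynomial least-squares fit at a single point $x_0$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup (illustrative only): true function, noise level, fixed design.
f = lambda x: np.sin(2 * np.pi * x)      # true mean function f
sigma_eps = 0.3                          # sd of irreducible noise
x_train = np.linspace(0, 1, 30)          # fixed training inputs
x0 = 0.25                                # query point
degree = 5                               # polynomial fit complexity

n_sims = 2000
fhat_x0 = np.empty(n_sims)
for s in range(n_sims):
    y = f(x_train) + sigma_eps * rng.normal(size=x_train.size)
    coefs = np.polyfit(x_train, y, deg=degree)
    fhat_x0[s] = np.polyval(coefs, x0)

bias2 = (fhat_x0.mean() - f(x0)) ** 2    # [E fhat(x0) - f(x0)]^2
variance = fhat_x0.var()                 # E[fhat(x0) - E fhat(x0)]^2
err_x0 = sigma_eps**2 + bias2 + variance # decomposition of Err(x0)
print(f"bias^2={bias2:.4f}  var={variance:.4f}  Err(x0)~{err_x0:.4f}")
```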

Bias, Variance, Complexity ctd

For a linear model fit by least squares, $\hat f_p(x) = \hat\beta^T x$, we have

$$
\mathrm{Err}(x_0) = E[(Y - \hat f_p(x_0))^2 \mid X = x_0]
= \sigma^2_\varepsilon + [E\hat f_p(x_0) - f(x_0)]^2 + \|h(x_0)\|^2 \sigma^2_\varepsilon.
$$

Here $h(x_0)$ is the weight vector producing the fit: $\hat f_p(x_0) = x_0^T (X^T X)^{-1} X^T y$.

So, $\mathrm{Var}[\hat f_p(x_0)] = \|h(x_0)\|^2 \sigma^2_\varepsilon$.

While this variance changes with $x_0$, its average over the sample values $x_i$ is $(p/N)\,\sigma^2_\varepsilon$.

Hence, the in-sample error is

$$
\frac{1}{N}\sum_{i=1}^{N} \mathrm{Err}(x_i)
= \sigma^2_\varepsilon
+ \frac{1}{N}\sum_{i=1}^{N} [E\hat f_p(x_i) - f(x_i)]^2
+ \frac{p}{N}\,\sigma^2_\varepsilon.
$$

Here model complexity is directly related to the number of parameters $p$; this will be generalized later.
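As a numerical check of the $p/N$ claim (an illustrative sketch with an arbitrary random design, not part of the text): at the training points the weight vectors $h(x_i)$ are the rows of the hat matrix $H = X(X^TX)^{-1}X^T$, and since $H$ is a projection with $\mathrm{trace}(H) = p$, the average of $\|h(x_i)\|^2$ is $p/N$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 8                                 # arbitrary sample size and dimension
X = rng.normal(size=(N, p))                   # random design matrix

# Hat matrix H = X (X^T X)^{-1} X^T; row i is the weight vector h(x_i)^T.
H = X @ np.linalg.solve(X.T @ X, X.T)

avg_sq_norm = np.mean(np.sum(H * H, axis=1))  # average ||h(x_i)||^2
print(avg_sq_norm, p / N)                     # both ~ p/N (H is a projection)
```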

Bias, Variance, Complexity ctd

Ridge regression has an identical form for the test error, but the weights in the variance term are different: $h(x_0) = x_0^T(X^T X + \lambda I)^{-1}X^T$. The bias is also different.

Consider a linear model family (including ridge regression). Let $\beta_*$ denote the parameters of the best-fitting linear approximation to $f$:

$$\beta_* = \arg\min_\beta \, E_X\big(f(X) - \beta^T X\big)^2.$$

The squared bias decomposes as

$$
[f(x_0) - E\hat f_\lambda(x_0)]^2
= [f(x_0) - \beta_*^T x_0]^2 + [\beta_*^T x_0 - E\hat\beta_\lambda^T x_0]^2.
$$

First term: model bias, the error between the best-fitting linear approximation and the true function.

Second term: estimation bias, the error between the average estimate $E\hat\beta_\lambda^T x_0$ and the best linear approximation.

Bias, Variance, Complexity ctd

For linear models fit by least squares, the estimation bias is zero. For restricted fits (e.g., ridge) it is positive, but they have reduced variance.
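The reduced variance of ridge can be seen directly in the weights: the sketch below (arbitrary simulated design and query point, not from the text) compares the variance factor $\|h(x_0)\|^2$ of the least-squares fit ($\lambda = 0$) with ridge fits at increasing penalties.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 100, 10
X = rng.normal(size=(N, p))
x0 = rng.normal(size=p)                        # arbitrary query point

def weight_vector(X, x0, lam):
    """h(x0)^T = x0^T (X^T X + lam*I)^{-1} X^T  (lam=0 gives least squares)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), x0) @ X.T

for lam in [0.0, 1.0, 10.0, 100.0]:
    h = weight_vector(X, x0, lam)
    print(f"lambda={lam:6.1f}  ||h(x0)||^2 = {np.sum(h**2):.4f}")
```

The squared norm shrinks monotonically as $\lambda$ grows, which is exactly the variance reduction traded against the extra estimation bias.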

Model bias can only be reduced by enlarging the class of linear models to a richer collection of models. This can be accomplished by including interaction terms or covariate transformations (e.g., SVMs, additive models; more on these later).

Optimism of Training Error Secn 7.4

Training error typically less than true error.

Define the optimism as $\mathrm{op} \equiv \mathrm{Err}_{in} - E(\mathrm{err})$.

For squared error and other loss functions we have

$$
\mathrm{op} = \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i)
$$

$\Rightarrow$ the amount by which err underestimates the true error depends on how strongly $y_i$ affects its own prediction. The harder we fit the data, the greater $\mathrm{Cov}(\hat y_i, y_i)$, thus increasing the optimism.

If $\hat y_i$ is obtained from a linear fit with $p$ covariates, then

$$\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i) = p\,\sigma^2_\varepsilon,$$

so

$$\mathrm{Err}_{in} = E(\mathrm{err}) + 2\,\frac{p}{N}\,\sigma^2_\varepsilon.$$
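A quick simulation check of this relation (illustrative only; the design, coefficients, and noise level are arbitrary): for a fixed design, compare the average gap between in-sample error and training error with $2(p/N)\sigma^2_\varepsilon$.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 50, 6
sigma = 1.0
X = rng.normal(size=(N, p))                  # fixed design
beta = rng.normal(size=p)                    # fixed true coefficients
mu = X @ beta                                # true mean f(x_i)

n_sims = 5000
err, err_in = np.empty(n_sims), np.empty(n_sims)
for s in range(n_sims):
    y = mu + sigma * rng.normal(size=N)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    yhat = X @ beta_hat
    err[s] = np.mean((y - yhat) ** 2)                # training error
    err_in[s] = sigma**2 + np.mean((mu - yhat) ** 2) # in-sample error at the x_i

print("simulated optimism:", err_in.mean() - err.mean())
print("2*(p/N)*sigma^2   :", 2 * p / N * sigma**2)
```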

Estimation of Prediction Error Secn 7.5

General form of the in-sample estimates is

$$\widehat{\mathrm{Err}}_{in} = \mathrm{err} + \widehat{\mathrm{op}}.$$

Applying this to a linear model with $p$ parameters fit under squared error loss gives the $C_p$ statistic:

$$C_p = \mathrm{err} + 2\,\frac{p}{N}\,\hat\sigma^2_\varepsilon.$$

Here $\hat\sigma^2_\varepsilon$ is an estimate of the error variance obtained from a low-bias (large) model. Under this criterion we adjust the training error by a factor proportional to the number of covariates used.
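A minimal sketch of the $C_p$ computation as defined above (the simulated data, the nested candidate models, and the use of the largest model to estimate $\hat\sigma^2_\varepsilon$ are illustrative assumptions, not from the text):

```python
import numpy as np

def cp_statistic(X, y, sigma2_hat):
    """C_p = err + 2*(p/N)*sigma2_hat for a least-squares fit on X."""
    N, p = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    err = np.mean((y - X @ beta_hat) ** 2)           # training error
    return err + 2 * (p / N) * sigma2_hat

rng = np.random.default_rng(4)
N = 100
X_full = rng.normal(size=(N, 20))                    # low-bias (large) model
y = X_full[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=N)

# sigma^2 estimated from the residuals of the large model.
resid = y - X_full @ np.linalg.lstsq(X_full, y, rcond=None)[0]
sigma2_hat = resid @ resid / (N - X_full.shape[1])

for p in [1, 3, 5, 10, 20]:                          # nested candidate models
    print(p, round(cp_statistic(X_full[:, :p], y, sigma2_hat), 3))
```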

The Akaike Information Criterion is a generalization to the situation where a log-likelihood loss function is used, e.g., binary or Poisson regression.

Criterion Selection Functions

Generic form for AIC is

$$\mathrm{AIC} = -2\,\mathrm{loglik} + 2p.$$

The Bayes information criterion (BIC) (Secn 7.7) is

$$\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\,p.$$

For $N > e^2 \approx 7.4$, the BIC penalty exceeds the AIC penalty

$\Rightarrow$ BIC favors simpler models.

Many variants; new feature – adaptive penalties.

When the log-likelihood is based on a normal distribution we require an estimate of $\sigma^2_\varepsilon$, typically obtained as the mean squared error of a low-bias model $\Rightarrow$ problematic. Cross-validation does not require this.
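One way to compute these criteria for a Gaussian linear model is sketched below; this is a hedged illustration that plugs in a fixed $\hat\sigma^2_\varepsilon$ estimated elsewhere (e.g., from a low-bias model), and conventions for the parameter count vary.

```python
import numpy as np

def aic_bic_gaussian(X, y, sigma2_hat):
    """AIC = -2*loglik + 2p and BIC = -2*loglik + (log N)*p for a
    least-squares fit, with the Gaussian log-likelihood evaluated at a
    plugged-in error variance sigma2_hat (e.g., from a low-bias model)."""
    N, p = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta_hat) ** 2)
    loglik = -0.5 * N * np.log(2 * np.pi * sigma2_hat) - rss / (2 * sigma2_hat)
    return -2 * loglik + 2 * p, -2 * loglik + np.log(N) * p
```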

Effective Number of Parameters Secn 7.6

The $C_p$ or AIC criteria have an optimism estimate (penalty) that involves the number of parameters $p$.

If covariates are selected adaptively, then we no longer have $\sum_i \mathrm{Cov}(\hat y_i, y_i) = p\,\sigma^2_\varepsilon$; e.g., with a total of $p$ covariates, if we select the best-fitting model with $q < p$ covariates, the optimism will exceed $(2q/N)\,\sigma^2_\varepsilon$.

By choosing the best-fitting model with $q$ covariates, the effective number of parameters is $> q$.

Linear fitting methods: $\hat y = S y$, where $S$ is an $N \times N$ matrix depending only on the covariates $x_i$ (not on $y_i$). This includes linear regression and methods using quadratic penalties, such as ridge regression and cubic smoothing splines. Define the effective number of parameters (enp) as $d(S) = \mathrm{trace}(S)$.
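For a concrete example of $d(S)$, the sketch below (arbitrary random design, for illustration only) computes $\mathrm{trace}(S)$ for ridge regression at several penalty values; at $\lambda = 0$ it recovers the usual parameter count $p$, and it shrinks as $\lambda$ grows.

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 100, 15
X = rng.normal(size=(N, p))

def ridge_enp(X, lam):
    """Effective number of parameters d(S) = trace(S) for ridge,
    where S = X (X^T X + lam*I)^{-1} X^T."""
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    return np.trace(S)

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(f"lambda={lam:6.1f}  d(S)={ridge_enp(X, lam):.2f}")  # = p when lam=0
```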

Cross-Validation Secn 7.10

Simplest method for estimating prediction error.

Estimates the extra-sample error $\mathrm{Err} = E[L(Y, \hat f(X))]$.

With enough data (large $N$), set aside a portion as a validation set and use it to assess model performance.

Not feasible with small $N$ $\Rightarrow$ CV offers a finesse.

Randomly partition the data into $K$ equal-sized parts.

For the $k$th part, fit the model to the other $K-1$ parts, then calculate the prediction error of the resulting model when applied to the $k$th part. Do this for $k = 1, \ldots, K$ and combine the prediction error estimates.

Let $\kappa : \{1, \ldots, N\} \mapsto \{1, \ldots, K\}$ map observations to their assigned partition. Let $\hat f^{-k}(x)$ denote the fitted function with the $k$th part removed.

Cross-Validation ctd

Then the CV prediction error estimate is

$$
\mathrm{CV} = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f^{-\kappa(i)}(x_i)\big).
$$

Given a set of models $f(x, \alpha)$ indexed by a tuning parameter $\alpha$ (e.g., ridge, lasso, subset, spline), set

$$
\mathrm{CV}(\alpha) = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f^{-\kappa(i)}(x_i; \alpha)\big).
$$

Find $\hat\alpha$ minimizing $\mathrm{CV}(\alpha)$ and fit the chosen model $f(x, \hat\alpha)$ to all the data.

K = N: leave-one-out CV – approx unbiased for true prediction error but can be highly variable.

K = 5: lower variance but bias can be a problem.

Generally $K = 5$ or $10$ is recommended, but clearly this depends on $N$ $\Rightarrow$ microarray applications??
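To tie the pieces together, here is a bare-bones $K$-fold CV sketch for choosing the ridge penalty under squared error loss (the simulated data, the $\lambda$ grid, and $K = 10$ are arbitrary illustrative choices, not from the text):

```python
import numpy as np

def kfold_cv_ridge(X, y, lambdas, K=10, seed=0):
    """Return CV(lambda) under squared-error loss for each candidate lambda."""
    N, p = X.shape
    rng = np.random.default_rng(seed)
    # kappa(i): random assignment of each observation to one of K folds.
    kappa = rng.permutation(np.repeat(np.arange(K), int(np.ceil(N / K)))[:N])
    cv = np.zeros(len(lambdas))
    for j, lam in enumerate(lambdas):
        errs = np.empty(N)
        for k in range(K):
            train, test = kappa != k, kappa == k
            Xtr, ytr = X[train], y[train]
            beta = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(p), Xtr.T @ ytr)
            errs[test] = (y[test] - X[test] @ beta) ** 2
        cv[j] = errs.mean()   # CV(lambda) = (1/N) sum_i L(y_i, f^{-kappa(i)}(x_i; lambda))
    return cv

rng = np.random.default_rng(6)
N, p = 120, 30
X = rng.normal(size=(N, p))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(size=N)
lambdas = np.array([0.01, 0.1, 1.0, 10.0, 100.0])
cv = kfold_cv_ridge(X, y, lambdas)
print("best lambda:", lambdas[np.argmin(cv)])
```

After the minimizing $\hat\lambda$ is found, the chosen model would be refit on all of the data, as described above.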