Foundation of Mathematical Biology / The Elements of Statistical Learning
Gene Harvesting
Hastie, Tibshirani, Botstein, Brown (2001). genomebiology.com/2001/2/1/research
First cluster genes using hierarchical clustering.
Obtain average expression profiles from all clusters. These serve as potential covariates, in addition to individual genes.
Using cluster averages as covariates biases selection toward correlated sets of genes, which reduces overfitting.
Forward stepwise algorithm with a prescribed number of terms.
Provision for interactions with terms already included.
Model choice (number of terms) via cross-validation.
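The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy expression data, the cluster membership `node_A`, and the helper names `cluster_average` and `forward_stepwise` are all hypothetical. The squared-error gain used to pick each term is a stand-in for the score reported in the harvesting tables, and interaction terms are left out.

```python
# Sketch: cluster-average covariates plus forward stepwise selection.
# Data, cluster membership and function names are hypothetical illustrations.

def mean(v):
    return sum(v) / len(v)

def center(v):
    m = mean(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cluster_average(genes, members):
    """Average expression profile (one value per sample) over the member genes."""
    n = len(genes[members[0]])
    return [mean([genes[g][i] for g in members]) for i in range(n)]

def forward_stepwise(candidates, y, n_terms):
    """Greedy least-squares selection of up to n_terms covariates (intercept implied).

    Returns the chosen names and the residual sum of squares after each step.
    """
    resid = center(y)                                   # residual after the intercept
    pool = {k: center(v) for k, v in candidates.items()}
    chosen, rss_path = [], []
    for _ in range(n_terms):
        usable = [k for k in pool if dot(pool[k], pool[k]) > 1e-12]
        if not usable:
            break
        # Drop in RSS from adding candidate c is (c.r)^2 / (c.c).
        best = max(usable,
                   key=lambda k: dot(pool[k], resid) ** 2 / dot(pool[k], pool[k]))
        b = pool.pop(best)
        bb = dot(b, b)
        coef = dot(b, resid) / bb
        resid = [r - coef * x for r, x in zip(resid, b)]
        # Orthogonalize the remaining candidates against the chosen term, so the
        # next gain equals the RSS drop of a full least-squares refit.
        for name in list(pool):
            proj = dot(pool[name], b) / bb
            pool[name] = [x - proj * z for x, z in zip(pool[name], b)]
        chosen.append(best)
        rss_path.append(dot(resid, resid))
    return chosen, rss_path

genes = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "g2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "g3": [5.0, 3.0, 6.0, 1.0, 2.0, 4.0],
    "g4": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
}
clusters = {"node_A": ["g1", "g2"]}      # hypothetical node from a prior clustering
candidates = dict(genes)                 # individual genes stay in the pool
for node, members in clusters.items():
    candidates[node] = cluster_average(genes, members)

y = [4.1, 3.9, 7.9, 8.1, 12.1, 11.9]     # roughly 1 + 2 * node_A
chosen, rss_path = forward_stepwise(candidates, y, n_terms=2)
```

In this toy example the cluster average `node_A` enters first, ahead of either of its member genes. An interaction with an included term could be offered to the pool as the elementwise product of two profiles; that step is omitted here.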
[Figure: Hierarchical clustering dendrograms of items 1 to 11, under single linkage and average linkage.]
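The two linkage rules differ only in how the distance between two clusters is defined: single linkage uses the closest pair of members, average linkage the mean over all pairs. The following is a naive sketch of agglomerative clustering under both rules; the function name `agglomerate` and the five toy points are assumptions for illustration, and the quadratic search is meant for clarity rather than for thousands of genes.

```python
# Sketch: naive agglomerative clustering with single or average linkage.
# Function name and toy data are hypothetical.

def agglomerate(points, linkage="average"):
    """Return the merge history as (members_a, members_b, height) tuples."""
    def dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5

    def cluster_dist(A, B):
        d = [dist(i, j) for i in A for j in B]
        # single linkage: closest pair; average linkage: mean over all pairs
        return min(d) if linkage == "single" else sum(d) / len(d)

    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        pairs = [(cluster_dist(a, b), a, b)
                 for k, a in enumerate(clusters) for b in clusters[k + 1:]]
        h, a, b = min(pairs)                    # closest pair of clusters
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, h))
    return merges

points = [(0.0,), (1.0,), (2.5,), (6.0,), (7.0,)]
single_heights = [h for _, _, h in agglomerate(points, "single")]
average_heights = [h for _, _, h in agglomerate(points, "average")]
```

On these points both rules merge the same clusters in the same order, but the average-linkage merge heights are never below the single-linkage ones, which is why the two dendrograms are drawn on different vertical scales.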
Kappa Opioid / Harvesting / Average Linkage
Step  Node  Parent  Score  Size
1     6295  0       22.40  687
2     1380  6295    19.67  6
3     663   0       15.62  2
4     3374  663     10.69  3
5     1702  0       12.92  2
6     6268  663     11.27  83
$y = \beta_0 + \beta_1 \bar{x}_{\mathrm{Node6295}} + \beta_2 (\bar{x}_{\mathrm{Node1380}} \cdot \bar{x}_{\mathrm{Node6295}}) + \cdots$
Kappa Opioid / Harvesting / Single Linkage
Step  Node   Parent  Score  Size
1     g3655  0       21.97  1
2     2050   g3655   20.62  3
3     g900   g3655   16.91  1
4     g1324  g3655   16.01  1
5     g1105  g3655   24.34  1
6     g230   g3655   12.44  1
$y = \beta_0 + \beta_1 x_{\mathrm{Gene3655}} + \beta_2 (\bar{x}_{\mathrm{Node2050}} \cdot x_{\mathrm{Gene3655}}) + \cdots$
Kappa Opioid: 5-fold CV Error Variance
[Figure: Residual variance (0 to 4*10^6) versus number of terms (1 to 7). Curves: Clustered Genes, Original Genes, Training Error.]
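Choosing the number of terms by cross-validation amounts to refitting the model on all folds but one, predicting the held-out fold, and comparing the pooled prediction error across model sizes. A minimal sketch follows. The fitter `fit_nested` is a hypothetical stand-in (mean only versus a straight line) rather than the harvesting procedure itself, and the interleaved, unshuffled folds are a simplification.

```python
# Sketch: k-fold cross-validation over the number of terms.
# fit_nested is a hypothetical stand-in for the real fitting procedure.

def kfold_indices(n, k):
    """Split range(n) into k interleaved folds (deterministic, no shuffling)."""
    return [list(range(f, n, k)) for f in range(k)]

def cv_error(fit, xs, ys, n_terms, k=5):
    """Mean squared prediction error over k held-out folds."""
    n = len(ys)
    sq_err = 0.0
    for fold in kfold_indices(n, k):
        held = set(fold)
        tr_x = [xs[i] for i in range(n) if i not in held]
        tr_y = [ys[i] for i in range(n) if i not in held]
        predict = fit(tr_x, tr_y, n_terms)      # fit on the training folds only
        sq_err += sum((ys[i] - predict(xs[i])) ** 2 for i in fold)
    return sq_err / n

def fit_nested(xs, ys, n_terms):
    """0 terms: mean only; 1 term: least-squares straight line."""
    my = sum(ys) / len(ys)
    if n_terms == 0:
        return lambda x: my
    mx = sum(xs) / len(xs)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

xs = [float(i) for i in range(10)]
ys = [2.0 * x + (0.5 if i % 2 else -0.5) for i, x in enumerate(xs)]
errors = {m: cv_error(fit_nested, xs, ys, m, k=5) for m in (0, 1)}
best_terms = min(errors, key=errors.get)
```

Unlike the training error, which can only decrease as terms are added, the held-out error in `errors` can rise again once extra terms start fitting noise; that is what makes it usable for model choice.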
Gene Harvesting: Kappa-Opioid
[Figure: Correlations: Node 6295. Horizontal axis from -0.4 to 1.0; vertical axis from 0 to 100.]
Gene Harvesting: Kappa-Opioid
[Figure: Scores: Node 6295. Horizontal axis from 0 to 14; vertical axis from 0 to 200. Annotation: Node score = 22.4!]
Kappa Opioid: 10-fold CV Error Variance
[Figure: Residual variance (0 to 500000) versus number of terms (1 to 6). Curves: Constrained Harvesting, Training Error.]
Smoothing
Recall simple linear model: $E(Y \mid X) = \beta_0 + \beta_1 X$
Dependence of E(Y) on X not necessarily linear.
Can extend model by adding terms, e.g., $X^2$
⇒ problematic: what terms? when to add?
What is desirable is to have
1. the data dictate appropriate functional form without imposing rigid parametric assumptions,
2. a corresponding automated fitting procedure.
Key concepts: locally determined fit.
Issues: what is local? how to fit?
Resultant methods: (scatterplot) smoothers.
Resultant model: $E(Y \mid X) = \beta_0 + s(X; \lambda)$
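One simple answer to "what is local? how to fit?" is a running-line smoother: for each target point, take the fraction `span` of observations nearest in X and fit a least-squares line to them alone. The sketch below assumes this particular smoother, with the span playing the role of λ; the slides do not say which smoother is intended, and the function name and toy data are hypothetical.

```python
# Sketch: running-line scatterplot smoother with a nearest-neighbour span.
# An assumed, simple choice of smoother; name and data are hypothetical.

def running_line_smooth(xs, ys, span):
    """Fitted value at each x from a least-squares line through its neighbours."""
    n = len(xs)
    k = max(2, min(n, round(span * n)))         # neighbourhood size
    fitted = []
    for x0 in xs:
        nb = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
        mx = sum(xs[i] for i in nb) / k
        my = sum(ys[i] for i in nb) / k
        sxx = sum((xs[i] - mx) ** 2 for i in nb)
        sxy = sum((xs[i] - mx) * (ys[i] - my) for i in nb)
        slope = sxy / sxx if sxx > 0 else 0.0   # flat fit if all x's coincide
        fitted.append(my + slope * (x0 - mx))
    return fitted

xs = [float(i) for i in range(10)]
ys = [x ** 2 for x in xs]                       # clearly nonlinear in x
fit_global = running_line_smooth(xs, ys, 1.0)   # span = 100%
fit_local = running_line_smooth(xs, ys, 0.3)    # span = 30%
```

With span = 100% every neighbourhood is the whole sample, so the smoother reproduces the simple linear model recalled above; shrinking the span lets the fit follow curvature at the cost of more variability.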
[Figure: log(PSA) versus log(Capsular Penetration), with smooths at span = 10%, span = 25%, and span = 100%.]