Foundation of Mathematical Biology

UCSF

Homework: Due November 6th

Problem 1: You wish to design an experiment that will identify a gene that is highly expressed in cervical scrapes from women with cervical disease but that is poorly expressed in samples from women free of disease.

A preliminary experiment gives you 40,000 mean gene expression values from normal cervical samples, measured versus a pooled reference RNA of which you have a large quantity. These values are $G_1, \ldots, G_{40{,}000}$.

For your initial experiment, to prove feasibility, you have 5 disease samples and 5 normal samples. You will measure gene expression in these versus the same pooled reference as above. This will yield $D_{i,j}$ and $F_{i,j}$, for samples $i = 1, \ldots, 5$ and genes $j = 1, \ldots, 40{,}000$, corresponding to the gene expression values for diseased and disease-free samples.

Mathematically define the characteristics of a gene that would serve as a good disease marker.

Given that we have 40,000 variables and just 10 samples, is there a possibility that you will find a statistically defensible result to support your grant?

Construct an example of data values to defend your answer.

How can you use permutation analysis to quantify the significance of your best nominal gene?
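One way to picture such a permutation analysis is the sketch below. It is only an illustration under stated assumptions, not a prescribed solution: the Welch t statistic, the 200 simulated null genes (instead of 40,000, to keep it fast), and all function names are hypothetical choices. The idea is to compare the best observed per-gene statistic with the best statistic obtained under every relabeling of the 10 samples. With 5 versus 5 samples there are only 252 distinct relabelings, so the smallest attainable p-value is 1/252.

```python
import itertools
import random


def t_stat(a, b):
    """Welch t statistic for group a minus group b (one-sided: a > b)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    se = (va / len(a) + vb / len(b)) ** 0.5
    return (ma - mb) / se if se > 0 else 0.0


def max_stat(data, disease_idx):
    """Largest per-gene t statistic for one assignment of disease labels."""
    n_samples = len(data[0])
    normal_idx = [i for i in range(n_samples) if i not in disease_idx]
    return max(
        t_stat([row[i] for i in disease_idx], [row[i] for i in normal_idx])
        for row in data
    )


def max_t_permutation_p(data, disease_idx):
    """Exact permutation p-value for the best nominal gene.

    Enumerates every way of choosing which samples are labeled 'disease'
    and counts how often the best gene under that relabeling does at least
    as well as the best gene under the true labels.
    """
    observed = max_stat(data, disease_idx)
    n_samples = len(data[0])
    splits = list(itertools.combinations(range(n_samples), len(disease_idx)))
    hits = sum(max_stat(data, s) >= observed - 1e-12 for s in splits)
    return observed, hits / len(splits)


# Hypothetical data: 200 genes of pure noise, columns 0-4 labeled 'disease'.
rng = random.Random(0)
data = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(200)]
observed, p = max_t_permutation_p(data, disease_idx=(0, 1, 2, 3, 4))
print(f"best nominal t = {observed:.2f}, permutation p = {p:.3f}")
```

Because the maximum is taken over all genes inside every relabeling, the resulting p-value already accounts for having searched through many genes to find the best one.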

Problem 2: You take the best nominal gene from Problem 1 (significant or not) and are to design a follow-up experiment to test its utility as a marker. This is not a clinical trial design, but rather a step toward that. You wish only to confirm that expression is higher in the disease samples than in the normal samples.

Suppose that, for the disease samples, Gene X had mean expression 7.0 with sample standard deviation 1.2. For the normal samples, Gene X had mean expression 5.3 with sample standard deviation 1.1.

Assume that you will use equal numbers of disease and normal samples in this follow-up experiment.

How many new disease and normal samples do you estimate you will need in order to obtain a result suggesting that the gene expression difference is significant at p = 0.05? (Hint: You may need to use the table of the distribution of t from Lecture I.)
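As a rough starting point, the arithmetic can be sketched with the normal approximation n = (z_alpha + z_beta)^2 (s1^2 + s2^2) / delta^2 per group. This is a sketch under assumptions the problem does not state: a one-sided test, a target power of 0.80, and the function name `samples_per_group` are all hypothetical choices. Because it uses normal rather than t quantiles, it tends to understate n for small samples, so the t table from the hint would be needed to refine the estimate.

```python
import math
from statistics import NormalDist


def samples_per_group(mean_a, sd_a, mean_b, sd_b,
                      alpha=0.05, power=0.80, one_sided=True):
    """Normal-approximation sample size per group for a two-sample comparison.

    The power target and sidedness are assumptions, not given in the problem.
    """
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha if one_sided else 1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    delta = mean_a - mean_b
    n = (z_alpha + z_beta) ** 2 * (sd_a ** 2 + sd_b ** 2) / delta ** 2
    return math.ceil(n)


# Values for Gene X from the problem statement.
n_per_group = samples_per_group(7.0, 1.2, 5.3, 1.1)
print(f"approximate samples per group: {n_per_group}")
```

Lowering `alpha` (for example, to correct for testing many genes) or raising `power` increases the estimate.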

If you used a parametric statistic for the question above, is there a way to reduce the number of samples required by using a nonparametric test instead? Please make a statistical or probabilistic argument to support your answer.

Suppose you decide to follow up on the best 100 genes from your initial experiment and you want to see if any of them are significant. Does this affect your sample size calculation? How can you use permutation analysis on your preliminary data to make a good estimate of how many samples you would need?

Problem 3: Is there any theoretical difficulty with performing an experiment and then testing a large number of different statistics and picking the one that suggests a significant result?

Problem 4: Suppose you have a pathological case for Pearson’s r: 9 points where X and Y are chosen from standard normal distributions and one outlier point at (1000,1000). Quite often, you will observe an anomalously high r, which will appear significant according to statistical theory.

Do you expect that a permutation-based method will give you a more pessimistic estimate of significance in this case?

Do you expect a significant difference between a pure permutation approach and a resampling approach (with replacement)?
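The pathological case is easy to simulate, which can help in reasoning about the permutation question. The sketch below is a hypothetical setup, not a reference solution: the seeds, the 2,000 shuffles, the 0.9 threshold, and the function names are arbitrary choices. It shuffles Y against X and records how often the shuffled r is large. A shuffled r can only be near 1 when the two outlier coordinates happen to be paired again, which occurs in about 1 of every 10 shuffles.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def permutation_summary(xs, ys, n_perm=2000, seed=0):
    """Shuffle ys against xs and summarize the resulting null distribution."""
    rng = random.Random(seed)
    observed = pearson_r(xs, ys)
    shuffled = list(ys)
    perm_rs = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        perm_rs.append(pearson_r(xs, shuffled))
    # One-sided permutation p-value, counting the observed pairing itself.
    p_value = (1 + sum(r >= observed for r in perm_rs)) / (n_perm + 1)
    # Share of shuffles with a large r (arbitrary illustrative threshold).
    frac_high = sum(r > 0.9 for r in perm_rs) / n_perm
    return observed, p_value, frac_high


# Nine standard normal points plus the outlier at (1000, 1000).
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(9)] + [1000.0]
ys = [rng.gauss(0, 1) for _ in range(9)] + [1000.0]
observed_r, p_value, frac_high = permutation_summary(xs, ys)
print(f"r = {observed_r:.6f}, permutation p = {p_value:.4f}, "
      f"share of shuffles with r > 0.9 = {frac_high:.3f}")
```

Note that the permutation p-value itself depends on how the nine ordinary points happen to line up, since every shuffle that re-pairs the outlier also gives an r very close to 1.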