
7.1 Likelihood and Information


[36, 37, 38, 39]. Some of the early promoters of the LP (Barnard, Birnbaum) later came close to rejecting it or to restricting its applicability. The reluctance to accept the LP probably has its origin in its incompatibility with some concepts of classical statistics. A frequently expressed argument against the LP is that the confidence intervals of frequentist statistics cannot be derived from the likelihood function alone and thus contradict the LP. But this fact merely shows that certain statistical methods do not use the full information content of a measurement and/or use irrelevant information. Another reason lies in problems with applying the LP outside the natural sciences, for instance in medicine or biology, where it is often not possible to parameterize the empirical models in a stringent way. Such uncertainties in the model prohibit the application of the LP: the exact validity of the model is a basic requirement for its application.

In the literature, examples are presented which pretend to contradict the LP. These examples are not really convincing and rather strengthen the LP. In any case, they often involve quite exotic distributions which are irrelevant in physics applications and which, when treated in a frequentist way, lead to unacceptable results [39].

We abstain from reproducing the rather abstract proof of the LP and limit ourselves to a simple and transparent illustration of it:

The quantity which contains all the information we have on θ after the measurement is the p.d.f. of θ,

$$g(\theta) = \frac{L(\theta|x)\,\pi(\theta)}{\int L(\theta|x)\,\pi(\theta)\,d\theta}\,.$$

It is derived from the prior density and the likelihood function. The prior does not depend on the data; thus the complete information that can be extracted from the data, and which is relevant for g(θ), must be contained in the likelihood function.

A direct consequence of the LP is that in the absence of prior information, optimal parameter inference has to be based solely upon the likelihood function. It is then logical to select as the estimate the value of the parameter which maximizes the likelihood function, and to choose the error interval such that the likelihood is constant at its border, i.e. smaller everywhere outside than inside (see Chap. 8). All approaches which are not based on the likelihood function are inferior to the likelihood method or at best equivalent to it.
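To make this prescription concrete, the following Python sketch (with hypothetical exponential decay data, not taken from the text) maximizes the log-likelihood numerically on a grid and reads off the interval where ln L has dropped by 1/2 from its maximum, anticipating the interval definition of Chap. 8:

```python
import numpy as np

# Hypothetical sample of 20 decay times with unit true rate (illustration only)
rng = np.random.default_rng(1)
t = rng.exponential(scale=1.0, size=20)

def log_likelihood(gamma):
    # ln L(gamma) = N ln(gamma) - gamma * sum(t) for exponential decay times
    return t.size * np.log(gamma) - gamma * t.sum()

gammas = np.linspace(0.05, 5.0, 20000)
lnL = log_likelihood(gammas)
gamma_hat = gammas[np.argmax(lnL)]              # MLE, analytically N / sum(t)

# Error interval: the region where ln L lies within 1/2 of its maximum
inside = gammas[lnL >= lnL.max() - 0.5]
print(f"MLE = {gamma_hat:.3f}, interval = [{inside.min():.3f}, {inside.max():.3f}]")
```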

7.1.4 Bias of Maximum Likelihood Results

In frequentist statistics an important aspect of point estimation is the evaluation of the bias of an estimator. The bias b of an estimate $\hat{\theta}$ is the deviation of its expectation value from the true value θ of the parameter:

$$b = E(\hat{\theta}) - \theta\,.$$

Example 107. Bias of the estimate of a decay parameter

We estimate the decay parameter γ from 5 observed decays of an unstable particle. We have seen in a previous example that the MLE $\hat{\gamma}$ is the inverse of the average of the individual decay times, $\hat{\gamma} = 1/\bar{t}$. The mean value $\bar{t}$ follows a gamma distribution (see Sect. 3.6.8),

$$f(\bar{t}\,|\,\gamma) = \frac{(5\gamma)^5\,\bar{t}^{\,4}}{4!}\,\exp(-5\gamma\bar{t})\,,$$


and thus the expectation value $E(\hat{\gamma})$ of $\hat{\gamma}$ is

$$E(\hat{\gamma}) = \int_0^\infty \hat{\gamma}\,f(\bar{t}\,|\,\gamma)\,d\bar{t} = \int_0^\infty \frac{(5\gamma)^5\,\bar{t}^{\,3}}{4!}\,\exp(-5\gamma\bar{t})\,d\bar{t} = \frac{5}{4}\,\gamma\,.$$

When in a large number of similar experiments with 5 observed events the MLE of the decay rate is determined, its arithmetic mean differs from the true value by 25%; the bias of the MLE is $b = E(\hat{\gamma}) - \gamma = \gamma/4$. For a single observed decay the bias is infinite.
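A short Monte Carlo check of this number (a sketch with an arbitrary true rate γ = 1, not part of the original text) takes only a few lines of Python:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma_true = 1.0                      # arbitrary true decay rate for the simulation
n_events, n_experiments = 5, 200_000

# Each experiment: 5 exponential decay times; the MLE is the inverse of their mean
t = rng.exponential(1.0 / gamma_true, size=(n_experiments, n_events))
gamma_hat = 1.0 / t.mean(axis=1)

print(gamma_hat.mean())               # approximately 1.25 = (5/4) * gamma_true
```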

With an increasing number of observations the bias decreases; the MLE is a consistent estimator (see Appendix 13.2.2).

Biases occur quite frequently for small samples. The word bias somehow suggests that something is wrong, and thus it appears quite disturbing at first sight that estimates may be systematically biased. In fact, most of the conventional statistical literature recommends correcting for the bias. However, there is no obvious reason for a correction, and a closer study reveals that bias corrections lead to difficulties when we combine different measurements $\hat{\theta}_i$ in the usual way, weighting the results by the inverse covariance matrix, or, in the one-dimensional case, according to (4.7) simply by the inverse error squared,

$$\bar{\theta} = \frac{\sum_i \hat{\theta}_i/\delta_i^2}{\sum_i 1/\delta_i^2}\,.$$

Since usually the estimated errors depend on the value of the MLE, the weighting introduces a bias which often partially compensates a bias of the MLE.

Let us resume our last example and assume that many experiments measure the decay rate from sample sizes of N = 5. The estimates $\hat{\gamma}_i$ will vary from experiment to experiment. Each experiment will, apart from the estimate, evaluate the error $\delta_i$, which turns out to be proportional to $\hat{\gamma}_i$, namely $\delta_i = \hat{\gamma}_i/\sqrt{5}$. Averaging without bias correction according to our prescription, we will obtain $E(\bar{\gamma}) = 5/6\,\gamma$; thus the bias is reduced, while averaging the bias-corrected estimates would lead to $E(\bar{\gamma}) = 2/3\,\gamma$, a result which is considerably worse. In any case, the bias correction is rather unsatisfactory for a small number of observations: for N = 1 the bias is infinite and thus the corrected value would be zero!
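The following simulation sketch (hypothetical true rate γ = 1, weights 1/δ² as described above; the factor 4/5 in the last line mimics the bias correction $E(\hat{\gamma}) = 5\gamma/4$) reproduces both numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma_true, N = 1.0, 5
t = rng.exponential(1.0 / gamma_true, size=(100_000, N))
gamma_hat = 1.0 / t.mean(axis=1)           # MLE of each experiment, biased by 5/4
delta = gamma_hat / np.sqrt(N)             # estimated error, proportional to the MLE

def weighted_mean(est, err):
    w = 1.0 / err**2                       # weights = inverse error squared
    return np.sum(w * est) / np.sum(w)

print(weighted_mean(gamma_hat, delta))               # ~ 5/6 * gamma_true
print(weighted_mean(0.8 * gamma_hat, 0.8 * delta))   # bias-corrected inputs: ~ 2/3 * gamma_true
```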

As we have seen, likelihood estimates have the desirable property of being invariant against parameter transformations. For instance, the estimates of the rate $\hat{\gamma}$ and of the mean life $\hat{\tau}$ obey the same relation as the true values, namely $\hat{\gamma} = 1/\hat{\tau}$, while the corresponding relation is not valid for the bias-corrected quantities. Generally, the bias is not invariant against parameter transformations and thus its correction is problematic: the requirement of an unbiased estimate is in conflict with parametrization invariance.

Contrary to many colleagues, we argue that a systematic bias correction of the estimate is baneful. It is in contradiction with the LP and should be applied only in well-founded situations. It is much more sensible to present the full likelihood function or to specify asymmetric errors since the exact procedure to combine measurements is to add the log-likelihoods. We will come back to this point in Chap. 8 and discuss the virtue of bias corrections again in Appendix 13.6.


Fig. 7.1. Likelihood function of the width of a uniform distribution for 12 observations.

For a similar reason it is dangerous to interpret the likelihood function as the p.d.f. of the parameter. The following example illustrates this fact.

Example 108. Bias of the estimate of a Poisson rate with observation zero

We search for a rare decay but we do not observe any. The likelihood for the mean rate λ is, according to Poisson statistics,

$$L(\lambda) = \frac{e^{-\lambda}\lambda^0}{0!} = e^{-\lambda}\,.$$

When we normalize the likelihood function to obtain the Bayesian p.d.f. with a uniform prior, we obtain the expectation value $\langle\lambda\rangle = 1$, while the value $\hat{\lambda} = 0$ corresponds to the maximum of the likelihood function.

(It may seem astonishing that an expectation value of one follows from a null measurement. This result is a consequence of the assumed uniform prior distribution, which is not unreasonable, because had we not anticipated the possibility of a decay, we would not have performed the measurement. Since mean rates different from zero may also lead to the observation of zero, it is natural that the expectation value of λ is different from zero.)

Now if none of 10 similar experiments were to observe a decay, naively averaging the expectation values alone would again result in a mean of one, a crazy value. Strictly speaking, the likelihoods of the individual experiments should be multiplied, or, equivalently, the null rate would have to be normalized to ten times the original time, with the Bayesian result 1/10.
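A minimal numerical check of these two numbers (a sketch, with the rate expressed in units of the inverse single-experiment running time) integrates the flat-prior posterior for one and for ten combined null experiments:

```python
import numpy as np

lam = np.linspace(0.0, 50.0, 200_001)     # rate in units of 1 / (single running time)

def posterior_mean(n_experiments):
    # Uniform prior; zero counts in each of n experiments gives likelihood exp(-n*lambda)
    like = np.exp(-n_experiments * lam)
    return np.sum(lam * like) / np.sum(like)   # simple Riemann-sum expectation

print(posterior_mean(1))    # ~ 1.0
print(posterior_mean(10))   # ~ 0.1
```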

We study a further example.

Example 109. Bias of the measurement of the width of a uniform distribution


Let x1, . . . , xN be N observations of a sample following a uniform distribution f(x) = 1/θ with 0 < x < θ. We search for the value of the parameter θ. Figure 7.1 shows the observations and the likelihood function for N = 12. The likelihood function is

$$L = 0 \;\;\text{for } \theta < x_{\mathrm{max}}\,, \qquad L = \frac{1}{\theta^N} \;\;\text{for } \theta \ge x_{\mathrm{max}}\,.$$

Obviously, the likelihood has its maximum when θ coincides with the largest observation $x_{\mathrm{max}}$ of the sample: $\hat{\theta} = x_{\mathrm{max}}$. (Here $x_{\mathrm{max}}$ is a sufficient statistic.)

For smaller values of θ the likelihood is zero. The estimate is biased towards small values. Given a sample of size N, we obtain N + 1 gaps between the observations and the borders of [0, θ]; the average distance of the largest observation from θ thus is θ/(N + 1), and the bias is $b = E(\hat{\theta}) - \theta = -\theta/(N+1) \approx -\hat{\theta}/N$. There is no reason to correct for the bias. We rather prefer to present the biased result with a one-sided error,

$$\theta = x_{\mathrm{max}}\;^{+x_{\mathrm{max}}/N}_{-0}\,,$$

or, alternatively, the full likelihood function.
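The size of this bias is easily verified by simulation; the sketch below (not part of the original text) uses a hypothetical true value θ = 1 and N = 12 as in Fig. 7.1:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, N = 1.0, 12
x = rng.uniform(0.0, theta_true, size=(100_000, N))
theta_hat = x.max(axis=1)                  # MLE: the largest observation

print(theta_hat.mean())                    # ~ N/(N+1) * theta_true = 0.923
print(theta_hat.mean() - theta_true)       # bias ~ -theta_true/(N+1) = -0.077
```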

A further, more general discussion of the bias problem is given in Appendix 13.6.

7.1.5 Stopping Rules

An experiment searches for a rare reaction. Just after the first successful observation at time t, the experiment is stopped. Is the resulting rate 1/t systematically biased? Do we have to take the stopping rule into account in the inference process? The answer is "no", but many people have a different opinion. This is the reason why we find the expression stopping rule paradox in the literature.

The possibility of stopping an experiment without compromising the analysis, for instance because a detector failed, the money ran out, or the desired precision was reached, is a considerable simplification of the data analysis.

The independence of the parameter estimate from a large class of stopping conditions is a direct consequence of the LP, because the likelihood function of a parameter determined from sequentially obtained data does not depend on the stopping rule.

In this context we examine a simple example.

Example 110. Stopping rule: four decays in a time interval

In two similar experiments the lifetime of the same unstable particle is measured. In experiment A the time interval t is fixed and 4 decays are observed. In experiment B the time t required to observe 4 decays is measured. Accidentally the two times coincide. Thus in both experiments 4 decays are registered in the time interval t, but in experiment A the number n of decays is the random variable, whereas in experiment B it is the time t. Do both experiments find the same estimate of the mean life, namely $\hat{\tau} = t/4$, and the same error interval? We might think "no", because in the first experiment the fourth decay happened earlier than in the second.


The likelihood functions for the two situations are deduced for experiment A from the Poisson distribution and for experiment B from the exponential time distribution:

$$L_A(\theta|n) = P(n|\theta t) = \frac{e^{-\theta t}(\theta t)^4}{4!} \propto \theta^4 e^{-\theta t}\,,$$
$$L_B(\theta|t) = \theta^4 e^{-\theta t} \propto L_A(\theta|n)\,.$$

The likelihood functions are equal up to an irrelevant factor, and consequently the results are also the same. The stopping rule does not influence the analysis. The only relevant data are the number of decays and the time interval.
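That the two likelihood functions have the same shape in θ can be checked numerically; the sketch below (hypothetical numbers t = 2, n = 4) uses the standard scipy distributions for the Poisson count and for the gamma-distributed waiting time of the fourth decay:

```python
import numpy as np
from scipy.stats import poisson, gamma

t_obs, n_obs = 2.0, 4                       # hypothetical observation: 4 decays in time 2
theta = np.linspace(0.5, 6.0, 6)            # a few trial values of the rate

# Experiment A: fixed time, the number of decays is Poisson distributed
lnL_A = poisson.logpmf(n_obs, theta * t_obs)
# Experiment B: fixed number of decays, the waiting time is gamma distributed
lnL_B = gamma.logpdf(t_obs, a=n_obs, scale=1.0 / theta)

print(lnL_A - lnL_B)                        # a constant offset: identical shape in theta
```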

The fact that an arbitrary sequential stopping rule does not change the expectation value is illustrated with the example shown in Fig. 7.2. A rate is determined, and the measurement is stopped if a sequence of 3 decays occurs within a short time interval of only one second. It is probable that the observed rate is higher than the true one; the estimate is too high in most cases. However, if we perform many such experiments one after the other, their combination is equivalent to a single very long experiment in which the stopping rule does not influence the result and from which we can estimate the mean rate with high precision. Since the log-likelihood of the long experiment is equal to the sum of the log-likelihoods of the short experiments, the log-likelihoods of the short experiments obviously represent the measurements correctly. The stopping rule does not enter the likelihood function and therefore must be irrelevant.

Why does the statement that neglecting the stopping rule is justified contradict our intuition? Well, most of the sequences indeed lead to too high rates, but when we combine measurements the few long sequences get a higher weight, and since they tend to produce lower rates, the average is correct. On the other hand, one might argue that the LP ignores the information that in most cases the true value of the rate is lower than the MLE. This information clearly matters if we were to bet on this property, but it is irrelevant for estimating the probability density of the parameter value. A bias correction would improve the not very precise estimates of the short sequences somewhat, but would be very unfavorable for the fewer but more precise long sequences, and for a single experiment we cannot know whether we have a short or a long sequence. In fact, in exploratory experiments like those in particle physics or astrophysics we are always concerned with a single experiment (see also Appendix 13.6).

7.2 Further Methods of Parameter Inference

In the previous sections we have discussed in some detail the maximum likelihood method. Another very popular method is the least square method which we can derive from the maximum likelihood method. A further approach to estimate unknown parameters of a distribution is the moments method which we will treat briefly in the following section.

7.2.1 The Moments Method

The moments of a distribution which depends on a parameter θ usually also depend on θ:


Fig. 7.2. An experiment is stopped when 3 observations are registered within a short time interval (indicated by a box). An arbitrarily long experiment can be subdivided into many such sub-experiments following the stopping rule.

$$\mu_n(\theta) = \int x^n f(x|\theta)\,dx\,. \qquad (7.4)$$

The empirical moments,

$$\hat{\mu}_n = \frac{1}{N}\sum_i x_i^n\,,$$

e.g. the sample mean or the mean of squares, which we can extract trivially for a sample, are estimators of the moments of the distribution. From the inverse function $\mu^{-1}$ we obtain a consistent estimate of the parameter,

$$\hat{\theta} = \mu^{-1}(\hat{\mu})\,,$$

because according to the law of large numbers we have (see Appendix 13.1)

$$\lim_{N\to\infty} P\{|\hat{\mu} - \mu| > \varepsilon\} = 0\,.$$

It is clear that any function u(x) for which expected value and variance exist, and where $\langle u\rangle$ is an invertible function of θ, can be used instead of $x^n$. Therefore the method is somewhat more general than suggested by its name.

If the distribution has several parameters to be estimated, we must use several moments or expected values, approximate them by empirical averages, and solve the resulting system of – in general non-linear – equations for the unknown parameters.

The estimators derived from the lower moments are usually more precise than those computed from the higher ones. Parameter estimation from the moments is usually inferior to that of the ML method; only if the moments used form a sufficient statistic do the two approaches produce the same result.

The uncertainties of the fitted parameters have to be estimated from the covariance matrix of the corresponding moments and subsequent error propagation, or alternatively by a Monte Carlo simulation which generates the measurement several times. Also the bootstrap method, which will be introduced in Chap. 12, can be employed. Sometimes the error calculation is a bit annoying and reproduces the ML error intervals only in the large sample limit.

Example 111. Moments method: mean and variance of the normal distribution

We come back to the example from Sect. 6.5.2. For a sample {x1, . . . , xN} following the distribution

$$f(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

We determine independently the parameters μ and σ. We use the abbreviations $\bar{x}$ for the sample mean, $\overline{x^2}$ for the mean of the squares, and $v^2 = \overline{(x-\bar{x})^2} = \overline{x^2} - \bar{x}^2$ for the empirical variance.

The relation between the moment $\mu_1$ and μ is simply $\mu_1 = \mu$, therefore

$$\hat{\mu} = \hat{\mu}_1 = \bar{x}\,.$$

In Chap. 3 we have derived the relation (3.15), $\langle v^2\rangle = \sigma^2(N-1)/N$, between the expectation of the empirical variance and the variance of the distribution; inverting it, we get

$$\hat{\sigma} = v\,\sqrt{\frac{N}{N-1}}\,.$$

The two estimates are uncorrelated. The error of $\hat{\mu}$ is derived from the estimated variance,

$$\delta_\mu = \frac{\hat{\sigma}}{\sqrt{N}}\,,$$

and the error of $\hat{\sigma}$ is determined from the expected variance of v. We omit the calculation; the result is

$$\delta_\sigma = \frac{\hat{\sigma}}{\sqrt{2(N-1)}}\,.$$

In the special case of the normal distribution, the independent point estimates of μ and σ from the moments method are identical to those of the maximum likelihood method. The errors differ for small samples but coincide in the limit N → ∞.
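As a compact illustration, the sketch below applies the formulas derived above to a hypothetical normal sample (the numbers are assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=0.5, size=100)   # hypothetical sample

N = x.size
mu_hat = x.mean()                              # estimate from the first moment
v = x.std(ddof=0)                              # empirical standard deviation
sigma_hat = v * np.sqrt(N / (N - 1))           # corrected as derived above

delta_mu = sigma_hat / np.sqrt(N)
delta_sigma = sigma_hat / np.sqrt(2 * (N - 1))
print(mu_hat, delta_mu, sigma_hat, delta_sigma)
```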

The moments method has the advantage of being very simple, especially in the case of distributions which depend linearly on the parameters – see the next example below:

Example 112. Moments method: asymmetry of an angular distribution

Suppose we have to determine from a sample the asymmetry parameter α of a distribution f(x) = (1 + αx)/2, linear in x = cos β. The first moment of the distribution is $\mu_1 = \alpha/3$. Thus we can compute the parameter from the sample mean $\bar{x} = \sum x_i/N$:

$$\hat{\alpha} = 3\,\bar{x}\,.$$


The mean square error from an individual measurement x is proportional to the variance of the distribution:

$$\mathrm{var}(\hat{\alpha}) = \frac{9\,\mathrm{var}(x)}{N} = \frac{3 - \alpha^2}{N}\,. \qquad (7.5)$$

Using instead of α its estimate, we get

$$\delta_{\hat{\alpha}} = 3\,\delta_{\bar{x}} = \sqrt{\frac{3 - 9\bar{x}^2}{N}}\,.$$

A likelihood fit is, in accordance with the likelihood principle, more accurate and reflects much better the result of the experiment, which, because of the kinematical limits³ |α| < 1, cannot be described very well by symmetric errors, especially when the sample size is small and the estimate happens to lie near the boundary. In this case the maximum likelihood method should be applied.
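The comparison can be made concrete with a small simulation; the sketch below (hypothetical true value α = 0.5 and sample size N = 200) draws a sample from f(x) = (1 + αx)/2 by accept-reject and contrasts the moments estimate (7.5) with a numerical maximum likelihood fit using scipy's bounded scalar minimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
alpha_true, N = 0.5, 200                       # hypothetical true asymmetry and sample size

# Accept-reject sampling from f(x) = (1 + alpha*x)/2 on [-1, 1]
x = np.empty(0)
while x.size < N:
    cand = rng.uniform(-1.0, 1.0, 5 * N)
    accept = rng.uniform(0.0, 1.0, cand.size) < (1.0 + alpha_true * cand) / 2.0
    x = np.concatenate([x, cand[accept]])[:N]

# Moments estimate and its error, Eq. (7.5)
alpha_mom = 3.0 * x.mean()
delta_mom = np.sqrt((3.0 - alpha_mom**2) / N)

# Maximum likelihood estimate by numerical minimization of -ln L
neg_log_l = lambda a: -np.sum(np.log((1.0 + a * x) / 2.0))
alpha_ml = minimize_scalar(neg_log_l, bounds=(-0.999, 0.999), method="bounded").x

print(alpha_mom, delta_mom, alpha_ml)
```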

An indication of the relative precision of the moments estimate as compared to the MLE is provided by a comparison of the asymptotic efficiencies. In the asymptotic limit N → ∞ the variance of the moments estimate $\hat{\alpha}$ does not approach the limiting value given by the Cramér–Rao inequality, see (13.6) in Appendix 13.2, which can be shown to be achieved by the MLE:

$$\mathrm{var}(\hat{\alpha}_{\mathrm{ML}}) \approx \frac{\alpha^2}{N}\left[\frac{1}{2\alpha}\,\ln\left(\frac{1+\alpha}{1-\alpha}\right) - 1\right]^{-1}.$$

A comparison with (7.5) shows that the asymptotic efficiency of the moments method, defined as

$$\varepsilon = \frac{\mathrm{var}(\hat{\alpha}_{\mathrm{ML}})}{\mathrm{var}(\hat{\alpha})}\,,$$

is unity only for α = 0. It is 0.92 for α = 0.5 and drops to 0.73 for α = 0.8. (At the boundary |α| = 1 the Cramér–Rao relation cannot be applied.)

Note that the p.d.f. of our example is a special case of the usual expansion of an angular distribution into Legendre polynomials $P_l(\cos\beta)$:

$$f(x|\theta) = \Bigl(1 + \sum_{l=1}^{L}\theta_l\,P_l(x)\Bigr)\Big/\,2\,.$$

From the orthogonality of the $P_l$ with the usual normalization,

$$\int_{-1}^{1} P_l(x)\,P_m(x)\,dx = \frac{2}{2l+1}\,\delta_{l,m}\,,$$

it is easy to see that $\theta_l = (2l+1)\langle P_l\rangle$. In the case l = 1, $P_1 = x$, this is the first moment of the distribution and we reproduce $\mu_1 = \alpha/3$.
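The relation $\theta_l = (2l+1)\langle P_l\rangle$ translates directly into an estimator: replace $\langle P_l\rangle$ by the sample average of $P_l(x_i)$. A sketch with a hypothetical sample in which only the l = 1 term is present (so θ₁ ≈ α and the higher coefficients vanish):

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(11)
alpha_true = 0.4

# Accept-reject sample from f(x) = (1 + alpha*x)/2 on [-1, 1]
cand = rng.uniform(-1.0, 1.0, 200_000)
x = cand[rng.uniform(0.0, 1.0, cand.size) < (1.0 + alpha_true * cand) / 2.0]

for l in range(1, 4):
    Pl = legendre.legval(x, [0.0] * l + [1.0])   # evaluate P_l via its coefficient vector
    theta_l = (2 * l + 1) * Pl.mean()            # theta_l = (2l+1) <P_l>
    print(l, round(theta_l, 3))                  # theta_1 ~ alpha, theta_2 and theta_3 ~ 0
```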

The moments method can also be used to provide start values for approximations of likelihood estimators which we discussed in Sect. 6.8.

³These limits stem from the obvious requirement that the p.d.f. has to be positive.


Fig. 7.3. Fit of a curve to measurements.

7.2.2 The Least Square Method

A frequently occurring problem is that measured points with error margins are given through which we want to fit a curve, as shown in Fig. 7.3. Let the curve y = t(x, λ) be measured at N positions xi, yielding the values yi(xi). The standard solution of this so-called regression problem is provided by the least square method, which fixes the parameters of a given function by minimizing the sum of the normalized squared deviations of the function from the measured points.

Given N measured points xi, yi ± δi, and a function t(x, θ), known up to some free parameters θ, the latter are determined such that

$$\chi^2 = \sum_{i=1}^{N}\frac{(y_i - t(x_i,\theta))^2}{\delta_i^2} \qquad (7.6)$$

takes its minimal value.

The least square method goes back to Gauss. Historically it has successfully been applied to astronomical problems and is still the best method we have to adjust parameters of a curve to measured points if only the variance of the error distribution is known. It is closely related to the likelihood method when the errors are normally distributed. Then we can write the p.d.f. of the measurements in the following way:

$$f(y_1,\ldots,y_N|\theta) \propto \exp\left(-\sum_{i=1}^{N}\frac{(y_i - t(x_i,\theta))^2}{2\delta_i^2}\right),$$

and the log-likelihood is

$$\ln L(\theta|y) = -\frac{1}{2}\sum_{i=1}^{N}\frac{(y_i - t(x_i,\theta))^2}{\delta_i^2} = -\frac{1}{2}\,\chi^2\,. \qquad (7.7)$$

Fig. 7.4. χ²-fit (dashed) of a straight line through digital measurements (time channel versus time).

Thus minimizing χ² is equivalent to maximizing the likelihood if the errors are normally distributed, a condition which frequently is at least approximately satisfied. From (7.7) we conclude that the standard deviation errors of the parameters in a least square fit correspond to a change of one unit, Δχ² = 1, i.e. twice the value 1/2 used for the log-likelihood in the maximum likelihood method. In Sect. 3.6.7 we have seen that χ² follows a χ² distribution with f = N − Z degrees of freedom (Z is the number of free parameters), provided the errors are normal. Thus we expect χ² to be of the order of f; large values indicate possible problems with the data or their description. We will investigate this in Chap. 10.
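For Gaussian errors the whole procedure, including the Δχ² = 1 parameter errors, is available in standard libraries. A sketch with hypothetical straight-line data (scipy's curve_fit returns the covariance matrix corresponding to Δχ² = 1 when absolute_sigma=True):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical measured points with Gaussian errors (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
dy = np.full(5, 0.3)

def line(x, a, b):
    return a * x + b

popt, pcov = curve_fit(line, x, y, sigma=dy, absolute_sigma=True)
chi2 = np.sum(((y - line(x, *popt)) / dy) ** 2)
ndf = x.size - 2                                   # N - Z degrees of freedom
print(popt, np.sqrt(np.diag(pcov)), chi2, ndf)
```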

That the least square method can lead to false results if the condition of Gaussian uncertainties is not fulfilled is illustrated in the following example.

Example 113. Counter example to the least square method: gauging a digital clock

A digital clock has to be gauged. Fig. 7.4 shows the time channel as a function of the true time and a least square fit by a straight line. The error bars in the figure are not error bars in the usual sense but indicate the channel width. The fit fails to meet the allowed range of the fifth point and therefore is not compatible with the data. All straight lines which meet all “error bars” have the same likelihood. One correct solution is indicated in the figure.

We can easily generalize the expression (7.6) to the case of correlated errors. Then we have

$$\chi^2 = \sum_{i,j=1}^{N}(y_i - t_i)\,V_{ij}\,(y_j - t_j)\,,$$

where V, the weight matrix, is the inverse of the covariance matrix. The quantity χ2 is up to a factor two equal to the negative log-likelihood of a multivariate normal
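As a minimal numerical sketch of this quadratic form (hypothetical measurements, covariance matrix, and model predictions; not from the text):

```python
import numpy as np

# Hypothetical measurements y, their covariance matrix C, and model predictions t
y = np.array([1.2, 2.1, 2.9])
t = np.array([1.0, 2.0, 3.0])
C = np.array([[0.04, 0.01, 0.00],
              [0.01, 0.04, 0.01],
              [0.00, 0.01, 0.04]])

V = np.linalg.inv(C)        # weight matrix = inverse covariance matrix
r = y - t                   # residuals
chi2 = r @ V @ r
print(chi2)
```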
