13 Appendix

$$\ln L(\theta|\mathbf{x}) = \sum_{i=1}^{N} \ln f(x_i|\theta) \qquad (13.11)$$
which is to be calculated by integration over the variables³ x using the true p.d.f. (with the true parameter θ0). First we prove the inequality

$$\langle \ln L(\theta|x) \rangle < \langle \ln L(\theta_0|x) \rangle \,, \qquad (13.12)$$

for θ ≠ θ0: Since the logarithm is a strictly concave function, there is always ⟨ln(· · ·)⟩ < ln⟨(· · ·)⟩ (Jensen's inequality), hence

$$\left\langle \ln \frac{L(\theta|x)}{L(\theta_0|x)} \right\rangle < \ln \left\langle \frac{L(\theta|x)}{L(\theta_0|x)} \right\rangle = \ln \int \frac{L(\theta|x)}{L(\theta_0|x)}\, L(\theta_0|x)\, dx = \ln 1 = 0 \,.$$
In the last step we used

$$\int L(\theta|x)\, dx = \int \prod_i f(x_i|\theta)\, dx_1 \cdots dx_N = 1 \,.$$
Since ln L(θ|x)/N = Σ ln f(xi|θ)/N is an arithmetic sample mean which, according to the law of large numbers (13.2), converges stochastically to the expected value for N → ∞, we have also (in the sense of stochastic convergence)

$$\ln L(\theta|x)/N \;\to\; \langle \ln f(x|\theta) \rangle = \sum_i \langle \ln f(x_i|\theta) \rangle / N = \langle \ln L(\theta|x) \rangle / N \,,$$
and from (13.12)

$$\lim_{N\to\infty} P\{\ln L(\theta|x) < \ln L(\theta_0|x)\} = 1 \,, \qquad \theta \neq \theta_0 \,. \qquad (13.13)$$
On the other hand, the MLE θ̂ is defined by its extremal condition

$$\ln L(\hat\theta|x) \geq \ln L(\theta_0|x) \,.$$
A contradiction to (13.13) can only be avoided if also

$$\lim_{N\to\infty} P\{|\hat\theta - \theta_0| < \varepsilon\} = 1$$

holds. This means consistency of the MLE.
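The consistency statement can be illustrated numerically. The following sketch is not from the text; the exponential model and all numbers are illustrative choices. It uses the fact that for an exponential p.d.f. with mean θ the MLE is the sample mean, and checks that the estimate approaches θ0 as N grows:

```python
import random

# Consistency sketch: for f(x|theta) = exp(-x/theta)/theta the MLE of
# theta is the sample mean; its distance to the true theta0 shrinks
# with growing sample size N.
random.seed(1)
theta0 = 2.0

def mle_exponential(n):
    """Sample mean = MLE of the exponential mean for n observations."""
    sample = [random.expovariate(1.0 / theta0) for _ in range(n)]
    return sum(sample) / n

errors = {n: abs(mle_exponential(n) - theta0) for n in (100, 100000)}
print(errors)
```

The error for N = 100000 is far smaller than for N = 100, as the stochastic convergence above predicts.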
|
|
|
|
|
|
13.3.2 Efficiency

Since the MLE is consistent, it is asymptotically unbiased for N → ∞. Under certain assumptions in addition to the usually required regularity⁴, the MLE is also asymptotically efficient.

Proof:

³We keep the form of the argument list of L, although now x is not considered as fixed to the experimentally sampled values, but as a random vector with the given p.d.f.

⁴The boundaries of the domain of x must not depend on θ and the maximum of L should not be reached at the boundary of the range of θ.
With the notations of the last paragraph, with L = ∏ fi and using (13.8), the expected value and variance of y = Σ yi = ∂ ln L/∂θ are given by the following expressions:

$$\langle y \rangle = \int \frac{\partial \ln L}{\partial \theta}\, L\, dx = 0 \,, \qquad (13.14)$$

$$\sigma_y^2 = \operatorname{var}(y) = \left\langle \left( \frac{\partial \ln L}{\partial \theta} \right)^2 \right\rangle = -\left\langle \frac{\partial^2 \ln L}{\partial \theta^2} \right\rangle \,. \qquad (13.15)$$
The last relation follows after further differentiation of (13.14) and from the relation

$$\int \frac{\partial^2 \ln L}{\partial \theta^2}\, L\, dx = -\int \frac{\partial \ln L}{\partial \theta} \frac{\partial L}{\partial \theta}\, dx = -\int \frac{\partial \ln L}{\partial \theta} \frac{\partial \ln L}{\partial \theta}\, L\, dx \,.$$
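The identity (13.15) can be checked numerically for a concrete model. In this sketch the exponential p.d.f. and all numbers are illustrative choices, not from the text; both sides of the identity are estimated by Monte Carlo for a single observation:

```python
import random

# MC check of <(d ln f/d theta)^2> = -<d^2 ln f/d theta^2> for
# f(x|theta) = exp(-x/theta)/theta. Here
#   d ln f/d theta   = (x - theta)/theta^2,
#   d^2 ln f/d theta^2 = 1/theta^2 - 2x/theta^3,
# and both expectations should equal 1/theta^2.
random.seed(7)
theta = 2.0
n = 200000

lhs = rhs = 0.0
for _ in range(n):
    x = random.expovariate(1.0 / theta)
    score = (x - theta) / theta ** 2              # d ln f / d theta
    second = 1.0 / theta ** 2 - 2.0 * x / theta ** 3  # d^2 ln f / d theta^2
    lhs += score ** 2
    rhs += -second
lhs /= n
rhs /= n
print(lhs, rhs, 1.0 / theta ** 2)
```

Both Monte Carlo averages come out close to 1/θ² = 0.25, as the identity requires.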
From the Taylor expansion of ∂ ln L/∂θ|θ=θ̂, which is zero by definition, and with (13.15) we find

$$0 = \frac{\partial \ln L}{\partial \theta}\Big|_{\theta=\hat\theta} \approx \frac{\partial \ln L}{\partial \theta}\Big|_{\theta=\theta_0} + (\hat\theta - \theta_0)\, \frac{\partial^2 \ln L}{\partial \theta^2}\Big|_{\theta=\theta_0} \approx y - (\hat\theta - \theta_0)\, \sigma_y^2 \,, \qquad (13.16)$$
|
|
where the consistency of the MLE guarantees the validity of this approximation in the sense of stochastic convergence. Following the central limit theorem, y/σy, being a normalized sum of i.i.d. variables, is asymptotically normally distributed with mean zero and
variance unity. The same is then true for (θ̂ − θ0)σy, i.e. θ̂ follows asymptotically a normal distribution with mean θ0 and asymptotically vanishing variance 1/σy² ∝ 1/N, as seen from (13.9).
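The asymptotic variance can be verified in a small simulation. This sketch is not from the text; the exponential model and the numbers are illustrative assumptions. For the exponential p.d.f. with mean θ the Fisher information per event is 1/θ², so σy² = N/θ0² and the MLE variance should approach 1/σy² = θ0²/N:

```python
import random

# Repeat many toy experiments, compute the MLE (sample mean) in each,
# and compare the empirical variance of the estimates with the
# asymptotic prediction theta0^2/N.
random.seed(2)
theta0, n_events, n_experiments = 2.0, 400, 2000

estimates = []
for _ in range(n_experiments):
    sample = [random.expovariate(1.0 / theta0) for _ in range(n_events)]
    estimates.append(sum(sample) / n_events)  # MLE = sample mean

mean_hat = sum(estimates) / n_experiments
var_hat = sum((t - mean_hat) ** 2 for t in estimates) / (n_experiments - 1)
predicted = theta0 ** 2 / n_events  # 1/sigma_y^2 = 0.01
print(mean_hat, var_hat, predicted)
```

The empirical mean sits at θ0 (asymptotic unbiasedness) and the empirical variance matches θ0²/N, i.e. the efficiency bound.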
|
|
|
|
|
|
|
|
13.3.3 Asymptotic Form of the Likelihood Function

A similar result as derived in the last paragraph for the p.d.f. of the MLE θ̂ can be derived for the likelihood function itself.
If one considers the Taylor expansion of y = ∂ ln L/∂θ around the MLE θ̂, with y(θ̂) = 0 we get

$$y(\theta) \approx (\theta - \hat\theta)\, y'(\hat\theta) \,. \qquad (13.17)$$
As discussed in the last paragraph, we have for N → ∞
$$y'(\hat\theta) \to y'(\theta_0) \to \langle y' \rangle = -\sigma_y^2 = \text{const} \,.$$
Thus y′(θ̂) is independent of θ̂ and higher derivatives disappear. After integration of (13.17) over θ we obtain a parabolic form for ln L:

$$\ln L(\theta) = \ln L(\hat\theta) - \frac{1}{2}\, \sigma_y^2\, (\theta - \hat\theta)^2 \,,$$
where the width of the parabola decreases with σy⁻² ∝ 1/N (13.9). Up to the missing normalization, the likelihood function has the same form as the distribution of the MLE with θ̂ − θ0 replaced by θ − θ̂.
Example 149. Parameter uncertainty for background contaminated signals
We investigate how well our asymptotic error formula works in a specific example. To this end, we consider a Gaussian signal distribution with width unity and mean zero over a background modeled by an exponential distribution with decay constant γ = 0.2 of the form c exp[−γ(x + 4)], where both distributions are restricted to the range [−4, 4]. The numbers of signal events S, background events B and reference events M follow Poisson distributions with mean values ⟨S⟩ = 60, ⟨B⟩ = 40 and ⟨M⟩ = 100. This implies a correction factor r = ⟨B⟩/⟨M⟩ = 0.4 for the reference experiment. From 10⁴ MC experiments we obtain a distribution of µ̂, with mean value and width
0.019 and 0.34, respectively. The pure signal µ̂(S) has mean and width 0.001 and 0.13 (= 1/√60). From our asymptotic error formula (13.21) we derive an error of 0.31, slightly smaller than the MC result. The discrepancy will be larger for lower statistics. It is typical for Poisson fluctuations.
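The pure-signal numbers quoted above are easy to reproduce. The following toy study is a simplified sketch of the MC setup (signal only, no background contamination): with S ~ Poisson(60) Gaussian events of width one, the sample-mean estimate of µ scatters with a width of about 1/√60 ≈ 0.13:

```python
import math
import random

random.seed(4)

def poisson(mean):
    """Poisson random number via Knuth's product method (fine for small means)."""
    limit, k, prod = math.exp(-mean), 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

def one_experiment():
    """Estimate mu as the mean of S ~ Poisson(60) unit-width Gaussian events."""
    s = poisson(60)
    if s == 0:
        return 0.0
    return sum(random.gauss(0.0, 1.0) for _ in range(s)) / s

mus = [one_experiment() for _ in range(5000)]
mean_mu = sum(mus) / len(mus)
width = math.sqrt(sum((m - mean_mu) ** 2 for m in mus) / (len(mus) - 1))
print(mean_mu, width)  # width close to 1/sqrt(60)
```

The Poisson fluctuation of S slightly broadens the width beyond 1/√60, a small-sample effect of the same kind as the discrepancy noted above.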
13.5 Frequentist Confidence Intervals
We associate error intervals to measurements to indicate that the parameter of interest has a reasonably high probability to be located inside the interval. However, to compute this probability a prior probability has to be introduced, with the problems which we have discussed in Sect. 6.1. To circumvent this problem, J. Neyman has proposed a method to construct intervals without using prior probabilities. Unfortunately, as is often the case, one problem is traded for another one.
Neyman's confidence intervals have the following defining property: The true parameter lies in the interval on average in the fraction C of intervals of confidence level C. In other words: Given a true value θ, a measurement t will include it in its associated confidence interval [t1, t2] – "cover" it – with probability C. (Note that this does not necessarily imply that, given a certain confidence interval, the true value is included in it with probability C.)
Traditionally chosen values for the confidence level are 68.3%, 90%, 95% – the former corresponds to the standard error interval of the normal distribution.
Confidence intervals are constructed in the following way:
For each parameter value θ a probability interval [t1(θ), t2(θ)] is defined, such that the probability that the observed value t of the estimator of θ is located in the interval is equal to the confidence level C:
$$P\{t_1(\theta) \le t \le t_2(\theta)\} = \int_{t_1}^{t_2} f(t|\theta)\, dt = C \,. \qquad (13.22)$$
Of course the p.d.f. f(t|θ) or error distribution of the estimator t must be known. To fix the interval completely, an additional condition is applied. In the univariate case, a common procedure is to choose central intervals,
$$P\{t < t_1\} = P\{t > t_2\} = \frac{1-C}{2} \,.$$
Other conventions are minimum length and equal probability intervals defined by f(t1) = f(t2). The confidence interval consists of those parameter values which include the measurement t̂ within their probability intervals. Somewhat simplified: parameter values are accepted if the observation is compatible with them.
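The defining coverage property is easy to check by Monte Carlo in the simplest case. In this sketch (all numbers are illustrative assumptions) t is Gaussian with known standard deviation σ around θ, the central 68.3% probability interval is [θ − σ, θ + σ], and inverting the belt gives the confidence interval [t − σ, t + σ]:

```python
import random

# Coverage check: the fraction of toy experiments in which the inverted
# belt [t - sigma, t + sigma] covers the true theta should equal the
# confidence level C = 0.683, whatever theta_true is.
random.seed(5)
theta_true, sigma, n_trials = 3.0, 1.0, 20000

covered = 0
for _ in range(n_trials):
    t = random.gauss(theta_true, sigma)
    if t - sigma <= theta_true <= t + sigma:  # inverted confidence belt
        covered += 1

coverage = covered / n_trials
print(coverage)  # close to 0.683 by construction
```

For this belt the coverage is exact for every true θ, which is precisely Neyman's defining property.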
Fig. 13.1. Confidence belt. The shaded area is the confidence belt, consisting of the probability intervals [t1(θ), t2(θ)] for the estimator t. The observation t = 4 leads to the confidence interval [θmin , θmax].
The one-dimensional case is illustrated in Fig. 13.1. The pair of curves t = t1(θ), t = t2(θ) in the (t, θ)-plane comprises the so-called confidence belt. To the measurement t̂ = 4 then corresponds the confidence interval [θmin, θmax], obtained by inverting the relations t1,2(θmax,min) = t̂, i.e. the section of the straight line t = t̂ parallel to the θ axis.
The construction shown in Fig. 13.1 is not always feasible: it has to be assumed that t1,2(θ) are monotone functions. If the curve t1(θ) has a maximum, say at θ = θ0, then the relation t1(θ) = t̂ cannot always be inverted: for t̂ > t1(θ0) the confidence belt degenerates into a region bounded from below, while for t̂ < t1(θ0) there is no unique solution. In the first case one usually quotes a lower confidence bound, i.e. an infinite interval bounded from below. In the second case one could construct a set of disconnected intervals, some of which may be excluded by other arguments.
The construction of the confidence contour in the two-parameter case is illustrated in Fig. 13.2, where for simplicity the parameter and the observation space are chosen such that they coincide. For each point θ1, θ2 in the parameter space we fix a probability contour which contains a measurement of the parameters with probability C. Those parameter points whose probability contours pass through the actual measurement θ̂1, θ̂2 are located at the confidence contour. All parameter pairs located inside the shaded area contain the measurement in their probability region.
Frequentist statistics avoids prior probabilities. This feature, while desirable in general, can have negative consequences if prior information exists. This is the case if the parameter space is constrained by mathematical or physical conditions. In frequentist statistics it is not possible to exclude unphysical parameter values without introducing additional complications. Thus, for instance, a measurement could lead, for a mass, to a 90% confidence interval which is situated completely in the negative region, or, for an angle, to a complex angular region. The problem is mitigated somewhat by a newer method [78], but not without introducing other complications [79], [80].
Fig. 13.2. Confidence interval. The shaded area is the confidence region for the two-dimensional measurement (θ̂1, θ̂2). The dashed curves indicate probability regions associated to the locations denoted by capital letters.
13.6 Comparison of Different Inference Methods
13.6.1 A Few Examples
Before we compare the different statistical philosophies, let us look at a few examples.
Example 150. Coverage: performance of magnets
A company produces magnets which have to satisfy the specified field strength within certain tolerances. The various measurements performed by the company are fed into a fitting procedure producing 99% confidence intervals, which are used to accept the product (if the interval is inside the tolerances) or reject it before sending it off. The client is able to repeat the measurement with high precision and accepts only magnets within the agreed specification. To calculate the price, the company must rely on the condition that the confidence interval in fact covers the true value with the presumed confidence level.
Example 151. Bias in the measurements for a mass determination
The mass of a particle is determined from the momenta and the angular configuration of its decay products. The mean value of the masses from many events is reported. The momenta of the charged particles are measured by means of a spectrometer consisting of a magnet and tracking chambers. In this configuration, the χ2 fit of the absolute values of track momenta and consequently also the mass estimates are biased. This bias, which can be shown to be positive, propagates into the mean value. Here a bias in the momentum fit has to be corrected for, because it would lead to a systematic shift of the resulting average of the mass values.
Example 152. Inference with known prior
We repeat an example presented in Sect. 6.2.2. In the reconstruction of a specific, very interesting event, for instance providing experimental evidence for a new particle, we have to infer the distance θ between the production and decay vertices of an unstable particle produced in the reaction. From its momentum and its known mean life we calculate its expected decay length λ. The prior density for the actual decay length θ is π(θ) = exp(−θ/λ)/λ. The experimental distance measurement which follows a Gaussian with standard deviation s yields d. According to (6.2.2), the p.d.f. for the actual distance is given by
$$f(\theta|d) = \frac{e^{-(d-\theta)^2/(2s^2)}\, e^{-\theta/\lambda}}{\int_0^\infty e^{-(d-\theta')^2/(2s^2)}\, e^{-\theta'/\lambda}\, d\theta'} \,.$$
This is an ideal situation. We can determine the mean value and the standard deviation or the mode of the θ distribution and an asymmetric error interval with well defined probability content, for instance 68.3%. The confidence level is of no interest and due to the application of the prior the estimate of θ is biased, but this is irrelevant.
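The posterior in this example is straightforward to evaluate numerically. In the following sketch the values of d, s and λ are made-up illustration numbers, not from the text; the normalization integral over θ ≥ 0 is approximated by a simple Riemann sum:

```python
import math

# Posterior for the decay-length example: Gaussian likelihood around the
# measured distance d times the exponential prior exp(-theta/lam)/lam,
# normalized numerically on theta >= 0.
d, s, lam = 1.0, 0.5, 2.0  # illustrative numbers

def unnorm(theta):
    return math.exp(-(d - theta) ** 2 / (2 * s ** 2) - theta / lam)

dt = 0.001
grid = [i * dt for i in range(20000)]  # theta in [0, 20)
norm = sum(unnorm(t) for t in grid) * dt
post_mean = sum(t * unnorm(t) for t in grid) * dt / norm
print(post_mean)  # pulled below d by the falling exponential prior
```

The posterior mean lies below the raw measurement d: the falling prior shifts the estimate towards smaller decay lengths, which is exactly the (here harmless) bias discussed in the text.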
Example 153. Bias introduced by a prior
We now modify and extend our example. Instead of the decay length we discuss the lifetime of the particle. The reasoning is the same: we can apply the prior and determine an estimate and an error interval. We now study N decays to improve our knowledge of the mean lifetime τ of the particle species. For each individual decay we use a prior with an estimate of τ as known from previous experiments, determine each time the lifetime t̂i and the mean value t̄ = Σ t̂i/N from all measurements. Even though the individual time estimates t̂i are improved by applying the prior, the average t̄ is a very bad estimate of τ, because the t̂i are biased towards low values and consequently also their mean value is shifted. (Remark that in this and in the third example we have two types of parameters which we have to distinguish. We discuss the effect of a bias afflicting the primary parameter set, i.e. λ, respectively τ.)
Example 154. Comparing predictions with strongly differing accuracies: earthquake

Two theories H1, H2 predict the time θ of an earthquake. The predictions differ in the expected values as well as in the size of the Gaussian errors:
H1 : θ1 = (7.50 ± 2.25) h ,
H2 : θ2 = (50 ± 100) h .