
CHAPTER 3. CONDITIONAL EXPECTATION AND PROJECTION


where x1 and x2 are the observed variables, u is an ℓ × 1 unobserved random factor, and h is a functional relationship. This framework includes as a special case the random coefficient model (3.27) studied earlier. We define the causal effect of x1 within this model as the change in y due to a change in x1, holding the other variables x2 and u constant.

Definition 3.28.1 In the model (3.48) the causal effect of x1 on y is

    C(x_1, x_2, u) = \nabla_1 h(x_1, x_2, u),     (3.49)

the change in y due to a change in x1, holding x2 and u constant.

To understand this concept, imagine taking a single individual. As far as our structural model is concerned, this person is described by their observables x1 and x2, and their unobservables u. In a wage regression the unobservables would include characteristics such as the person's abilities, skills, work ethic, interpersonal connections, and preferences. The causal effect of x1 (say, education) is the change in the wage as x1 changes, holding constant all other observables and unobservables.

It may be helpful to understand that (3.49) is a definition, and does not necessarily describe causality in a fundamental or experimental sense. Perhaps it would be more appropriate to label (3.49) as a structural effect (the effect within the structural model).

Sometimes it is useful to write this relationship as a potential outcome function

    y(x_1) = h(x_1, x_2, u)

where the notation implies that y(x1) is defined holding x2 and u constant.

A popular example arises in the analysis of treatment effects with a binary regressor x1. Let x1 = 1 indicate treatment (e.g. a medical procedure) and x1 = 0 indicate non-treatment. In this case y(x1) can be written

    y(0) = h(0, x_2, u)
    y(1) = h(1, x_2, u)

In the literature on treatment effects, it is common to refer to y(0) and y(1) as the latent outcomes associated with non-treatment and treatment, respectively. That is, for a given individual, y(0) is the health outcome if there is no treatment, and y(1) is the health outcome if there is treatment. The causal effect of treatment for the individual is the change in their health outcome due to treatment, the change in y as we hold both x2 and u constant:

    C(x_2, u) = y(1) - y(0).

This is random (a function of x2 and u) as both potential outcomes y(0) and y(1) are different across individuals.

In a sample, we cannot observe both outcomes from the same individual; we only observe the realized value

    y = \begin{cases} y(0) & \text{if } x_1 = 0 \\ y(1) & \text{if } x_1 = 1 \end{cases}

As the causal effect varies across individuals and is not observable, it cannot be measured on the individual level. We therefore focus on aggregate causal effects, in particular what is known as the average causal effect.
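To make the potential-outcomes notation concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than part of the text: the structural function h, the distributions of x2 and u, and the random assignment of x1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Illustrative structural components (assumptions for this sketch only)
x2 = rng.normal(size=n)                          # observed covariate
u = rng.normal(size=n)                           # unobserved factor
h = lambda x1, x2, u: 1.0 * x1 + 0.5 * x2 + u    # assumed h(x1, x2, u)

# Latent (potential) outcomes for every individual
y0 = h(0, x2, u)
y1 = h(1, x2, u)

# Randomly assigned treatment and the realized outcome: only one latent outcome is observed
x1 = rng.integers(0, 2, size=n)
y = np.where(x1 == 1, y1, y0)

# Individual causal effects C(x2, u) = y(1) - y(0) and their population average
C = y1 - y0
print("average causal effect:", C.mean())                              # 1.0 by construction
print("difference in group means:", y[x1 == 1].mean() - y[x1 == 0].mean())
```

Because x1 is assigned independently of (x2, u) in this sketch, the difference in observed group means recovers the average effect; the definitions and theorem below make this connection precise.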


Definition 3.28.2 In the model (3.48) the average causal effect of x1 on y conditional on x2 is

    ACE(x_1, x_2) = E(C(x_1, x_2, u) \mid x_1, x_2) = \int_{\mathbb{R}^\ell} \nabla_1 h(x_1, x_2, u) f(u \mid x_1, x_2)\, du     (3.50)

where f(u | x1, x2) is the conditional density of u given x1, x2.

We can think of the average causal effect ACE(x1, x2) as the average effect in the general population.

What is the relationship between the average causal effect ACE(x1, x2) and the regression derivative ∇1 m(x1, x2)? Equation (3.48) implies that the CEF is

    m(x_1, x_2) = E(h(x_1, x_2, u) \mid x_1, x_2) = \int_{\mathbb{R}^\ell} h(x_1, x_2, u) f(u \mid x_1, x_2)\, du,

the average causal equation, averaged over the conditional distribution of the unobserved component u.

Applying the marginal effect operator, the regression derivative is

 

    \nabla_1 m(x_1, x_2) = \int_{\mathbb{R}^\ell} \nabla_1 h(x_1, x_2, u) f(u \mid x_1, x_2)\, du + \int_{\mathbb{R}^\ell} h(x_1, x_2, u) \nabla_1 f(u \mid x_1, x_2)\, du
                         = ACE(x_1, x_2) + \int_{\mathbb{R}^\ell} h(x_1, x_2, u) \nabla_1 f(u \mid x_1, x_2)\, du.     (3.51)

In general, the average causal effect is not the regression derivative. However, they are equal when the second component in (3.51) is zero. This occurs when ∇1 f(u | x1, x2) = 0, that is, when the conditional density of u given (x1, x2) does not depend on x1. The condition is sufficiently important that it has a special name in the treatment effects literature.

Definition 3.28.3 Conditional Independence Assumption (CIA). Conditional on x2, the random variables x1 and u are statistically independent.

The CIA implies f(u | x1, x2) = f(u | x2) does not depend on x1, and thus ∇1 f(u | x1, x2) = 0. Thus the CIA implies that ∇1 m(x1, x2) = ACE(x1, x2), the regression derivative equals the average causal effect.

Theorem 3.28.1 In the structural model (3.48), the Conditional Independence Assumption implies

    \nabla_1 m(x_1, x_2) = ACE(x_1, x_2),

the regression derivative equals the average causal effect for x1 on y conditional on x2.

This is a fascinating result. It shows that whenever the unobservable is independent of the treatment variable (after conditioning on appropriate regressors) the regression derivative equals the average causal effect. In this case, the CEF has causal economic meaning, giving strong justification


to estimation of the CEF. Our derivation also shows the critical role of the CIA. If the CIA fails, then the equality of the regression derivative and the ACE fails.

This theorem is quite general. It applies equally to the treatment-effects model where x1 is binary and to more general settings where x1 is continuous.

It is also helpful to understand that the CIA is weaker than full independence of u from the regressors (x1, x2). The CIA was introduced precisely as a minimal sufficient condition to obtain the desired result. Full independence implies the CIA and implies that each regression derivative equals that variable's average causal effect, but full independence is not necessary in order to causally interpret a subset of the regressors.
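The role of the CIA can be seen in a small numerical sketch. The data-generating process below is an illustrative assumption, not from the text: x1 is binary, so the regression derivative at a point is the difference in conditional means m(1, x2) - m(0, x2), and the true average causal effect is 1.0.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

def simulate(cia: bool) -> float:
    """Return the regression derivative m(1) - m(0) near x2 = 0; the ACE is 1.0."""
    x2 = rng.normal(size=n)
    u = rng.normal(size=n)
    if cia:
        # CIA holds: given x2, treatment is independent of the unobservable u
        p = 1 / (1 + np.exp(-x2))
    else:
        # CIA fails: the treatment probability also depends on u
        p = 1 / (1 + np.exp(-(x2 + 2 * u)))
    x1 = rng.binomial(1, p)
    y = 1.0 * x1 + 0.5 * x2 + u              # assumed structural function h
    # Difference of conditional means within a narrow band around x2 = 0
    band = np.abs(x2) < 0.1
    m1 = y[band & (x1 == 1)].mean()
    m0 = y[band & (x1 == 0)].mean()
    return m1 - m0

print("CIA holds :", round(simulate(True), 2))   # close to the ACE of 1.0
print("CIA fails :", round(simulate(False), 2))  # biased away from 1.0
```

When the CIA fails, the second term in (3.51) is nonzero: the conditional density of u shifts with x1, so the regression derivative mixes the causal effect with selection on the unobservable.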

3.29 Existence and Uniqueness of the Conditional Expectation*

In Sections 3.3 and 3.5 we defined the conditional mean when the conditioning variables x are discrete and when the variables (y, x) have a joint density. We have explored these cases because these are the situations where the conditional mean is easiest to describe and understand. However, the conditional mean can be defined without appealing to the properties of either discrete or continuous random variables. The conditional mean exists quite generally.

To justify this claim we now present a deep result from probability theory. What it says is that the conditional mean exists for all joint distributions (y, x). The only requirement is that y has a finite mean.

Theorem 3.29.1 Existence of the Conditional Mean

If E|y| < ∞ then there exists a function m(x) such that for all measurable sets X

    E\left(\mathbf{1}(x \in X)\, y\right) = E\left(\mathbf{1}(x \in X)\, m(x)\right).     (3.52)

The function m(x) is almost everywhere unique, in the sense that if h(x) satisfies (3.52), then there is a set S such that Pr(S) = 1 and m(x) = h(x) for x ∈ S. The function m(x) is called the conditional mean and is written m(x) = E(y | x).

See, for example, Ash (1972), Theorem 6.3.3.

The conditional mean m(x) defined by (3.52) specializes to (3.7) when (y, x) have a joint density. The usefulness of definition (3.52) is that Theorem 3.29.1 shows that the conditional mean m(x) exists for all finite-mean distributions. This definition allows y to be discrete or continuous, for x to be scalar or vector-valued, and for the components of x to be discrete or continuously distributed.
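As a concrete check of the defining property (3.52), the following sketch (a hypothetical discrete example, not from the text) compares sample analogues of E(1(x ∈ X) y) and E(1(x ∈ X) m(x)) for an arbitrary set X.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# A simple discrete conditioning variable and an outcome depending on it
x = rng.integers(0, 3, size=n)            # x takes values 0, 1, 2
y = 2.0 * x + rng.normal(size=n)          # so E(y | x) = 2x by construction

m_x = 2.0 * x                             # the conditional mean evaluated at each draw
indicator = np.isin(x, [0, 2])            # an arbitrary measurable set X = {0, 2}

# Both sides of (3.52), approximated by sample averages; they agree
print(np.mean(indicator * y))             # E(1(x in X) y)
print(np.mean(indicator * m_x))           # E(1(x in X) m(x))
```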

Theorem 3.29.1 also demonstrates that E|y| < ∞ is a sufficient condition for identification of the conditional mean. (Recall, a parameter is identified if it is uniquely determined by the distribution of the observed variables.)

Theorem 3.29.2 Identification of the Conditional Mean

If E|y| < ∞, the conditional mean m(x) = E(y | x) is identified almost everywhere.


3.30 Technical Proofs*

Proof of Theorem 3.6.1:

For convenience, assume that the variables have a joint density f(y, x). Since E(y | x) is a function of the random vector x only, to calculate its expectation we integrate with respect to the density f_x(x) of x, that is

    E(E(y \mid x)) = \int_{\mathbb{R}^k} E(y \mid x) f_x(x)\, dx.

Substituting in (3.7) and noting that f_{y|x}(y \mid x) f_x(x) = f(y, x), we find that the above expression equals

    \int_{\mathbb{R}^k} \left( \int_{\mathbb{R}} y f_{y|x}(y \mid x)\, dy \right) f_x(x)\, dx = \int_{\mathbb{R}^k} \int_{\mathbb{R}} y f(y, x)\, dy\, dx = E(y),

the unconditional mean of y.
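A quick numerical illustration of the law of iterated expectations just proved (the model below is an assumption for illustration, not part of the proof): the sample analogues of E(E(y | x)) and E(y) agree.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

x = rng.normal(size=n)
y = np.exp(x) + rng.normal(size=n)    # assumed model with E(y | x) = exp(x)

cond_mean = np.exp(x)                 # E(y | x) evaluated at each draw of x
print(cond_mean.mean())               # sample analogue of E(E(y | x))
print(y.mean())                       # sample analogue of E(y); both are close to exp(1/2)
```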

 

 

 

 

 

 

 

Proof of Theorem 3.6.2:

Again assume that the variables have a joint density. It is useful to observe that

    f(y \mid x_1, x_2) f(x_2 \mid x_1) = \frac{f(y, x_1, x_2)}{f(x_1, x_2)} \cdot \frac{f(x_1, x_2)}{f(x_1)} = f(y, x_2 \mid x_1),     (3.53)

the density of (y, x2) given x1. Here, we have abused notation and used a single symbol f to denote the various unconditional and conditional densities to reduce notational clutter.

Note that

    E(y \mid x_1, x_2) = \int_{\mathbb{R}} y f(y \mid x_1, x_2)\, dy.     (3.54)

Integrating (3.54) with respect to the conditional density of x2 given x1, and applying (3.53), we find that

    E(E(y \mid x_1, x_2) \mid x_1) = \int_{\mathbb{R}^{k_2}} E(y \mid x_1, x_2) f(x_2 \mid x_1)\, dx_2
        = \int_{\mathbb{R}^{k_2}} \left( \int_{\mathbb{R}} y f(y \mid x_1, x_2)\, dy \right) f(x_2 \mid x_1)\, dx_2
        = \int_{\mathbb{R}^{k_2}} \int_{\mathbb{R}} y f(y \mid x_1, x_2) f(x_2 \mid x_1)\, dy\, dx_2
        = \int_{\mathbb{R}^{k_2}} \int_{\mathbb{R}} y f(y, x_2 \mid x_1)\, dy\, dx_2
        = E(y \mid x_1)

as stated.
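Theorem 3.6.2 can likewise be checked numerically with discrete conditioning variables (an illustrative simulation, not part of the proof): averaging the cell means E(y | x1, x2) over the distribution of x2 within each x1 group reproduces E(y | x1).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

x1 = rng.integers(0, 2, size=n)
x2 = rng.integers(0, 3, size=n)
y = x1 + 0.5 * x2 + rng.normal(size=n)      # assumed data-generating process

for a in (0, 1):
    sel = x1 == a
    # E(E(y | x1, x2) | x1 = a): cell means weighted by Pr(x2 = b | x1 = a)
    inner = sum(
        y[sel & (x2 == b)].mean() * np.mean(x2[sel] == b)
        for b in (0, 1, 2)
    )
    print(a, round(inner, 3), round(y[sel].mean(), 3))   # both equal E(y | x1 = a)
```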

 

 

 

 

Proof of Theorem 3.6.3:

    E(g(x)\, y \mid x) = \int_{\mathbb{R}} g(x)\, y f_{y|x}(y \mid x)\, dy = g(x) \int_{\mathbb{R}} y f_{y|x}(y \mid x)\, dy = g(x)\, E(y \mid x).

This is (3.9). The assumption that E|g(x) y| < ∞ is required for the first equality to be well-defined. Equation (3.10) follows by applying the Simple Law of Iterated Expectations to (3.9).

Proof of Theorem 3.7.1: The assumption that Ey² < ∞ implies that all the conditional expectations below exist.


Set z = E(y | x1, x2). By the conditional Jensen's inequality (B.16),

    \left(E(z \mid x_1)\right)^2 \le E\left(z^2 \mid x_1\right).

Taking unconditional expectations, this implies

    E\left[\left(E(y \mid x_1)\right)^2\right] \le E\left[\left(E(y \mid x_1, x_2)\right)^2\right].

Similarly,

    (Ey)^2 \le E\left[\left(E(y \mid x_1)\right)^2\right] \le E\left[\left(E(y \mid x_1, x_2)\right)^2\right].     (3.55)

The variables y, E(y | x1) and E(y | x1, x2) all have the same mean Ey, so the inequality (3.55) implies that the variances are ranked monotonically:

    0 \le \operatorname{var}\left(E(y \mid x_1)\right) \le \operatorname{var}\left(E(y \mid x_1, x_2)\right).     (3.56)

Next, for μ = Ey observe that

    E\left[(y - E(y \mid x))(E(y \mid x) - \mu)\right] = E\left[(E(y \mid x) - \mu)\, E\left(y - E(y \mid x) \mid x\right)\right] = 0,

so the decomposition

    y = (y - E(y \mid x)) + E(y \mid x)

satisfies

    \operatorname{var}(y) = \operatorname{var}\left(y - E(y \mid x)\right) + \operatorname{var}\left(E(y \mid x)\right).     (3.57)

The monotonicity of the variances of the conditional mean (3.56) applied to the variance decomposition (3.57) implies the reverse monotonicity of the variances of the differences, completing the proof.
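The variance ranking in (3.56) and the decomposition (3.57) can be illustrated by simulation (the model below is an assumption for illustration only): conditioning on more information increases the variance of the conditional mean and decreases the variance of the remainder.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 + x2 + rng.normal(size=n)      # assumed model; var(y) = 3

m1 = x1            # E(y | x1) for this model
m12 = x1 + x2      # E(y | x1, x2)

print(np.var(m1), np.var(m12), np.var(y))      # increasing: about 1, 2, 3
print(np.var(y - m1), np.var(y - m12))         # decreasing: about 2, 1
```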

Proof of Theorem 3.8.1. Applying Minkowski's Inequality (B.22) to e = y - m(x),

    \left(E|e|^r\right)^{1/r} = \left(E|y - m(x)|^r\right)^{1/r} \le \left(E|y|^r\right)^{1/r} + \left(E|m(x)|^r\right)^{1/r} < \infty,

where the two parts on the right-hand side are finite since E|y|^r < ∞ by assumption and E|m(x)|^r < ∞ by the Conditional Expectation Inequality (B.17). The fact that (E|e|^r)^{1/r} < ∞ implies E|e|^r < ∞.

Proof of Theorem 3.16.1. For part 1, by the Expectation Inequality (B.18), (A.6) and Assumption 3.16.1,

    \left\| E\left(xx'\right) \right\| \le E\left\| xx' \right\| = E\|x\|^2 < \infty.

Similarly, using the Expectation Inequality (B.18), the Cauchy-Schwarz Inequality (B.20) and Assumption 3.16.1,

    \left\| E(xy) \right\| \le E\|xy\| \le \left(E\|x\|^2\right)^{1/2} \left(Ey^2\right)^{1/2} < \infty.

Thus the moments E(xy) and E(xx') are finite and well defined.

For part 2, the coefficient β = (E(xx'))^{-1} E(xy) is well defined since (E(xx'))^{-1} exists under Assumption 3.16.1.

Part 3 follows from Definition 3.16.1 and part 2. For part 4, first note that

    Ee^2 = E\left(y - x'\beta\right)^2
         = Ey^2 - 2E\left(yx'\right)\beta + \beta' E\left(xx'\right)\beta
         = Ey^2 - E\left(yx'\right)\left(E\left(xx'\right)\right)^{-1} E(xy)
         \le Ey^2 < \infty.


The first inequality holds because E(yx') (E(xx'))^{-1} E(xy) is a quadratic form and therefore necessarily non-negative. Second, by the Expectation Inequality (B.18), the Cauchy-Schwarz Inequality (B.20) and Assumption 3.16.1,

    \left\| E(xe) \right\| \le E\|xe\| \le \left(E\|x\|^2\right)^{1/2} \left(Ee^2\right)^{1/2} < \infty.

It follows that the expectation E(xe) is finite, and is zero by the calculation (3.26). For part 6, applying Minkowski's Inequality (B.22) to e = y - x'β,

    \left(E|e|^r\right)^{1/r} = \left(E\left|y - x'\beta\right|^r\right)^{1/r}
                              \le \left(E|y|^r\right)^{1/r} + \left(E\left|x'\beta\right|^r\right)^{1/r}
                              \le \left(E|y|^r\right)^{1/r} + \left(E\|x\|^r\right)^{1/r} \|\beta\|
                              < \infty,

the final inequality by assumption.


Exercises

Exercise 3.1 Find E(E(E(y | x1, x2, x3) | x1, x2) | x1).

Exercise 3.2 If E(y | x) = a + bx, find E(yx) as a function of moments of x.

Exercise 3.3 Prove Theorem 3.8.1.4 using the law of iterated expectations.

Exercise 3.4 Suppose that the random variables y and x only take the values 0 and 1, and have the following joint probability distribution

              x = 0    x = 1
    y = 0      .1       .2
    y = 1      .4       .3

Find E(y | x), E(y² | x) and var(y | x) for x = 0 and x = 1.

Exercise 3.5 Show that σ²(x) is the best predictor of e² given x:

(a) Write down the mean-squared error of a predictor h(x) for e².

(b) What does it mean to be predicting e²?

(c) Show that σ²(x) minimizes the mean-squared error and is thus the best predictor.

Exercise 3.6 Use y = m(x) + e to show that

    \operatorname{var}(y) = \operatorname{var}(m(x)) + \sigma^2

Exercise 3.7 Show that the conditional variance can be written as

    \sigma^2(x) = E\left(y^2 \mid x\right) - \left(E(y \mid x)\right)^2.

Exercise 3.8 Suppose that y is discrete-valued, taking values only on the non-negative integers, and the conditional distribution of y given x is Poisson:

    \Pr(y = j \mid x) = \frac{\exp\left(-x'\beta\right)\left(x'\beta\right)^j}{j!}, \qquad j = 0, 1, 2, \ldots

Compute E(y | x) and var(y | x). Does this justify a linear regression model of the form y = x'β + e?

Hint: If \Pr(y = j) = \frac{\exp(-\lambda)\lambda^j}{j!}, then Ey = λ and var(y) = λ.

Exercise 3.9 Suppose you have two regressors: x1 is binary (takes values 0 and 1) and x2 is categorical with 3 categories (A, B, C). Write E(y | x1, x2) as a linear regression.

Exercise 3.10 True or False. If y = xβ + e, x ∈ R, and E(e | x) = 0, then E(x²e) = 0.

Exercise 3.11 True or False. If y = xβ + e, x ∈ R, and E(xe) = 0, then E(x²e) = 0.

Exercise 3.12 True or False. If y = x'β + e and E(e | x) = 0, then e is independent of x.

Exercise 3.13 True or False. If y = x'β + e and E(xe) = 0, then E(e | x) = 0.


Exercise 3.14 True or False. If y = x'β + e, E(e | x) = 0, and E(e² | x) = σ², a constant, then e is independent of x.

Exercise 3.15 Let x and y have the joint density f(x, y) = \frac{3}{2}\left(x^2 + y^2\right) on 0 ≤ x ≤ 1, 0 ≤ y ≤ 1. Compute the coefficients of the best linear predictor y = α + βx + e. Compute the conditional mean m(x) = E(y | x). Are the best linear predictor and conditional mean different?

Exercise 3.16 Let x be a random variable with μ = Ex and σ² = var(x). Define

    g\left(x \mid \mu, \sigma^2\right) = \begin{pmatrix} x - \mu \\ (x - \mu)^2 - \sigma^2 \end{pmatrix}.

Show that Eg(x | m, s) = 0 if and only if m = μ and s = σ².

Exercise 3.17 Suppose that

    x = \begin{pmatrix} 1 \\ x_2 \\ x_3 \end{pmatrix}

and x3 = α1 + α2 x2 is a linear function of x2.

(a) Show that Qxx = E(xx') is not invertible.

(b) Use a linear transformation of x to find an expression for the best linear predictor of y given x. (Be explicit, do not just use the generalized inverse formula.)

Exercise 3.18 Show (3.42)-(3.43), namely that for

    d(\beta) = E\left[\left(m(x) - x'\beta\right)^2\right]

then

    \beta = \operatorname*{argmin}_{\beta \in \mathbb{R}^k} d(\beta)
          = \left(E\left(xx'\right)\right)^{-1} E\left(x\, m(x)\right)
          = \left(E\left(xx'\right)\right)^{-1} E\left(xy\right).

Hint: To show E(x m(x)) = E(xy) use the law of iterated expectations.

Chapter 4

The Algebra of Least Squares

4.1 Introduction

In this chapter we introduce the popular least-squares estimator. Most of the discussion will be algebraic, with questions of distribution and inference deferred to later chapters.

4.2 Least Squares Estimator

In Section 3.16 we derived and discussed the best linear predictor of y given x for a pair of random variables (y, x) ∈ R × R^k, and called this the linear projection model. Applied to the observations (yi, xi : i = 1, ..., n) from a random sample, this model takes the form

 

    y_i = x_i'\beta + e_i     (4.1)

where β is defined as

    \beta = \operatorname*{argmin}_{\beta \in \mathbb{R}^k} S(\beta),     (4.2)

the minimizer of the expected squared error

    S(\beta) = E\left[\left(y_i - x_i'\beta\right)^2\right],     (4.3)

and has the explicit solution

    \beta = \left(E\left(x_i x_i'\right)\right)^{-1} E\left(x_i y_i\right).     (4.4)

When a parameter is defined as the minimizer of a function as in (4.2), a standard approach to estimation is to construct an empirical analog of the function, and define the estimator of the parameter as the minimizer of the empirical function.

The empirical analog of the expected squared error (4.3) is the sample average squared error

    S_n(\beta) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - x_i'\beta\right)^2 = \frac{1}{n}\, SSE_n(\beta)     (4.5)

where

    SSE_n(\beta) = \sum_{i=1}^{n}\left(y_i - x_i'\beta\right)^2

is called the sum-of-squared-errors function. An estimator for β is the minimizer of (4.5):

    \hat{\beta} = \operatorname*{argmin}_{\beta \in \mathbb{R}^k} S_n(\beta).


Figure 4.1: Sum-of-Squared Errors Function

Alternatively, as S_n(β) is a scale multiple of SSE_n(β), we may equivalently define β̂ as the minimizer of SSE_n(β). Hence β̂ is commonly called the least-squares (LS) (or ordinary least squares (OLS)) estimator of β. As discussed in Chapter 2, the hat “^” on β̂ signifies that it is an estimator of the parameter β. If we want to be explicit about the estimation method, we can write β̂_ols to signify that it is the OLS estimator.

To visualize the quadratic function S_n(β), Figure 4.1 displays an example sum-of-squared-errors function SSE_n(β) for the case k = 2. The least-squares estimator β̂ is the pair (β̂1, β̂2) minimizing this function.

4.3 Solving for Least Squares

To solve for β̂, expand the SSE function to find

    SSE_n(\beta) = \sum_{i=1}^{n} y_i^2 - 2\beta' \sum_{i=1}^{n} x_i y_i + \beta' \left(\sum_{i=1}^{n} x_i x_i'\right) \beta

which is quadratic in the vector argument β. The first-order condition for minimization of SSE_n(β) is

    0 = \frac{\partial}{\partial \beta} SSE_n(\hat{\beta}) = -2\sum_{i=1}^{n} x_i y_i + 2\left(\sum_{i=1}^{n} x_i x_i'\right)\hat{\beta}.     (4.6)

By inverting the k × k matrix \sum_{i=1}^{n} x_i x_i', we find an explicit formula for the least-squares estimator

    \hat{\beta} = \left(\sum_{i=1}^{n} x_i x_i'\right)^{-1}\left(\sum_{i=1}^{n} x_i y_i\right).     (4.7)

This is the natural estimator of the best linear projection coefficient β defined in (4.2), and can also be called the linear projection estimator.
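As a concrete illustration of formula (4.7), the sketch below (simulated data and variable names are my own, not from the text) computes the least-squares estimator from a random sample, using the sums of x_i x_i' and x_i y_i exactly as in (4.7).

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 1_000, 3

# Simulated sample (assumed design); the first regressor is an intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=n)

# Formula (4.7): beta_hat = (sum_i x_i x_i')^{-1} (sum_i x_i y_i)
Sxx = X.T @ X                            # sum of x_i x_i'
Sxy = X.T @ y                            # sum of x_i y_i
beta_hat = np.linalg.solve(Sxx, Sxy)     # solve rather than form the inverse explicitly

print(beta_hat)                          # close to beta_true in large samples
```

Using np.linalg.solve instead of explicitly inverting the sum of x_i x_i' yields the same β̂ but is numerically preferable; the algebraic content is identical to (4.7).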
