CHAPTER 2: THE SIMPLE REGRESSION MODEL
The simple regression model can be used to study the relationship between two variables. For reasons we will see, the simple regression model has limitations as a general tool for empirical analysis. Nevertheless, it is sometimes appropriate as an empirical tool. Learning how to interpret the simple regression model is good practice for studying multiple regression, which we'll do in subsequent chapters.

2.1 DEFINITION OF THE SIMPLE REGRESSION MODEL

Much of applied econometric analysis begins with the following premise: y and x are two variables, representing some population, and we are interested in "explaining y in terms of x," or in "studying how y varies with changes in x." We discussed some examples in Chapter 1, including: y is soybean crop yield and x is amount of fertilizer; y is hourly wage and x is years of education; y is a community crime rate and x is number of police officers.

In writing down a model that will "explain y in terms of x," we must confront three issues. First, since there is never an exact relationship between two variables, how do we allow for other factors to affect y? Second, what is the functional relationship between y and x? And third, how can we be sure we are capturing a ceteris paribus relationship between y and x (if that is a desired goal)?

We can resolve these ambiguities by writing down an equation relating y to x. A simple equation is

    y = β₀ + β₁x + u.    (2.1)

Equation (2.1), which is assumed to hold in the population of interest, defines the simple linear regression model. It is also called the two-variable linear regression model or bivariate linear regression model because it relates the two variables x and y. We now discuss the meaning of each of the quantities in (2.1). (Incidentally, the term "regression" has origins that are not especially important for most modern econometric applications, so we will not explain it here. See Stigler [1986] for an engaging history of regression analysis.)
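To make the population model concrete, here is a minimal simulation sketch. Everything specific in it, the parameter values β₀ = 3 and β₁ = 2 and the normal distributions chosen for x and u, is a hypothetical choice for illustration; nothing in (2.1) itself dictates these.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population parameters (illustration only)
beta0, beta1 = 3.0, 2.0
n = 1000

x = rng.normal(loc=5.0, scale=2.0, size=n)  # observed explanatory variable
u = rng.normal(loc=0.0, scale=1.0, size=n)  # all other factors affecting y
y = beta0 + beta1 * x + u                   # equation (2.1)
print(y[:5])  # five draws generated by the population model
```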
When related by (2.1), the variables y and x have several different names used interchangeably, as follows. y is called the dependent variable, the explained variable, the response variable, the predicted variable, or the regressand. x is called the independent variable, the explanatory variable, the control variable, the predictor variable, or the regressor. (The term covariate is also used for x.) The terms "dependent variable" and "independent variable" are frequently used in econometrics. But be aware that the label "independent" here does not refer to the statistical notion of independence between random variables (see Appendix B).

The terms "explained" and "explanatory" variables are probably the most descriptive. "Response" and "control" are used mostly in the experimental sciences, where the variable x is under the experimenter's control. We will not use the terms "predicted variable" and "predictor," although you sometimes see these. Our terminology for simple regression is summarized in Table 2.1.

Table 2.1 Terminology for Simple Regression

    y                       x
    ----------------------  ----------------------
    Dependent Variable      Independent Variable
    Explained Variable      Explanatory Variable
    Response Variable       Control Variable
    Predicted Variable      Predictor Variable
    Regressand              Regressor

The variable u, called the error term or disturbance in the relationship, represents factors other than x that affect y. A simple regression analysis effectively treats all factors affecting y other than x as being unobserved. You can usefully think of u as standing for "unobserved."

Equation (2.1) also addresses the issue of the functional relationship between y and x. If the other factors in u are held fixed, so that the change in u is zero, Δu = 0, then x has a linear effect on y:

    Δy = β₁Δx if Δu = 0.    (2.2)

Thus, the change in y is simply β₁ multiplied by the change in x. This means that β₁ is the slope parameter in the relationship between y and x, holding the other factors in u fixed; it is of primary interest in applied economics. The intercept parameter β₀ also has its uses, although it is rarely central to an analysis.
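A quick check of (2.2) in code, using hypothetical parameter values (β₀ = 1, β₁ = 0.5): holding u fixed, a one-unit change in x moves y by exactly β₁.

```python
# Hypothetical parameter values, for illustration only
beta0, beta1 = 1.0, 0.5

def y(x, u):
    """The simple regression model: y = beta0 + beta1*x + u."""
    return beta0 + beta1 * x + u

u_fixed = 2.0                          # hold the unobserved factors fixed
dy = y(5.0, u_fixed) - y(4.0, u_fixed)
print(dy)                              # 0.5 = beta1 * (5.0 - 4.0), as in (2.2)
```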
EXAMPLE 2.1 (Soybean Yield and Fertilizer)

Suppose that soybean yield is determined by the model

    yield = β₀ + β₁fertilizer + u,    (2.3)

so that y = yield and x = fertilizer. The agricultural researcher is interested in the effect of fertilizer on yield, holding other factors fixed. This effect is given by β₁. The error term u contains factors such as land quality, rainfall, and so on. The coefficient β₁ measures the effect of fertilizer on yield, holding other factors fixed: Δyield = β₁Δfertilizer.

EXAMPLE 2.2 (A Simple Wage Equation)

A model relating a person's wage to observed education and other unobserved factors is

    wage = β₀ + β₁educ + u.    (2.4)

If wage is measured in dollars per hour and educ is years of education, then β₁ measures the change in hourly wage given another year of education, holding all other factors fixed. Some of those factors include labor force experience, innate ability, tenure with current employer, work ethic, and innumerable other things.

The linearity of (2.1) implies that a one-unit change in x has the same effect on y, regardless of the initial value of x. This is unrealistic for many economic applications. For example, in the wage-education example, we might want to allow for increasing returns: the next year of education has a larger effect on wages than did the previous year. We will see how to allow for such possibilities in Section 2.4.

The most difficult issue to address is whether model (2.1) really allows us to draw ceteris paribus conclusions about how x affects y. We just saw in equation (2.2) that β₁ does measure the effect of x on y, holding all other factors (in u) fixed. Is this the end of the causality issue? Unfortunately, no. How can we hope to learn in general about the ceteris paribus effect of x on y, holding other factors fixed, when we are ignoring all those other factors? As we will see in Section 2.5, we are only able to get reliable estimators of β₀ and β₁ from a random sample of data when we make an assumption restricting how the unobservable u is related to the explanatory variable x. Without such a restriction, we will not be able to estimate the ceteris paribus effect, β₁. Because u and x are random variables, we need a concept grounded in probability. A concrete illustration of the danger appears in the sketch below.
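The following simulation sketch shows why some restriction is unavoidable. The population below is entirely hypothetical: ability, which is part of u, raises both education and wage, so u is correlated with x. The slope of the least-squares line through the resulting data then overstates the ceteris paribus effect β₁.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical population: ability raises both education and wage
abil = rng.standard_normal(n)
educ = 12 + 2 * abil + rng.standard_normal(n)  # x is correlated with ability
beta0, beta1 = -0.9, 0.54                      # hypothetical true parameters
wage = beta0 + beta1 * educ + abil             # ability plays the role of u

# Slope of the least-squares line through the data
slope = np.polyfit(educ, wage, deg=1)[0]
print(slope)  # about 0.94, well above beta1 = 0.54: the ability effect leaks in
```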
Before we state the key assumption about how x and u are related, there is one assumption about u that we can always make. As long as the intercept β₀ is included in the equation, nothing is lost by assuming that the average value of u in the population is zero. Mathematically,

    E(u) = 0.    (2.5)

Importantly, assumption (2.5) says nothing about the relationship between u and x but simply makes a statement about the distribution of the unobservables in the population.

Using the previous examples for illustration, we can see that assumption (2.5) is not very restrictive. In Example 2.1, we lose nothing by normalizing the unobserved factors affecting soybean yield, such as land quality, to have an average of zero in the population of all cultivated plots. The same is true of the unobserved factors in Example 2.2. Without loss of generality, we can assume that things such as average ability are zero in the population of all working people. If you are not convinced, you can work through Problem 2.2 to see that we can always redefine the intercept in equation (2.1) to make (2.5) true.

We now turn to the crucial assumption regarding how u and x are related. A natural measure of the association between two random variables is the correlation coefficient. (See Appendix B for definition and properties.) If u and x are uncorrelated, then, as random variables, they are not linearly related. Assuming that u and x are uncorrelated goes a long way toward defining the sense in which u and x should be unrelated in equation (2.1). But it does not go far enough, because correlation measures only linear dependence between u and x. Correlation has a somewhat counterintuitive feature: it is possible for u to be uncorrelated with x while being correlated with functions of x, such as x². (See Section B.4 for further discussion.) This possibility is not acceptable for most regression purposes, as it causes problems for interpreting the model and for deriving statistical properties. A better assumption involves the expected value of u given x.

Because u and x are random variables, we can define the conditional distribution of u given any value of x. In particular, for any x, we can obtain the expected (or average) value of u for that slice of the population described by the value of x. The crucial assumption is that the average value of u does not depend on the value of x. We can write this as

    E(u|x) = E(u) = 0,    (2.6)

where the second equality follows from (2.5). The first equality in equation (2.6) is the new assumption, called the zero conditional mean assumption. It says that, for any given value of x, the average of the unobservables is the same and therefore must equal the average value of u in the entire population.

Let us see what (2.6) entails in the wage example. To simplify the discussion, assume that u is the same as innate ability. Then (2.6) requires that the average level of ability is the same regardless of years of education. For example, if E(abil|8) denotes the average ability for the group of all people with eight years of education, and E(abil|16) denotes the average ability among people in the population with 16 years of education, then (2.6) implies that these must be the same. In fact, the average ability level must be the same for all education levels. If, for example, we think that average ability increases with years of education, then (2.6) is false. (This would happen if, on average, people with more ability choose to become more educated.) As we cannot observe innate ability, we have no way of knowing whether or not average ability is the same for all education levels. But this is an issue that we must address before applying simple regression analysis.
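The distinction between zero correlation and a zero conditional mean is easy to demonstrate by simulation. In the constructed example below (not drawn from any real population), u = x² − 1 is uncorrelated with x, yet E(u|x) is far from zero, so assumption (2.6) fails even though the correlation is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
u = x**2 - 1                       # E(u) = 0, but u is a function of x

print(np.corrcoef(x, u)[0, 1])     # ~0: u and x are uncorrelated
print(u[np.abs(x) < 0.1].mean())   # ~ -1: average u near x = 0 is below zero
print(u[np.abs(x) > 2.0].mean())   # ~ 4.7: average u for large |x| is above zero
```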
In the fertilizer example, if fertilizer amounts are chosen independently of other features of the plots, then (2.6) will hold: the average land quality will not depend on the amount of fertilizer. However, if more fertilizer is put on the higher quality plots of land, then the expected value of u changes with the level of fertilizer, and (2.6) fails.

QUESTION 2.1
Suppose that a score on a final exam, score, depends on classes attended (attend) and unobserved factors that affect exam performance (such as student ability):

    score = β₀ + β₁attend + u.    (2.7)

When would you expect this model to satisfy (2.6)?

Assumption (2.6) gives β₁ another interpretation that is often useful. Taking the expected value of (2.1) conditional on x and using E(u|x) = 0 gives

    E(y|x) = β₀ + β₁x.    (2.8)

Equation (2.8) shows that the population regression function (PRF), E(y|x), is a linear function of x. The linearity means that a one-unit increase in x changes the expected value of y by the amount β₁.

[Figure 2.1: E(y|x) as a linear function of x.]
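A short simulation sketch of what Figure 2.1 depicts (parameter values and distributions again hypothetical): when (2.6) holds, averaging y within narrow slices of x approximately traces out the linear PRF β₀ + β₁x.

```python
import numpy as np

rng = np.random.default_rng(7)

beta0, beta1 = 3.0, 2.0          # hypothetical population parameters
n = 1_000_000
x = rng.uniform(0.0, 10.0, size=n)
u = rng.standard_normal(n)       # drawn independently of x, so E(u|x) = 0
y = beta0 + beta1 * x + u        # equation (2.1)

# Average y within narrow slices of x: a direct estimate of E(y|x)
for center in (2.0, 5.0, 8.0):
    in_slice = np.abs(x - center) < 0.05
    print(center, y[in_slice].mean(), beta0 + beta1 * center)
# The slice averages line up with beta0 + beta1*x, as in Figure 2.1.
```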