2.13

# Multiple regression exercise (Optional)

[Note: As before, the exercise in this step requires Microsoft Excel. If you already know how to run a linear regression in a different program, you can skim the instructions and move on. If you do not have Microsoft Excel and cannot take part in the exercise, just keep in mind the following key lessons. When we have data on more than two variables, 1) we can still easily quantify how they are related to each other, and 2) adding more crime-relevant variables to the regression equation usually makes the research design stronger. After this week, there will be no more data exercises involving Microsoft Excel.]

Below we will see how we can use Microsoft Excel to run a linear regression with multiple explanatory variables (“multiple regression”).

This time, we want to find the best-fitting line for the equation below, which expresses the rate of larceny as a function of three explanatory variables (unemployment, inequality, and poverty).

$\quad\quad\quad\quad$ $larceny=α_0+α_1*unemployment+α_2*inequality+α_3*poverty$

Open the attached data file, which contains information on the rates of unemployment, inequality, poverty, and larceny from the 200 largest U.S. counties in 2000.

• Step 1) We will use Excel’s Data Analysis feature to run a multiple regression. Go to Professor Colin Cameron’s webpage (from UC Davis), where you can find a short, practical guide to running a multiple regression in Microsoft Excel. Follow the steps described there and you will obtain the following result.

| Variable | Coefficient |
| --- | --- |
| Intercept | 1871.602 |
| Unemployment | -18458.1 |
| Gini Coefficient | -15.6029 |
| Poverty | 16669.36 |

This table implies that the best-fitting line for our data is: $larceny=1871.6-18458.1*unemployment-15.6*inequality+16669.4*poverty$

(Note: The table you obtain from the Data Analysis feature also includes standard errors, t-statistics, and p-values, which are crucial for determining the statistical significance of the coefficients. However, given time and space constraints, we will focus only on the signs and magnitudes of the coefficients here. Please refer to an econometrics textbook for more detail. Professor Jeffrey Wooldridge’s “Introductory Econometrics” is a great undergraduate-level textbook. Professor Colin Cameron’s “Microeconometrics: Methods and Applications” is also a great econometrics textbook, but it is pitched more at the graduate level.)
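For readers who skipped the Excel exercise, the same kind of best-fitting line can be sketched in Python. This is a minimal illustration, not the course's own procedure: the course data file is not reproduced here, so the 200 "counties" below are simulated, and the coefficients in `true_alpha` are made-up values used only to generate the synthetic outcome. The fit itself uses NumPy's `numpy.linalg.lstsq`, which solves the ordinary least squares problem.

```python
import numpy as np

# Hypothetical data: simulated values standing in for the course data file.
rng = np.random.default_rng(0)
n = 200
unemployment = rng.uniform(0.02, 0.12, n)  # rate as a proportion (assumed scale)
inequality = rng.uniform(0.35, 0.55, n)    # Gini coefficient
poverty = rng.uniform(0.05, 0.25, n)       # rate as a proportion (assumed scale)

# Made-up coefficients, used only to generate a synthetic larceny rate.
true_alpha = np.array([2000.0, -5000.0, 300.0, 8000.0])

# Design matrix: a column of ones for the intercept, then the three variables.
X = np.column_stack([np.ones(n), unemployment, inequality, poverty])
larceny = X @ true_alpha + rng.normal(0.0, 50.0, n)

# Ordinary least squares: pick the coefficients that minimise the sum of
# squared differences between observed and fitted larceny rates.
alpha_hat, *_ = np.linalg.lstsq(X, larceny, rcond=None)

names = ["Intercept", "Unemployment", "Inequality", "Poverty"]
for name, value in zip(names, alpha_hat):
    print(f"{name:<13}{value:>12.1f}")
```

With real data, the three simulated columns would be replaced by the columns of the data file, and the printed values would correspond to the "Coefficient" column of Excel's output.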

### Interpreting Regression Coefficients

The regression result tells us how larceny is related to unemployment, inequality, and poverty rates. We can use the best-fitting line for our data ($larceny=1871.6-18458.1*unemployment-15.6*inequality+16669.4*poverty$) to predict larceny rates in counties with different economic conditions. For example, in 2000 Harris County (county FIPS 48201) had an unemployment rate of 6.9 percent, a poverty rate of 15.0 percent, and a Gini coefficient of 0.44. Based on these numbers, the predicted larceny rate for Harris County is 3099.4. This predicted value is reasonably close to the observed larceny rate of 2884.2.
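The prediction step can be written out as a short calculation. The sketch below plugs the Harris County values from the text into the reported coefficients, assuming the two rates enter the equation as proportions (6.9 percent as 0.069 and 15.0 percent as 0.150); that scaling is inferred from the size of the coefficients rather than stated in the exercise. The helper name `predict_larceny` is hypothetical.

```python
# Coefficients as reported in the regression table.
coefficients = {
    "intercept": 1871.602,
    "unemployment": -18458.1,
    "inequality": -15.6029,
    "poverty": 16669.36,
}

# Harris County values from the text. Assumption: rates are proportions.
harris = {"unemployment": 0.069, "inequality": 0.44, "poverty": 0.150}


def predict_larceny(values, coef=coefficients):
    """Intercept plus each coefficient multiplied by the county's value."""
    return coef["intercept"] + sum(coef[name] * values[name] for name in values)


predicted = predict_larceny(harris)
print(round(predicted, 1))  # about 3091.5
```

With the rounded inputs quoted in the text, this gives roughly 3091.5, a few units below the 3099.4 reported above. The gap may come from rounding in the quoted county values, though that is only a guess. Either way, the prediction sits about 200 above the observed rate of 2884.2.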

### Panel Data and Cross-sectional Data

So far, we have looked at data on crime, unemployment, inequality, and poverty rates for large U.S. counties in 2000 and found the best-fitting linear equation for those data. Here, we are making an “across-unit” comparison, comparing the crime rates of counties with different levels of unemployment, inequality, and poverty. However, the difference in crime rates between counties is likely driven by many factors other than unemployment, inequality, and poverty. (For example, there are myriad crime-relevant differences between, say, New York, Detroit, and San Francisco.) We can try to collect more information on these other crime-relevant differences, but no matter how hard we try, finding data on all of them is going to be a very difficult task, to say the least.

Alternatively, we can use a “within-unit” analysis, which focuses on the change in crime rates within each county over time and tries to explain it by changes in demographic, economic, and other crime-relevant characteristics within that county over time. (In contrast, our earlier “across-unit” analysis attempted to explain the difference in crime rates across counties based on their observable characteristics at a given moment.) Note that, in order to implement this within-unit analysis, we need data on the same units of observation from multiple time periods (“panel data”). We only needed data from a single point in time (“cross-sectional data”) to run an across-unit analysis.
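The difference between the two approaches can be illustrated with a small simulation. The text does not specify how a within-unit analysis is estimated; the sketch below assumes one common choice, subtracting each county's own average over time before fitting the line (often called the within or fixed-effects transformation). All data are simulated, and `county_effect`, `true_slope`, and the other numbers are made-up values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_counties, n_years = 50, 6

# Hypothetical panel: each county has an unobserved, time-constant factor
# that raises crime and is also correlated with its unemployment rate.
county_effect = rng.normal(0.0, 500.0, n_counties)
unemployment = (0.06 + 0.00002 * county_effect[:, None]
                + rng.normal(0.0, 0.01, (n_counties, n_years)))
true_slope = 4000.0  # made-up effect of unemployment on larceny
larceny = (3000.0 + true_slope * unemployment + county_effect[:, None]
           + rng.normal(0.0, 20.0, (n_counties, n_years)))


def ols_slope(x, y):
    """Slope of the best-fitting line of y on x (intercept included)."""
    x_c, y_c = x - x.mean(), y - y.mean()
    return float((x_c * y_c).sum() / (x_c ** 2).sum())


# Across-unit comparison: pools all county-years and ignores the
# unobserved county-specific factor.
pooled = ols_slope(unemployment.ravel(), larceny.ravel())

# Within-unit comparison: subtracting each county's average over time
# removes anything that is constant within a county.
x_within = unemployment - unemployment.mean(axis=1, keepdims=True)
y_within = larceny - larceny.mean(axis=1, keepdims=True)
within = ols_slope(x_within.ravel(), y_within.ravel())

print(f"pooled: {pooled:.0f}  within: {within:.0f}  true: {true_slope:.0f}")
```

In this simulated setting the pooled slope is far from the made-up true value because it picks up the unobserved county differences, while the within-county slope lands close to it. The within approach cannot remove factors that change over time inside a county.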