Skip to 0 minutes and 1 second Now that we know the basic intuition behind linear regression, let’s work through a simple example. This example is from my own research project on inequality and crime. To give you a little background, many researchers have argued that more inequality will cause more crime. Sociologists and criminologists had their own reasons, but economists’ main rationale came
Skip to 0 minutes and 24 seconds from the rational choice model we saw earlier: an individual will choose to commit a crime if the net gains from committing a crime are greater than the net gains from not committing a crime. Suppose you are a low-income person, which implies that your u is likely to be low. As income inequality goes up and your wealthy neighbor becomes even wealthier, you may find that you can gain more by stealing from your wealthy neighbor. In other words, the difference between us and u is likely to increase as inequality goes up, and you are more likely to find that the expected gains from committing a crime are greater than the expected gains from not committing a crime.
Skip to 1 minute and 7 seconds Therefore, economists argued, more individuals are likely to become criminals as inequality goes up. To test this prediction using actual data, we first have to collect data on inequality and crime. In my research project, I obtained data on the Gini coefficient, which is a widely used measure of inequality, and the rate of larceny in large U.S. counties in 2000. The data points look like this. What do you think the relationship between inequality and larceny is, based on this figure? I would say the relationship is positive, and when we draw the best-fitted line using Microsoft Excel, the results show that the relationship is indeed positive. The best-fitted line is consistent with the prediction that more inequality leads to more larceny.
Skip to 1 minute and 59 seconds Furthermore, the regression results suggest that a 0.01 increase in the Gini coefficient is associated with 63 more larcenies per 100,000 residents. This is an overly simplified version of an empirical analysis of inequality and crime, but it still gives us an idea of how to run a basic analysis. We first collect data on inequality and crime, and then run a linear regression to quantify the relationship between the two. The line that best fits these data points allows us to predict crime rates based on the level of inequality. Up to this point, everything looks pretty straightforward. However, there is a big limitation in our empirical analysis so far.
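To make the steps just described a bit more concrete, here is a minimal sketch in Python rather than Excel. It does not use the county data from the lecture: the numbers are synthetic, and the intercept, noise level, and sample size are assumptions chosen only so that the true slope matches the "63 more larcenies per 0.01 Gini" figure. The helper name `predict_larceny` is also made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, made-up data (NOT the lecture's county data).
# Assumed true relationship: larceny rate = 500 + 6300 * Gini + noise,
# i.e. a 0.01 rise in the Gini goes with 63 more larcenies per 100,000.
n = 500
gini = rng.uniform(0.35, 0.55, size=n)
larceny = 500.0 + 6300.0 * gini + rng.normal(0.0, 300.0, size=n)

# Fit the best-fitting straight line by ordinary least squares.
slope, intercept = np.polyfit(gini, larceny, deg=1)

# Read the slope the way the lecture does: change per 0.01 Gini points.
effect_per_001 = slope * 0.01


def predict_larceny(g):
    """Predicted larcenies per 100,000 residents at Gini value g."""
    return intercept + slope * g


print(f"estimated change per 0.01 Gini: {effect_per_001:.1f}")
print(f"predicted rate at Gini = 0.45: {predict_larceny(0.45):.0f}")
```

The fitted slope is a per-unit change in the Gini coefficient, so it has to be multiplied by 0.01 to get the kind of statement quoted in the lecture.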
Skip to 2 minutes and 49 seconds Linear regression can be very helpful in allowing us to make predictions, but it’s not very helpful in telling us whether more inequality causes more crime. In other words, linear regression is very good at picking up correlation, but not very good at establishing causation. Suppose that in areas with high inequality, we see high crime rates. Does this mean that more inequality causes more crime? Not really, right? There may be another factor that’s associated with both inequality and crime. For example, a high share of youth population may cause both inequality and crime to go up.
Skip to 3 minutes and 35 seconds In this case, when we regress crime rates on the level of inequality, even if the causal effect of inequality on crime is zero, we will still find a positive relationship between inequality and crime. That’s because areas with many young people will have high inequality and high crime, while areas with very few young people will have low inequality and low crime. To account for such possibilities, we usually include in our linear regression equation other variables that we believe may influence our outcome variable.
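The youth-share story above can be checked with a small simulation. This is only an illustrative sketch: the data are synthetic, and every coefficient below is an assumption. The key design choice is that the larceny rate is generated from the youth share alone, so the true causal effect of inequality on crime is exactly zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, made-up data. All coefficients are illustrative assumptions.
n = 5000
youth = rng.uniform(0.10, 0.30, size=n)  # share of young residents

# The youth share pushes up BOTH inequality and crime ...
gini = 0.30 + 0.50 * youth + rng.normal(0.0, 0.02, size=n)
# ... while the Gini itself does not appear in the crime equation at all.
larceny = 1000.0 + 8000.0 * youth + rng.normal(0.0, 200.0, size=n)

# Simple regression of crime on inequality only.
naive_slope, naive_intercept = np.polyfit(gini, larceny, deg=1)
correlation = np.corrcoef(gini, larceny)[0, 1]

print(f"slope on Gini, youth share omitted: {naive_slope:.0f}")
print(f"correlation between Gini and larceny: {correlation:.2f}")
```

Even though inequality has no effect on crime by construction, the fitted line slopes upward, because the regression attributes the influence of the omitted youth share to the Gini coefficient.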
Skip to 4 minutes and 11 seconds For example, instead of trying to find a simple line that fits our data on inequality and larceny, we may want to include more variables in the equation to separate the effects of inequality on crime from the effects of other factors on crime. To describe intuitively what we are trying to achieve with this simple extension: in our last example, we were comparing crime rates in cities with different levels of inequality, and trying to see whether crime rates were higher in cities with high inequality or low inequality.
Skip to 4 minutes and 46 seconds This time, we are trying to compare crime rates in cities that have different levels of inequality but are comparable in other attributes, such as the share of youth, the share of low-income population, the poverty rate, and so on. If we still find that crime rates are higher in cities with high inequality, then we can be more confident that high inequality was the main driving force behind the high crime rates in such cities. So is the problem solved if we just keep adding more variables to the equation? The answer is no. Sometimes data on important characteristics that should have a large impact on crime, and that need to be included in the estimating equation, will not be available at all.
Skip to 5 minutes and 34 seconds For example, many believe that how much the public trusts and cooperates with the police should have a large impact on crime, and I want to include this information in my regression equation. But reliable and accurate data on this measure of public trust are usually not available. What are we to do in this case? In the next video, we will see how this problem can be mitigated by using something called panel data.
If the positive relationship between larceny and inequality is in fact driven by another factor that is correlated with both larceny and inequality (e.g., the share of youth population), including the share of youth population as another explanatory variable in the linear regression can help separate the effect of inequality on crime from the effect of that other factor.
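As a rough illustration of this point, the sketch below fits the regression twice on synthetic data: once with the Gini coefficient alone, and once with the youth share added as a second explanatory variable. The data and all coefficients are made up for illustration (the true effect of the Gini on larceny is set to zero), and NumPy's least-squares solver stands in for the regression tool.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic, made-up data: youth share drives both inequality and crime,
# and the true effect of the Gini coefficient on larceny is zero.
n = 5000
youth = rng.uniform(0.10, 0.30, size=n)
gini = 0.30 + 0.50 * youth + rng.normal(0.0, 0.02, size=n)
larceny = 1000.0 + 8000.0 * youth + rng.normal(0.0, 200.0, size=n)

# Regression 1: larceny on a constant and the Gini only.
X_simple = np.column_stack([np.ones(n), gini])
beta_simple, *_ = np.linalg.lstsq(X_simple, larceny, rcond=None)

# Regression 2: larceny on a constant, the Gini, and the youth share.
X_multi = np.column_stack([np.ones(n), gini, youth])
beta_multi, *_ = np.linalg.lstsq(X_multi, larceny, rcond=None)

gini_coef_simple = beta_simple[1]
gini_coef_multi = beta_multi[1]
youth_coef_multi = beta_multi[2]

print(f"Gini coefficient, youth share omitted:  {gini_coef_simple:.0f}")
print(f"Gini coefficient, youth share included: {gini_coef_multi:.0f}")
print(f"youth share coefficient:                {youth_coef_multi:.0f}")
```

Once the youth share is held fixed, the estimated coefficient on the Gini falls from a large positive number to something close to zero. This only works because the confounding variable is observed; it does not help with factors for which no data are available.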
In the next step, we will see how to run linear regression with multiple explanatory variables using Microsoft Excel.
© Songman Kang, Hanyang University