10) Multiple Regression¶
Vitor Kamada
econometrics.methods@gmail.com
Last updated 8-2-2020
10.1) What is a multiple regression model?¶
Multiple regression is a model that measures how explanatory variables \(X_1, X_2, ..., X_k\) are associated with a response variable \(Y\):
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_k X_k + \epsilon$$
where the error term (\(\epsilon\)) follows a normal distribution with mean 0 and variance \(\sigma^2_\epsilon\):
$$\epsilon \sim N(0, \sigma^2_\epsilon)$$
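As a quick illustration, the sketch below simulates data from a two-variable version of the equation above and fits it with statsmodels. The coefficients (1, 2, 3), the error standard deviation (0.5), and the sample size are arbitrary values chosen only for this example.
# Illustrative example: simulate data from the model above and fit it by OLS
import numpy as np
import pandas as pd
import statsmodels.api as sm
np.random.seed(0)
n = 1000
example = pd.DataFrame({'X1': np.random.normal(size=n),
                        'X2': np.random.normal(size=n)})
# True model: Y = 1 + 2*X1 + 3*X2 + error, with error ~ N(0, 0.5**2)
example['Y'] = (1 + 2*example['X1'] + 3*example['X2']
                + np.random.normal(scale=0.5, size=n))
example['Intercept'] = 1
fit = sm.OLS(example['Y'], example[['Intercept', 'X1', 'X2']]).fit()
print(fit.params)              # estimates close to the true coefficients 1, 2, and 3
print(np.sqrt(fit.mse_resid))  # estimate close to the true error std. deviation 0.5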
10.2) What is the consequence of running a simple regression rather than a multiple regression?¶
Assume the real model is:
$$Y = \beta_0 + \beta_{1m} X_1 + \beta_{2m} X_2 + ... + \beta_{km} X_k + \epsilon_m$$
where \(m\) stands for multiple regression.
A simple regression assumes that all other explanatory variables are inside the error term:
$$Y = \beta_0 + \beta_{1s} X_1 + \epsilon_s$$
where \(s\) stands for simple regression.
The error term is the sum of all omitted variables in the regression. If these omitted variables are uncorrelated with the explanatory variable of interest (\(X_1\)), then \(\beta_{1s} = \beta_{1m}\).
Note that this is a special case:
$$Cov(X_1, \epsilon_s) = 0$$
which holds when \(X_1\) is randomized.
Therefore, for observational studies, \(\beta_{1s}\) is likely to differ from \(\beta_{1m}\). In this case, we say that \(\beta_{1s}\) is biased and captures spurious effects.
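To see the omitted variable bias concretely, the sketch below simulates data in which \(X_2\) is correlated with \(X_1\); leaving \(X_2\) out of the regression inflates the estimated coefficient of \(X_1\). All variable names and numbers here are invented purely for illustration.
# Illustrative simulation of omitted variable bias
import numpy as np
import pandas as pd
import statsmodels.api as sm
np.random.seed(0)
n = 10000
sim = pd.DataFrame({'X1': np.random.normal(size=n)})
sim['X2'] = 0.8*sim['X1'] + np.random.normal(size=n)  # X2 is correlated with X1
sim['Y'] = 1 + 2*sim['X1'] + 3*sim['X2'] + np.random.normal(size=n)
sim['Intercept'] = 1
multiple_sim = sm.OLS(sim['Y'], sim[['Intercept', 'X1', 'X2']]).fit()
simple_sim = sm.OLS(sim['Y'], sim[['Intercept', 'X1']]).fit()
# The multiple regression recovers about 2 for X1; the simple regression gives
# about 2 + 3*0.8 = 4.4, because X1 picks up the effect of the omitted X2
print(multiple_sim.params['X1'], simple_sim.params['X1'])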
10.3) How to use a multiple regression to check if an experimental study is reliable?¶
Based on theory, if the treatment variable (\(T\)) was randomized, then the treatment variable will be independent of the other factors:
$$T \perp X_2, X_3, ..., X_k$$
Consequently, in an experiment, adding other factors to the regression should not change the estimated coefficient of the treatment variable (\(\beta_{1s}\)). If you see substantial changes in \(\beta_{1m}\) relative to \(\beta_{1s}\), you can infer that you are not working with experimental data.
Note that in observational studies, you must always control for other factors. Otherwise, you will have the omitted variable bias problem.
We run simple and multiple regressions using the experimental data from Bertrand & Mullainathan (2004). The Treatment coefficient barely changes across the two specifications, so we can conclude that the treatment variable was in fact randomized:
# Open data set from Bertrand & Mullainathan (2004)
import numpy as np
import pandas as pd
path = "https://github.com/causal-methods/Data/raw/master/"
df = pd.read_stata(path + "lakisha_aer.dta")
# Simple Regression
df['Intercept'] = 1
df['Treatment'] = np.where(df['race'] =='b', 1, 0)
import statsmodels.api as sm
simple = sm.OLS(df['call'], df[['Intercept', 'Treatment']],
missing='drop').fit()
print(simple.summary().tables[1])
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.0965 0.006 17.532 0.000 0.086 0.107
Treatment -0.0320 0.008 -4.115 0.000 -0.047 -0.017
==============================================================================
# Multiple Regression
other_factors = ['yearsexp', 'volunteer', 'military',
'email', 'workinschool', 'honors',
'computerskills', 'specialskills']
multiple_reg = sm.OLS(df['call'],
df[['Intercept', 'Treatment'] + other_factors],
missing='drop').fit()
print(multiple_reg.summary().tables[1])
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 0.0603 0.013 4.741 0.000 0.035 0.085
Treatment -0.0311 0.008 -4.026 0.000 -0.046 -0.016
yearsexp 0.0029 0.001 3.611 0.000 0.001 0.005
volunteer -0.0033 0.011 -0.300 0.764 -0.025 0.018
military -0.0032 0.014 -0.219 0.827 -0.031 0.025
email 0.0140 0.011 1.254 0.210 -0.008 0.036
workinschool 0.0012 0.009 0.138 0.891 -0.016 0.018
honors 0.0652 0.018 3.695 0.000 0.031 0.100
computerskills -0.0206 0.011 -1.922 0.055 -0.042 0.000
specialskills 0.0623 0.008 7.355 0.000 0.046 0.079
==================================================================================
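To make the check explicit, the short sketch below lines up the Treatment coefficient and standard error from the two models fitted above (simple and multiple_reg).
# Put the Treatment estimates from both specifications side by side
comparison = pd.DataFrame(
    {'coef': [simple.params['Treatment'], multiple_reg.params['Treatment']],
     'std err': [simple.bse['Treatment'], multiple_reg.bse['Treatment']]},
    index=['Simple Regression', 'Multiple Regression'])
print(comparison)
The coefficient moves only from about -0.0320 to -0.0311, which is consistent with the treatment having been randomized.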
10.4) In an observational study, are the results from multiple regression more credible than simple regression?¶
Overall, the results of a multiple regression are more credible in the sense that they suffer less from the problem of omitted variable bias. However, even after controlling for many factors, the estimated coefficients are likely to remain biased.
10.5) In an observational study, comparing simple and multiple regressions, how different might the coefficients be?¶
The difference can be large, even a change of sign, leading to completely divergent explanations.
For example, using data from Meyersson (2014), a simple regression indicates that in regions controlled by the Islamic party in Turkey, female high school completion is 2.58 percentage points lower than in regions controlled by a secular party. You might believe that religion limits the educational opportunities of females.
However, a multiple regression indicates the opposite: in regions controlled by the Islamic party in Turkey, female high school completion is 1.44 percentage points higher than in regions controlled by a secular party. One explanation is that Islamic rule in fact improves female high school completion. Poverty may be a confounding factor that drives both higher religiosity and poor educational outcomes. Therefore, in the simple regression, the religion variable may be capturing the poverty effect rather than the true direct effect of religion.
Note that the results of both the simple and the multiple regression are statistically significant. The magnitudes of the coefficients are also relevant, as the proportion of females between 15 and 20 years old who complete high school is only about 15.4% in Turkey.
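Before turning to the actual data, the sign flip can be reproduced with simulated data. In the sketch below a single confounder, labeled poverty, raises the chance of Islamic party control and lowers schooling; all names and coefficients are invented for illustration and are not taken from Meyersson (2014).
# Illustrative simulation: a confounder can flip the sign of a coefficient
import numpy as np
import pandas as pd
import statsmodels.api as sm
np.random.seed(1)
n = 5000
poverty = np.random.normal(size=n)
# Poorer regions are more likely to be under Islamic party control (by assumption)
islamic_rule = (0.9*poverty + np.random.normal(size=n) > 0).astype(float)
# Poverty lowers schooling; Islamic rule has a small positive direct effect (by assumption)
schooling = 0.02*islamic_rule - 0.05*poverty + np.random.normal(scale=0.05, size=n)
conf = pd.DataFrame({'Intercept': 1, 'islamic_rule': islamic_rule,
                     'poverty': poverty, 'schooling': schooling})
short_reg = sm.OLS(conf['schooling'], conf[['Intercept', 'islamic_rule']]).fit()
long_reg = sm.OLS(conf['schooling'],
                  conf[['Intercept', 'islamic_rule', 'poverty']]).fit()
# Omitting poverty produces a negative coefficient on islamic_rule;
# controlling for it recovers the small positive direct effect
print(short_reg.params['islamic_rule'], long_reg.params['islamic_rule'])
In the short regression, islamic_rule looks harmful only because it proxies for poverty, which is the same mechanism suspected in the real data.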
# Load data from Meyersson (2014)
df1 = pd.read_stata(path + "regdata0.dta")
# Simple Regression
df1['Intercept'] = 1
simple_reg = sm.OLS(df1['hischshr1520f'], df1[['Intercept', 'i94']],
missing='drop').fit()
print(simple_reg.summary().tables[1])
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.1662 0.002 83.813 0.000 0.162 0.170
i94 -0.0258 0.006 -4.505 0.000 -0.037 -0.015
==============================================================================
# Multiple Regression
control = ['lpop1994', 'ageshr19', 'merkezi']
mult_reg = sm.OLS(df1['hischshr1520f'],
df1[['Intercept', 'i94', 'vshr_islam1994'] + control],
missing='drop').fit()
print(mult_reg.summary().tables[1])
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 0.2835 0.014 19.919 0.000 0.256 0.311
i94 0.0144 0.006 2.404 0.016 0.003 0.026
vshr_islam1994 -0.0801 0.013 -6.113 0.000 -0.106 -0.054
lpop1994 0.0104 0.002 6.710 0.000 0.007 0.013
ageshr19 -0.5246 0.019 -28.095 0.000 -0.561 -0.488
merkezi 0.0582 0.004 15.284 0.000 0.051 0.066
==================================================================================
# Proportion of females between 15 and 20 years old who complete high school
df1['hischshr1520f'].mean()
0.1543141
Exercises¶
1| Interpret the 4 values of the contingency table below. Specifically, state the meaning and compare the values.
The variable ‘h’: 1 = higher quality curriculum vitae; 0 = lower quality curriculum vitae. This variable was randomized.
Other variables were previously defined.
contingency_table = pd.crosstab(df['Treatment'], df['h'],
values=df['call'], aggfunc='mean')
contingency_table
| Treatment \ h | 0.0 | 1.0 |
| --- | --- | --- |
| 0 | 0.084983 | 0.107931 |
| 1 | 0.061881 | 0.067048 |
2| I created an interaction variable ‘h_Treatment’ that is the product of the variables ‘h’ and ‘Treatment’.
How can you use the coefficients of the regression below to get the values of the contingency table in exercise 1? Show the calculations.
df['h_Treatment'] = df['h']*df['Treatment']
interaction = sm.OLS(df['call'],
df[['Intercept', 'Treatment', 'h', 'h_Treatment'] ],
missing='drop').fit()
print(interaction.summary().tables[1])
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept 0.0850 0.008 10.895 0.000 0.070 0.100
Treatment -0.0231 0.011 -2.094 0.036 -0.045 -0.001
h 0.0229 0.011 2.085 0.037 0.001 0.045
h_Treatment -0.0178 0.016 -1.142 0.253 -0.048 0.013
===============================================================================
3| Write code to produce the contingency table below:
| firstname \ h | 0.0 | 1.0 |
| --- | --- | --- |
| Aisha | 0.010000 | 0.037500 |
| Allison | 0.121739 | 0.068376 |
Inside the table are the callback rates broken down by curriculum vitae quality. What are the callback rates for Kristen and Lakisha? Why are the rates so different? Could we justify the difference in rates by arguing that one is more educated and qualified than the other?
Reference¶
Adhikari, A., & DeNero, J. (2020). Computational and Inferential Thinking: The Foundations of Data Science. Link
Bertrand, M., & Mullainathan, S. (2004). Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. American Economic Review, 94(4), 991-1013.
Diez, D. M., Barr, C. D., & Çetinkaya-Rundel, M. (2014). Introductory Statistics with Randomization and Simulation. Link
Lau, S., Gonzalez, J., & Nolan, D. (2020). Principles and Techniques of Data Science. Link
Meyersson, E. (2014). Islamic Rule and the Empowerment of the Poor and Pious. Econometrica, 82(1), 229-269.