8) Experiment¶
econometrics.methods@gmail.com
Last updated 9-15-2020
8.1) What is an experiment?¶
An experiment is a method that uses randomization to produce data that reveal causation.
A typical experiment consists of:
Selecting a random sample from a population
Assigning subjects at random to treatments
Comparing the responses of subjects among the treatments
The treatment is the variable manipulated by the experimenter, in order to see its effect on subjects.
In observational studies, the treatment variable is not randomized. Without randomization, it is not possible to know whether an observed difference in the response is due to the treatment or to confounding factors.
An observational study cannot answer whether a drug cures a disease or not. The treatment group might be composed of young, athletic, healthy patients, whereas the control group might consist of old, sedentary, and unhealthy patients. There is no way to separate the effects of the drug from the effects of age, physical exercise, and initial health status.
However, when the treatment variable is randomized, the control group and the treatment group are, on average, alike. Therefore, all confounding factors are automatically eliminated.
8.2) What is the average causal effect?¶
As long as the treatment variable is randomized, the average causal effect is the difference in group means between the treatment group (\(T=1\)) and the control group (\(T=0\)). Let’s assume that \(Y\) is the outcome variable:
\[\text{Average Causal Effect} = \bar{Y}_{t} - \bar{Y}_{c}\]
where \(t\) is the treatment group, that is, when \(T=1\); and \(c\) is the control group, that is, when \(T=0\).
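To make the definition concrete, here is a minimal sketch with hypothetical numbers (not from any real study): the estimated average causal effect is simply the mean outcome of the treated minus the mean outcome of the controls.
import pandas as pd

# Hypothetical toy data: T = treatment indicator, Y = outcome
toy = pd.DataFrame({'T': [1, 1, 1, 0, 0, 0],
                    'Y': [7, 9, 8, 5, 6, 4]})

# Estimated average causal effect: difference in group means
toy.loc[toy['T'] == 1, 'Y'].mean() - toy.loc[toy['T'] == 0, 'Y'].mean()  # (7+9+8)/3 - (5+6+4)/3 = 3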
8.3) Why is it easy to establish the causal effect of a drug, but hard to establish whether racial discrimination is real?¶
A drug can be assigned at random to subjects, but in practical terms skin color cannot, because it is a property of the subjects. The drug question represents the typical research question in the hard sciences, while racial discrimination is a typical question in the human sciences.
When we see lower wages among Blacks, it is hard to distinguish whether this is an effect of racial discrimination or of lower levels of education, networking, etc.
However, Bertrand & Mullainathan (2004) had the clever idea of manipulating race on curricula vitae (CVs). They randomly assigned a Black-sounding name (e.g., Lakisha or Jamal) to half of the CVs and a White-sounding name (e.g., Emily or Greg) to the other half. They sent the CVs in response to real job openings in Chicago and Boston. The result was that only 6.4% of the CVs with Black names received a callback for an interview, whereas 9.6% of the CVs with White names did.
Let’s open the dataset from Bertrand & Mullainathan (2004).
import pandas as pd

# Load the resume audit-study dataset (Stata format) from Bertrand & Mullainathan (2004)
path = "https://github.com/causal-methods/Data/raw/master/"
df = pd.read_stata(path + "lakisha_aer.dta")
df
|  | id | ad | education | ofjobs | yearsexp | honors | volunteer | military | empholes | occupspecific | ... | compreq | orgreq | manuf | transcom | bankreal | trade | busservice | othservice | missind | ownership |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 1 | 4 | 2 | 6 | 0 | 0 | 0 | 1 | 17 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 1 | b | 1 | 3 | 3 | 6 | 0 | 1 | 1 | 0 | 316 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 2 | b | 1 | 4 | 1 | 6 | 0 | 0 | 0 | 0 | 19 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 3 | b | 1 | 3 | 4 | 6 | 0 | 1 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 4 | b | 1 | 3 | 3 | 22 | 0 | 0 | 0 | 0 | 313 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | Nonprofit |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4865 | b | 99 | 3 | 2 | 1 | 0 | 0 | 0 | 1 | 313 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | Private |
| 4866 | a | 99b | 4 | 4 | 6 | 0 | 0 | 0 | 0 | 285 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |  |
| 4867 | a | 99b | 4 | 6 | 8 | 0 | 1 | 0 | 0 | 21 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |  |
| 4868 | a | 99b | 4 | 4 | 2 | 0 | 1 | 1 | 0 | 267 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |  |
| 4869 | a | 99b | 4 | 3 | 7 | 0 | 0 | 0 | 1 | 274 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |  |
4870 rows × 65 columns
Let’s restrict the analysis to the variables ‘call’ and ‘race’.
call: 1 = applicant was called back for an interview; 0 otherwise.
race: w = White, and b = Black.
# Display 4 decimal places
pd.set_option('display.precision', 4)

import numpy as np

# Keep only the callback indicator and race, then compare the two groups
callback = df.loc[:, ['call', 'race']]
callback.groupby('race').agg([np.size, np.mean])
| race | call (size) | call (mean) |
|---|---|---|
| b | 2435.0 | 0.0645 |
| w | 2435.0 | 0.0965 |
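Applying the definition of Section 8.2 to these data, the estimated average causal effect of a White-sounding name on the callback rate is the difference in group means. The sketch below assumes `df` was loaded as above.
# Callback rate by race and the difference in group means
rates = df.groupby('race')['call'].mean()
rates['w'] - rates['b']
The gap is about 3.2 percentage points; relative to the Black callback rate of 6.4%, White-sounding names receive roughly 50% more callbacks.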
8.4) How to check the integrity of an experimental study?¶
Based on theory, we know that the randomization of the treatment variable will produce a control group similar to the treatment group.
Let’s check the proportion of Blacks and Whites with a college degree in the dataset from Bertrand & Mullainathan (2004).
Originally, in the variable ‘education’, a college graduate was coded as 4; 3 = some college; 2 = high school graduate; 1 = some high school; 0 = not reported.
Let’s create the variable ‘college’: 1 if the applicant has a college degree, and 0 otherwise.
# college = 1 if the applicant has a college degree (education == 4), 0 otherwise
df['college'] = np.where(df['education'] == 4, 1, 0)
We can see that 72.3% of Black applicants have a college degree. The proportion of Whites with a college degree is very similar: 71.6%.
college = df.loc[:, ['college', 'race']]
college.groupby('race').agg([np.size, np.mean])
| race | college (size) | college (mean) |
|---|---|---|
| b | 2435 | 0.7228 |
| w | 2435 | 0.7162 |
Let’s check this statement for other factors in the CVs. The names of the variables are self-explanatory, and more information can be obtained by reading the paper by Bertrand & Mullainathan (2004).
resume = ['college', 'yearsexp', 'volunteer', 'military',
'email', 'workinschool', 'honors',
'computerskills', 'specialskills']
both = df.loc[:, resume]
both.head()
|  | college | yearsexp | volunteer | military | email | workinschool | honors | computerskills | specialskills |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 6 | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| 2 | 1 | 6 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 0 | 6 | 1 | 0 | 1 | 0 | 0 | 1 | 1 |
| 4 | 0 | 22 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
Let’s use different code to calculate the mean of the variables for the whole sample (both Whites and Blacks) and for the samples split between Blacks and Whites.
Note that the average years of experience (yearsexp) is 7.84 for the whole sample, 7.83 for Blacks, and 7.86 for Whites.
If you check all the variables, the mean values for Blacks are very close to the mean values for Whites. This is a consequence of randomization.
We also calculate the standard deviation (std), a measure of variation around the mean. Note that the standard deviation is pretty much the same in the whole sample and in the split samples. Like the means, the standard deviations are not supposed to differ much in experimental data.
The standard deviation of years of experience is about 5 years. Roughly speaking, most observations (about 68%) lie between 1 standard deviation below the mean and 1 standard deviation above it, that is, in the interval [2.84, 12.84].
# Split the resume characteristics by race
black = both[df['race'] == 'b']
white = both[df['race'] == 'w']

# Means and standard deviations for the whole sample and for each group
summary = {'mean_both': both.mean(), 'std_both': both.std(),
           'mean_black': black.mean(), 'std_black': black.std(),
           'mean_white': white.mean(), 'std_white': white.std()}

pd.DataFrame(summary)
|  | mean_both | std_both | mean_black | std_black | mean_white | std_white |
|---|---|---|---|---|---|---|
| college | 0.7195 | 0.4493 | 0.7228 | 0.4477 | 0.7162 | 0.4509 |
| yearsexp | 7.8429 | 5.0446 | 7.8296 | 5.0108 | 7.8563 | 5.0792 |
| volunteer | 0.4115 | 0.4922 | 0.4144 | 0.4927 | 0.4086 | 0.4917 |
| military | 0.0971 | 0.2962 | 0.1018 | 0.3025 | 0.0924 | 0.2897 |
| email | 0.4793 | 0.4996 | 0.4797 | 0.4997 | 0.4789 | 0.4997 |
| workinschool | 0.5595 | 0.4965 | 0.5610 | 0.4964 | 0.5581 | 0.4967 |
| honors | 0.0528 | 0.2236 | 0.0513 | 0.2207 | 0.0542 | 0.2265 |
| computerskills | 0.8205 | 0.3838 | 0.8324 | 0.3735 | 0.8086 | 0.3935 |
| specialskills | 0.3287 | 0.4698 | 0.3273 | 0.4693 | 0.3302 | 0.4704 |
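A complementary way to check balance is to compute the raw Black–White differences in means directly; under randomization they should all be close to zero. This is a minimal sketch using the `black` and `white` DataFrames defined above.
# Difference in mean resume characteristics: Black-named minus White-named CVs
black.mean() - white.mean()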
8.5) Test if Whites and Blacks have the same average years of experience.¶
The average years of experience is 7.86 for Whites and 7.83 for Blacks. They are pretty much the same, but let’s be formal and carry out a statistical test of the null hypothesis (\(H_0\)) that the means are equal:
\[H_0: \mu_{w} = \mu_{b}\]
The t-statistic is:
\[t = \frac{\bar{Y}_{w} - \bar{Y}_{b}}{s_p\sqrt{\frac{1}{n_w} + \frac{1}{n_b}}}\]
where the pooled standard deviation (\(s_p\)) is:
\[s_p = \sqrt{\frac{(n_w - 1)s_w^2 + (n_b - 1)s_b^2}{n_w + n_b - 2}}\]
The t-statistic is 0.18 and the respective p-value is 0.85. Therefore, we cannot reject \(H_0\).
from scipy.stats import ttest_ind

# Split the full dataset by race
white = df[df['race'] == 'w']
black = df[df['race'] == 'b']

# Two-tailed t-test assuming equal variances (pooled standard deviation)
TwoTail = ttest_ind(white['yearsexp'], black['yearsexp'], equal_var=True)
TwoTail
Ttest_indResult(statistic=0.18461970685747395, pvalue=0.8535350182481283)
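To see how the formula above maps to the scipy output, here is a hand computation of the pooled t-statistic; it is a sketch that should reproduce the statistic reported by `ttest_ind`.
import numpy as np

yw, yb = white['yearsexp'], black['yearsexp']
nw, nb = len(yw), len(yb)

# Pooled standard deviation
sp = np.sqrt(((nw - 1) * yw.var(ddof=1) + (nb - 1) * yb.var(ddof=1)) / (nw + nb - 2))

# t-statistic for the difference in mean years of experience
(yw.mean() - yb.mean()) / (sp * np.sqrt(1 / nw + 1 / nb))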
8.6) Test if the proportion of college degrees is the same for Whites and Blacks.¶
Let \(p_{w}\) be the proportion of Whites with a college degree, and \(p_{b}\) the proportion of Blacks with a college degree.
The null hypothesis (\(H_0\)) is:
\[H_0: p_{w} = p_{b}\]
The z-statistic is:
\[z = \frac{\hat{p}_{w} - \hat{p}_{b}}{se(\hat{p}_{w} - \hat{p}_{b})}\]
where \(se(\hat{p}_{w}-\hat{p}_{b})\) is the standard error of the difference between the sample proportions:
\[se(\hat{p}_{w} - \hat{p}_{b}) = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_w} + \frac{1}{n_b}\right)}\]
where \(\hat{p}\) is the pooled proportion of college degrees in the combined sample, \(n_w\) is the sample size of Whites, and \(n_b\) is the sample size of Blacks.
The estimated proportions are \(\hat{p}_{w} = 0.7162\) and \(\hat{p}_{b} = 0.7228\). Then:
\[z = \frac{0.7162 - 0.7228}{0.0129} \approx -0.51\]
The p-value of the z-statistic is 0.61. Therefore, we cannot reject the null hypothesis (\(H_0\)) at any reasonable level of significance.
from statsmodels.stats.proportion import proportions_ztest

# Number of applicants with a college degree in each group
count = np.array([sum(white['college']), sum(black['college'])])

# Sample size of each group
nobs = np.array([len(white['college']), len(black['college'])])

proportions_ztest(count, nobs)
(-0.510360512459463, 0.6097989158445807)
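Analogously, the z-statistic can be computed by hand with the pooled proportion; this sketch should reproduce the statistic reported by `proportions_ztest`.
import numpy as np

# Sample proportions and sizes
pw, pb = white['college'].mean(), black['college'].mean()
nw, nb = len(white), len(black)

# Pooled proportion and standard error of the difference
p_pool = (white['college'].sum() + black['college'].sum()) / (nw + nb)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / nw + 1 / nb))

# z-statistic for the difference in proportions
(pw - pb) / se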
8.7) Why does the randomization of the treatment variable make the control group and the treatment group similar?¶
The answer is the Law of Large Numbers (LLN).
A sample average can be brought as close as desired to the average of the population from which it is drawn simply by enlarging the sample.
If the samples are large enough, those in randomly assigned treatment and control samples will be similar, because both groups come from the same population.
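The simulation below is a minimal sketch with hypothetical data (not from the paper): as the sample size grows, the difference in means between two randomly assigned groups shrinks toward zero.
import numpy as np

rng = np.random.default_rng(0)

# A large hypothetical population of some characteristic (e.g., years of experience)
population = rng.normal(loc=8, scale=5, size=1_000_000)

for n in [100, 10_000, 1_000_000]:
    sample = rng.choice(population, size=n, replace=False)
    coin = rng.integers(0, 2, size=n)  # random assignment to treatment (1) or control (0)
    diff = sample[coin == 1].mean() - sample[coin == 0].mean()
    print(n, round(diff, 3))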
8.8) When is the difference in group means biased, failing to capture the average causal effect?¶
Unless the treatment variable (\(T\)) is randomized, the difference in group means will be biased upward or downward.
Let’s assume that for a person \(i\) the treatment outcome (\(Y_{ti}\)) is equal to the control outcome (\(Y_{ci}\)) plus the causal effect (\(\alpha\)):
\[Y_{ti} = Y_{ci} + \alpha\]
Then, the difference in group means is:
\[E[Y_{i} \mid T_{i}=1] - E[Y_{i} \mid T_{i}=0] = E[Y_{ti} \mid T_{i}=1] - E[Y_{ci} \mid T_{i}=0]\]
Therefore, substituting \(Y_{ti} = Y_{ci} + \alpha\):
\[E[Y_{i} \mid T_{i}=1] - E[Y_{i} \mid T_{i}=0] = \underbrace{\alpha}_{\text{average causal effect}} + \underbrace{E[Y_{ci} \mid T_{i}=1] - E[Y_{ci} \mid T_{i}=0]}_{\text{selection bias}}\]
If the treatment variable (\(T\)) is randomized, the control outcome (\(Y_{c}\)) will be independent of the treatment (\(T\)).
Consequently, the bias term vanishes:
\[E[Y_{ci} \mid T_{i}=1] - E[Y_{ci} \mid T_{i}=0] = 0\]
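A simulation sketch with hypothetical numbers makes the decomposition concrete: when subjects with better control outcomes select into the treatment, the difference in group means overstates \(\alpha\); under random assignment the selection bias disappears.
import numpy as np

rng = np.random.default_rng(1)
n, alpha = 100_000, 2.0                 # true causal effect is 2

y_control = rng.normal(10, 3, size=n)   # control outcome Y_c
y_treat = y_control + alpha             # treatment outcome Y_t = Y_c + alpha

# (a) Self-selection: subjects with higher Y_c take the treatment -> biased estimate
t_sel = (y_control > np.median(y_control)).astype(int)
y_obs = np.where(t_sel == 1, y_treat, y_control)
print(y_obs[t_sel == 1].mean() - y_obs[t_sel == 0].mean())  # well above 2

# (b) Randomized treatment -> the bias term vanishes
t_rnd = rng.integers(0, 2, size=n)
y_obs = np.where(t_rnd == 1, y_treat, y_control)
print(y_obs[t_rnd == 1].mean() - y_obs[t_rnd == 0].mean())  # close to 2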
8.9) Does association imply causation?¶
No. Causation can only be inferred from a randomized experiment. Consumption of ice cream is positively correlated with skin cancer, but ice cream is not a major cause of skin cancer. It is likely that sunlight increases both the consumption of ice cream and the incidence of skin cancer.
Exercises¶
1| In the literature of racial discrimination, there are more than 1000 observational studies for each experimental study. Suppose you read 100 observational studies that indicate that racial discrimination is real. Suppose that you also read 1 experimental study that claims no evidence of racial discrimination. Are you more inclined to accept the result of 100 observational studies or the result of the experimental study? Justify your answer.
2| Interpret the 4 values of the contingency table below. Specifically, state the meaning of the values and compare them. Argue whether or not there is stronger evidence of racial discrimination.
‘race’: w = White, and b = Black.
‘h’: 1 = higher quality curriculum vitae; 0 = lower quality curriculum vitae. This variable was randomized as well.
# Mean callback rate by race and resume quality (h)
contingency_table = pd.crosstab(df['race'], df['h'],
                                values=df['call'], aggfunc='mean')
contingency_table
| race | h = 0.0 | h = 1.0 |
|---|---|---|
| b | 0.0619 | 0.0670 |
| w | 0.0850 | 0.1079 |
3| A student is worried about the time she spends in traffic getting to the university. She times the drive for a couple of weeks and finds that it averages 30 minutes.
a) The next day, she tries public transit and it takes 35 minutes. The next day, she’s back on the roads, convinced that driving is quicker. Does her decision make sense?
b) If the student decides to do more testing, how should she decide on the mode of transportation? Should she, for example, drive for a week and then take public transit for a week? What advice would you offer?
4| Give 3 examples of spurious correlations that people might think are cases of causation. Cite the reference/source of your examples.
5| Use the data from Bertrand & Mullainathan (2004) to test if Whites and Blacks have the same probability of having an email address on the resume. Write the null hypothesis (\(H_0\)). Interpret the result.
6| Use the data from Bertrand & Mullainathan (2004) to test if Whites and Blacks have the same probability of listing computer skills on the resume. Write the null hypothesis (\(H_0\)). Interpret the result.
Reference¶
Adhikari, A., DeNero, J. (2020). Computational and Inferential Thinking: The Foundations of Data Science
Bertrand, Marianne, and Sendhil Mullainathan. (2004). Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. American Economic Review, 94 (4): 991-1013.
Diez, D. M., Barr, C. D., Çetinkaya-Rundel, M. (2014). Introductory Statistics with Randomization and Simulation.
Lau, S., Gonzalez, J., Nolan, D. (2020). Principles and Techniques of Data Science.