library(wooldridge) # To get the data
library(tidyverse) # For modern data science analysis
library(skimr) # For descriptive statistics
library(stargazer) # For professional regression tables
In this case study, we will be using the Differences in Differences (DiD) method to analyze the effect of a garbage incinerator’s location on housing prices. This method is a statistical technique used in econometrics that calculates the effect of a treatment (in this case, the placement of a garbage incinerator) on an outcome (here, housing prices) by comparing the average change over time in the outcome variable for the treatment group to the average change over time for the control group. You can run and extend the analysis of this case study using the Posit cloud.
In their comprehensive study, Kiel and McClain (1995) delved into the effects that a newly constructed garbage incinerator had on the values of residential properties in the town of North Andover, located in Massachusetts. Their research was extensive and involved the use of data spanning several years. Additionally, they employed a comprehensive econometric analysis to interpret the data.
In our case study, we aim to conduct a similar analysis, albeit with a few modifications. Instead of using data from multiple years, we will limit our scope to two specific years. Furthermore, we will simplify our approach by using less complex models for our analysis. Despite these changes, the core objective of our study aligns with that of Kiel and McClain’s research.
The timeline of the incinerator’s construction plays a crucial role in our study. Post-1978, rumors began to circulate about the potential construction of a new incinerator in North Andover. These rumors materialized into reality in 1981 when the construction of the incinerator commenced. Initially, it was expected that the incinerator would become operational shortly after the beginning of its construction. However, due to unforeseen circumstances, the incinerator only started operating in 1985.
For our analysis, we will be using data on the prices of houses sold in two distinct years: 1978 and 1981. The year 1978 represents the period before the rumors of the incinerator began, while 1981 represents the year when the construction of the incinerator started.
The central hypothesis that we aim to test is that the prices of houses located close to the incinerator fell relative to the prices of houses situated farther away. This hypothesis is based on the assumption that the presence of a garbage incinerator in the vicinity could devalue the surrounding properties due to the associated environmental and health concerns. Through our study, we aim to provide evidence in favor of this hypothesis and to quantify the impact of the incinerator’s location on housing prices.
To analyze the effect of the incinerator’s location on housing prices, we need data on the prices of houses near the incinerator site (treatment group) and of comparable houses farther away (control group), observed both before and after the incinerator was proposed. The kielmc dataset from the wooldridge package provides such data for 1978 and 1981.
data(kielmc, package='wooldridge')
help("kielmc", package = "wooldridge")
According to its help page, kielmc is a data.frame with 321 observations on 25 variables. We summarize the four variables used below:
skim(kielmc[c('rprice', 'lrprice', 'nearinc', 'year')])
Name | kielmc[c(“rprice”, “lrpri… |
Number of rows | 321 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
rprice | 0 | 1 | 83721.36 | 33118.79 | 26000.00 | 59000.00 | 82000.00 | 100230.41 | 300000.00 | ▇▇▁▁▁ |
lrprice | 0 | 1 | 11.26 | 0.39 | 10.17 | 10.99 | 11.31 | 11.52 | 12.61 | ▁▆▇▃▁ |
nearinc | 0 | 1 | 0.30 | 0.46 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
year | 0 | 1 | 1979.33 | 1.49 | 1978.00 | 1978.00 | 1978.00 | 1981.00 | 1981.00 | ▇▁▁▁▆ |
An inexperienced analyst might only utilize the data from 1981 and estimate a rather simplistic model:
\[ \text {rprice} = \beta_{0} + \beta_{1} \text{nearinc} + e \hspace{1cm} (1) \]
In this equation, ‘nearinc’ is a dummy variable that equals one if the house is located near the incinerator, and zero if not. In this simple regression analysis with a single dummy variable, the intercept represents the average selling price for homes not located near the incinerator. The coefficient on ‘nearinc’ signifies the difference in the average selling price between homes near the incinerator and those that are not.
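The link between a regression on a single dummy variable and a comparison of group means can be checked on a small example. The sketch below uses a made-up toy data frame (the names toy, price and near are hypothetical and are not part of kielmc) and only base R.

```r
# Hypothetical toy data, NOT taken from kielmc: six prices and a proximity dummy
toy <- data.frame(
  price = c(100, 110, 120, 70, 80, 90),
  near  = c(0, 0, 0, 1, 1, 1)
)

# Simple regression on a single dummy variable, as in Equation (1)
fit <- lm(price ~ near, data = toy)

# Group means computed directly
mean_far  <- mean(toy$price[toy$near == 0])
mean_near <- mean(toy$price[toy$near == 1])

# The intercept matches the mean of the far group, and the slope
# matches the difference in means (near minus far)
print(coef(fit))
print(c(mean_far = mean_far, diff = mean_near - mean_far))
```

The same comparison should carry over to the real data by replacing toy with the 1981 subset of kielmc, price with rprice and near with nearinc.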
# Fit regression models:
model1981 <- lm(rprice ~ nearinc, data=kielmc, subset=(year==1981))
model1978 <- lm(rprice ~ nearinc, data=kielmc, subset=(year==1978))
# Get professional regression table:
stargazer(model1981, model1978, type="text")
===================================================================
                                  Dependent variable:
                    -----------------------------------------------
                                        rprice
                              (1)                     (2)
-------------------------------------------------------------------
nearinc                 -30,688.270***          -18,824.370***
                          (5,827.709)             (4,744.594)
Constant                101,307.500***           82,517.230***
                          (3,093.027)             (2,653.790)
-------------------------------------------------------------------
Observations                  142                     179
R2                           0.165                   0.082
Adjusted R2                  0.159                   0.076
Residual Std. Error  31,238.040 (df = 140)   29,431.960 (df = 177)
F Statistic         27.730*** (df = 1; 140) 15.741*** (df = 1; 177)
===================================================================
Note:                                   *p<0.1; **p<0.05; ***p<0.01
The regression results of column (1) indicate that homes closer to the incinerator were sold at a lower average price compared to those further away in 1981. The slope coefficient is highly statistically significant, allowing us to reject the hypothesis that the average home values near and far from the incinerator are identical.
However, Equation 1 does not necessarily suggest that the placement of the incinerator is the cause of the lower housing values. Interestingly, if we conduct the same regression for 1978 (prior to any mention of the incinerator), the results of column (2) align with those in column (1). That is, the slope coefficient is also negative. This means that even before the incinerator was a consideration, the average value of a home near the proposed site was already $18,824.37 less than the average value of a home not near the site ($82,517.23). This difference is statistically significant. Thus, the incinerator was constructed in an area where housing values were already lower.
So, how do we determine if the construction of a new incinerator has a negative impact on housing values? The answer lies in observing the change in the coefficient on ‘nearinc’ between 1978 and 1981. The difference in average housing value was much larger in 1981 than in 1978 ($30,688.27 versus $18,824.37), even when considered as a percentage of the average value of homes not near the incinerator site. The difference between the two coefficients on ‘nearinc’ is
\[ \hat{\delta}_{1}=-30,688.27-(-18,824.37)=-11,863.9 . \]
This number is our estimate of the impact of the incinerator on the values of homes in its vicinity. In the field of empirical economics, \(\hat{\delta}_{1}\) is often referred to as the difference-in-differences (DD or DID) estimator because it can be expressed as
\[ \hat{\delta}_{1}=\left(\overline{\text { rprice }}_{81, n r}-\overline{\text { rprice }}_{81, f r}\right)-\left(\overline{\text { rprice }}_{78, n r}-\overline{\text { rprice }}_{78, f r}\right), \hspace{1cm} (2) \]
where ‘nr’ denotes “near the incinerator site” and ‘fr’ denotes “farther away from the site.” In other words, \(\hat{\delta}_{1}\) is the difference over time in the average difference of housing prices in the two locations.
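The arithmetic behind Equation (2) can be reproduced from the coefficients reported in the two single-year regressions. The sketch below hard-codes those reported values (so it does not need the data) and rebuilds the four group means; the variable names are chosen here for illustration only.

```r
# Coefficients copied from the stargazer table of the single-year regressions
b0_81 <- 101307.50   # Constant, 1981: mean price far from the site
b1_81 <- -30688.27   # nearinc, 1981: near minus far
b0_78 <- 82517.23    # Constant, 1978: mean price far from the site
b1_78 <- -18824.37   # nearinc, 1978: near minus far

# Four group means implied by the coefficients
mean_81_fr <- b0_81
mean_81_nr <- b0_81 + b1_81   # 70619.23
mean_78_fr <- b0_78
mean_78_nr <- b0_78 + b1_78   # 63692.86

# Equation (2): difference across locations, then across years
delta1_hat <- (mean_81_nr - mean_81_fr) - (mean_78_nr - mean_78_fr)

# Same number, computed as difference across years, then across locations
delta1_alt <- (mean_81_nr - mean_78_nr) - (mean_81_fr - mean_78_fr)

print(round(c(delta1_hat = delta1_hat, delta1_alt = delta1_alt), 2))
```

Read the second way, average real prices rose by about 6,926 for houses near the site and by about 18,790 for houses farther away, and the gap between those two changes is the DiD estimate of about -11,864.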
To test whether \(\hat{\delta}_{1}\) is statistically different from zero, we need to calculate its standard error using a regression analysis. Indeed, \(\hat{\delta}_{1}\) can be obtained by estimating
\[ \text { rprice }=\beta_{0}+\delta_{0} y 81+\beta_{1} \text { nearinc }+\delta_{1} y 81 \cdot \text { nearinc }+u, \hspace{1cm} (3) \]
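where \(y81\) is a dummy variable equal to one if the house was sold in 1981, using the data pooled over both years. The coefficient \(\delta_{1}\) on the interaction term is the DiD parameter, and its regression standard error is the one we need for the test.
The estimates of Equation (3) are presented in the table below.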
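It may help to read Equation (3) cell by cell. Assuming the error term has mean zero within each of the four location-by-year groups (an assumption made here for the derivation), and with \(y81 = 1\) for houses sold in 1981, the expected price in each group is

\[ \begin{aligned} E[\text{rprice} \mid \text{far}, 1978] &= \beta_{0} \\ E[\text{rprice} \mid \text{far}, 1981] &= \beta_{0}+\delta_{0} \\ E[\text{rprice} \mid \text{near}, 1978] &= \beta_{0}+\beta_{1} \\ E[\text{rprice} \mid \text{near}, 1981] &= \beta_{0}+\delta_{0}+\beta_{1}+\delta_{1} \end{aligned} \]

Taking the near-minus-far difference in each year and then subtracting the 1978 difference from the 1981 difference gives

\[ \left[(\beta_{0}+\delta_{0}+\beta_{1}+\delta_{1})-(\beta_{0}+\delta_{0})\right]-\left[(\beta_{0}+\beta_{1})-\beta_{0}\right]=\delta_{1}, \]

which is the population counterpart of Equation (2).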
did <- lm(rprice ~ y81 + nearinc + y81*nearinc, data=kielmc)
stargazer(did, type="text")
===============================================
                        Dependent variable:
                    ---------------------------
                              rprice
-----------------------------------------------
y81                        18,790.290***
                            (4,050.065)
nearinc                   -18,824.370***
                            (4,875.322)
y81:nearinc                 -11,863.900
                            (7,456.646)
Constant                   82,517.230***
                            (2,726.910)
-----------------------------------------------
Observations                    321
R2                             0.174
Adjusted R2                    0.166
Residual Std. Error    30,242.900 (df = 317)
F Statistic          22.251*** (df = 3; 317)
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01
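The interaction coefficient in the table carries no significance stars. A quick way to see why is to form the t statistic by hand. The sketch below uses the estimate, standard error and residual degrees of freedom copied from the table above; the variable names are illustrative.

```r
# Values copied from the stargazer output for the pooled regression
est <- -11863.90   # coefficient on y81:nearinc
se  <- 7456.646    # its standard error
df  <- 317         # residual degrees of freedom

# t statistic and two-sided p-value for H0: delta_1 = 0
t_stat <- est / se
p_val  <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval for delta_1
ci <- est + c(-1, 1) * qt(0.975, df) * se

print(round(c(t = t_stat, p = p_val), 3))
print(round(ci, 1))
```

The t statistic is roughly -1.59, so the two-sided p-value is above 0.10 and the 95% confidence interval includes zero. In the level specification we therefore cannot reject the hypothesis that the incinerator had no effect.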
In the level specification above, the estimate of \(\delta_{1}\) has the expected negative sign, but it is not statistically significant at conventional levels (its t statistic is about -1.59). These results can be improved. A logarithmic specification is more plausible since it implies a constant percentage effect on the house values (see column (1) of the table below). We can also add control variables. Kiel and McClain (1995) incorporated control variables for two compelling reasons. Firstly, the types of homes sold near the incinerator in 1981 might have been systematically different from those sold in the same area in 1978; if this is the case, it’s crucial to control for such characteristics. Secondly, even if the relevant house characteristics remained unchanged, including them can significantly reduce the error variance, which can subsequently decrease the standard error of \(\hat{\delta}_{1}\). In column (2), we control for the age of the houses (using a quadratic), the distance to the interstate, the lot size, the size of the house, and the number of rooms and bathrooms. This considerably increases the \(R\)-squared (by reducing the residual variance). The estimate of \(\delta_{1}\) is now much larger in magnitude, its standard error is lower, and as a result it is statistically significant at the 5% level. Thus, using the logarithmic form and control variables, we estimate that houses near the incinerator lost about \(13.2 \%\) of their value relative to houses farther away.
did1 <- lm(log(rprice) ~ nearinc + y81 + nearinc*y81, data=kielmc)
did2 <- lm(log(rprice) ~ nearinc + y81 + nearinc*y81 + age+I(age^2)+log(intst)+log(land)+log(area)+rooms+baths, data=kielmc)
stargazer(did1, did2, type="text")
====================================================================
                                  Dependent variable:
                    ------------------------------------------------
                                      log(rprice)
                              (1)                      (2)
--------------------------------------------------------------------
nearinc                    -0.340***                  0.032
                            (0.055)                  (0.047)
y81                        0.193***                 0.162***
                            (0.045)                  (0.028)
age                                                 -0.008***
                                                     (0.001)
I(age2)                                            0.00004***
                                                    (0.00001)
log(intst)                                           -0.061*
                                                     (0.032)
log(land)                                           0.100***
                                                     (0.024)
log(area)                                           0.351***
                                                     (0.051)
rooms                                               0.047***
                                                     (0.017)
baths                                               0.094***
                                                     (0.028)
nearinc:y81                 -0.063                  -0.132**
                            (0.083)                  (0.052)
Constant                   11.285***                7.652***
                            (0.031)                  (0.416)
--------------------------------------------------------------------
Observations                  321                      321
R2                           0.246                    0.733
Adjusted R2                  0.239                    0.724
Residual Std. Error     0.338 (df = 317)         0.204 (df = 310)
F Statistic         34.470*** (df = 3; 317) 84.915*** (df = 10; 310)
====================================================================
Note:                                    *p<0.1; **p<0.05; ***p<0.01
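Reading a coefficient in a log model as a percentage is an approximation that works well for small values. As a rough check, the sketch below converts the column (2) interaction coefficient into an exact percentage change using 100 * (exp(b) - 1); the coefficient and standard error are copied from the table, and the variable names are illustrative.

```r
# Values copied from column (2) of the table above
b  <- -0.132   # coefficient on nearinc:y81
se <- 0.052    # its standard error

# Approximate percentage effect: 100 times the coefficient
approx_pct <- 100 * b

# Exact percentage change implied by a shift of b in log(rprice)
exact_pct <- 100 * (exp(b) - 1)

# t statistic for H0: delta_1 = 0
t_stat <- b / se

print(round(c(approx_pct = approx_pct, exact_pct = exact_pct, t = t_stat), 2))
```

The exact figure is a drop of roughly 12.4%, slightly smaller in magnitude than the 13.2% approximation, and the t statistic of about -2.54 is consistent with significance at the 5% level.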
To consolidate your understanding, watch the following video on the basics of the simple difference-in-differences estimator.
Source: Nicolai Kuminoff
The DiD estimate is negative in every specification, and in the logarithmic model with control variables it is statistically significant at the 5% level. This indicates that the incinerator lowered the prices of nearby houses relative to houses farther away, which is consistent with the concern that such a facility reduces nearby property values.
It’s important to note that the DiD method assumes that, in the absence of the treatment, the average outcomes for the treatment and control groups would have followed the same trend over time. This assumption, known as the parallel trends assumption, cannot be tested directly and is a potential source of bias in DiD estimates.
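To see why the parallel trends assumption matters, consider a stylized example with made-up group means. The helper did_from_means and all numbers below are hypothetical and chosen only for illustration; they are not estimates from kielmc.

```r
# Hypothetical helper: DiD computed from four group means
did_from_means <- function(far_pre, far_post, near_pre, near_post) {
  (near_post - near_pre) - (far_post - far_pre)
}

true_effect  <- -10   # assumed causal effect on the near group
common_trend <- 20    # assumed price growth shared by both groups

# Case 1: parallel trends hold, so DiD recovers the true effect
did_ok <- did_from_means(
  far_pre = 100, far_post = 100 + common_trend,
  near_pre = 80, near_post = 80 + common_trend + true_effect
)

# Case 2: the near group would have grown 5 units less even without treatment,
# so DiD mixes the differential trend with the true effect
extra_trend <- -5
did_biased <- did_from_means(
  far_pre = 100, far_post = 100 + common_trend,
  near_pre = 80, near_post = 80 + common_trend + extra_trend + true_effect
)

print(c(did_ok = did_ok, did_biased = did_biased))
```

In the first case the DiD equals the assumed effect of -10. In the second it equals -15, because the estimator attributes the differential trend of -5 to the treatment.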
The DiD method provides a powerful tool for causal inference in observational studies. In this case, it allowed us to estimate the causal effect of a garbage incinerator’s location on housing prices, providing valuable evidence in discussions about the siting of potentially harmful facilities.