---
title: "Theory"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Theory}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
header-includes:
  - \usepackage{amsmath}
  - \usepackage{amssymb}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## 1. Conceptual Framework

### 1.1 Notation and Key Concepts

- $i$: Index for an individual unit.
- $t$: Time period.
- $D_{i,t}$: Binary indicator for treatment. We assume throughout that treatment is permanent once it is first received. In other words, $D_{i,t}=1 \implies D_{i,t+1}=1$.
- $G_i$: Treatment cohort, i.e., the time at which unit $i$ first receives treatment. That is, $G_i = g \implies D_{i,t}=1, \forall t\geq g$. Note: If treatment is never received, $G_i = \infty$.
- $Y_{i,t}$: Observed outcome of interest.
- $Y_{i,t}(g)$: Counterfactual outcome if the treatment cohort were $G_i=g$.

### 1.2 Goal

Our goal is to identify the average treatment effect on the treated (ATT) for cohort $g$ at event time $e \equiv t-g$, which is defined by:

$$ \text{ATT}_{g,e} \equiv \mathbb{E}[Y_{i,g+e}(g) - Y_{i,g+e}(\infty) | G_i = g] $$

We may also be interested in the average ATT across treated cohorts for a given event time:

$$ \text{ATT}_{e} \equiv \sum_g \omega_{g,e} \text{ATT}_{g,e}, \quad \omega_{g,e} \equiv \frac{\sum_i 1\{G_i=g\}}{\sum_i 1\{G_i < \infty\}} $$

Lastly, we may be interested in the average across certain event times of the average ATT across cohorts:

$$ \text{ATT}_{E} \equiv \frac{1}{|E|} \sum_{e \in E} \text{ATT}_{e} $$

where $E$ is a set of event times, e.g., $E = \{1,2,3\}$.

### 1.3 Difference-in-differences

**Control group:** For the treated cohort $G_i = g$, let $C_{g,e}$ denote the corresponding set of units $i$ that belong to a control group.

- At a minimum, the control group must satisfy $i \in C_{g,e} \implies G_i > \max\{g, g+e\}$.
This says that the control group must belong to a later cohort than the treated cohort of interest, and must not yet have been treated by the event time of interest.

**Base event time:** We consider a reference event time $b$ from before treatment, which satisfies $b<0$.

**Difference-in-differences:** The difference-in-differences estimand is defined by,

$$ \text{DiD}_{g,e} \equiv \mathbb{E}[Y_{i,g+e} - Y_{i,g+b} | G_i = g] - \mathbb{E}[Y_{i,g+e} - Y_{i,g+b} | i \in C_{g,e}] $$

## 2. Identification

Throughout this section, our goal is to identify $\text{ATT}_{g,e}$ for some treated cohort $g$ and some event time $e$. We take the base event time $b<0$ as given.

### 2.1 Identifying Assumptions

**Parallel Trends:**

$$ \mathbb{E}[Y_{i,g+e}(\infty) - Y_{i,g+b}(\infty) | G_i = g] = \mathbb{E}[Y_{i,g+e}(\infty) - Y_{i,g+b}(\infty) | i \in C_{g,e}] $$

This says that, in the absence of treatment, the treatment and control groups would have experienced the same average change in their outcomes between event time $b$ and event time $e$.

**No Anticipation:**

$$ \mathbb{E}[ Y_{i,g+b}(g) | G_i = g] = \mathbb{E}[ Y_{i,g+b}(\infty) | G_i = g] $$

This says that, at base event time $b$, the observed outcome for the treated cohort would have been the same on average if the cohort had instead never received treatment.

### 2.2 Proof of Identification by DiD

We prove that $\text{DiD}_{g,e}$ identifies $\text{ATT}_{g,e}$ in three steps:

**Step 1:** Add and subtract $Y_{i,g+b}(\infty)$ inside the ATT definition:

$$ \text{ATT}_{g,e} \equiv \mathbb{E}[Y_{i,g+e}(g) - Y_{i,g+e}(\infty) | G_i = g] $$

$$ = \mathbb{E}[Y_{i,g+e}(g) - Y_{i,g+b}(\infty) | G_i = g] - \mathbb{E}[Y_{i,g+e}(\infty) - Y_{i,g+b}(\infty) | G_i = g] $$

**Step 2:** Assume that Parallel Trends holds.
Then, we can replace the conditioning set $G_i=g$ with the conditioning set $i \in C_{g,e}$ in the second term:

$$ \text{ATT}_{g,e} = \mathbb{E}[Y_{i,g+e}(g) - Y_{i,g+b}(\infty) | G_i = g] - \mathbb{E}[Y_{i,g+e}(\infty) - Y_{i,g+b}(\infty) | G_i = g] $$

$$ = \mathbb{E}[Y_{i,g+e}(g) - Y_{i,g+b}(\infty) | G_i = g] - \mathbb{E}[Y_{i,g+e}(\infty) - Y_{i,g+b}(\infty) | i \in C_{g,e}] $$

**Step 3:** Assume that No Anticipation holds. Then, we can replace $Y_{i,g+b}(\infty)$ with $Y_{i,g+b}(g)$ when the conditioning set is $G_i = g$:

$$ \text{ATT}_{g,e} = \mathbb{E}[Y_{i,g+e}(g) - Y_{i,g+b}(\infty) | G_i = g] - \mathbb{E}[Y_{i,g+e}(\infty) - Y_{i,g+b}(\infty) | i \in C_{g,e}] $$

$$ = \mathbb{E}[Y_{i,g+e}(g) - Y_{i,g+b}(g) | G_i = g] - \mathbb{E}[Y_{i,g+e}(\infty) - Y_{i,g+b}(\infty) | i \in C_{g,e}] $$

where the final expression is $\text{DiD}_{g,e}$. Thus, we have shown that $\text{DiD}_{g,e} = \text{ATT}_{g,e}$ if Parallel Trends and No Anticipation hold.

## 3. The `DiDge(...)` Command

$\text{DiD}_{g,e}$ is estimated in `DiDforBigData` by the `DiDge(...)` command, which is documented [here](https://setzler.github.io/DiDforBigData/reference/DiDge.html).

### 3.1 Automatic Control Group Selection

**All:** The largest valid control group is $C_{g,e} \equiv \{ i : G_i > \max\{g, g+e\}\}$. To use this control group, specify `control_group = "all"` in the `DiDge(...)` command. This option is selected by default. Two alternatives can be specified.

**Never-treated:** The never-treated control group is defined by $C_{g,e} \equiv \{ i : G_i = \infty \}$. To use this control group, specify `control_group = "never-treated"` in the `DiDge(...)` command.

**Future-treated:** The future-treated control group is defined by $C_{g,e} \equiv \{ i : G_i > \max\{g, g+e\} \text{ and } G_i < \infty\}$. To use this control group, specify `control_group = "future-treated"` in the `DiDge(...)` command.
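To fix ideas, the three control-group rules can be sketched in base R, representing the never-treated cohort as `Inf`. The helper `control_indicator` below is a hypothetical illustration of the definitions above, not the package's internal code:

```{r}
# Hypothetical sketch of the three control-group rules (not package code).
# G is a vector of cohort times, with G = Inf for never-treated units;
# g and e are the treated cohort and event time of interest.
control_indicator <- function(G, g, e, control_group = "all") {
  later <- G > max(g, g + e)  # not yet treated by calendar time g + e
  switch(control_group,
    "all"            = later,               # largest valid control group
    "never-treated"  = is.infinite(G),      # G_i = Inf
    "future-treated" = later & is.finite(G) # treated later, but eventually
  )
}

# Example: cohorts 5, 7, and never-treated, evaluated for g = 5, e = 1.
control_indicator(c(5, 7, Inf), g = 5, e = 1)  # FALSE TRUE TRUE
```

Under the `"all"` rule, both the later-treated unit and the never-treated unit qualify as controls, while the treated cohort itself is excluded.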
**Base event time:** The base event time can be specified using the `base_event` argument in `DiDge(...)`, where `base_event = -1` by default.

### 3.2 DiD Estimation for a Single $(g,e)$ Combination

The `DiDge()` command performs the following sequence of steps:

**Step 1.** Define the $(g,e)$-specific sample of treated and control units, $S_{g,e} \equiv \{G_i=g\} \cup \{i \in C_{g,e}\}$. Drop any observations that do not satisfy $i \in S_{g,e}$.

**Step 2.** Construct the within-$i$ differences $\Delta Y_{i,g+e} \equiv Y_{i,g+e} - Y_{i,g+b}$ for each $i \in S_{g,e}$.

**Step 3.** Estimate the simple linear regression $\Delta Y_{i,g+e} = \alpha_{g,e} + \beta_{g,e} 1\{G_i =g\} + \epsilon_{i,g+e}$ by OLS for $i \in S_{g,e}$.

The OLS estimate of $\beta_{g,e}$ is equivalent to $\text{DiD}_{g,e}$. The standard error provided by OLS for $\beta_{g,e}$ is equivalent to the standard error from a two-sample test of equal means for the null hypothesis

$$\mathbb{E}[\Delta Y_{i,g+e} | G_i = g] = \mathbb{E}[\Delta Y_{i,g+e} | i \in C_{g,e}] $$

which is equivalent to testing that $\text{ATT}_{g,e}=0$.

## 4. The `DiD(...)` Command

`DiDforBigData` uses the `DiD(...)` command to estimate $\text{DiD}_{g,e}$ for all available cohorts $g$ across a range of possible event times $e$; `DiD(...)` is documented [here](https://setzler.github.io/DiDforBigData/reference/DiD.html).

### 4.1 DiD Estimation for All Possible $(g,e)$ Combinations

`DiD(...)` uses the `control_group` and `base_event` arguments in the same way as `DiDge(...)`. `DiD(...)` also uses the `min_event` and `max_event` arguments to choose the minimum and maximum event times $e$ of interest. If these arguments are not specified, it assumes all possible event times are of interest. In practice, `DiD(...)` completes the following steps:

**Step 1.** Determine all possible combinations of $(g,e)$ available in the data.
The `min_event` and `max_event` arguments allow the user to restrict the minimum and maximum event times $e$ of interest.

**Step 2.** In parallel, for each $(g,e)$ combination, construct the corresponding control group $C_{g,e}$ in the same way as `DiDge(...)`. Drop any $(g,e)$ combination for which the control group is empty.

**Step 3.** Within each $(g,e)$-specific process, define the $(g,e)$-specific sample of treated and control units, $S_{g,e} \equiv \{G_i=g\} \cup \{i \in C_{g,e}\}$. Drop any observations that do not satisfy $i \in S_{g,e}$.

**Step 4.** Within each $(g,e)$-specific process, construct the within-$i$ differences $\Delta Y_{i,g+e} \equiv Y_{i,g+e} - Y_{i,g+b}$ for each $i$ that remains in the sample.

**Step 5.** Within each $(g,e)$-specific process, estimate $\Delta Y_{i,g+e} = \alpha_{g,e} + \beta_{g,e} 1\{G_i =g\} + \epsilon_{i,g+e}$ by OLS.

The OLS estimate of $\beta_{g,e}$ is equivalent to $\text{DiD}_{g,e}$. The standard error provided by OLS for $\beta_{g,e}$ is equivalent to the standard error from a two-sample test of equal means for the null hypothesis

$$\mathbb{E}[\Delta Y_{i,g+e} | G_i = g] = \mathbb{E}[\Delta Y_{i,g+e} | i \in C_{g,e}] $$

which is equivalent to testing that $\text{ATT}_{g,e}=0$. Note that $\text{ATT}_{g,e}=0$ is tested as a single hypothesis for each $(g,e)$ combination; no adjustment for multiple hypothesis testing is applied.

### 4.2 Estimate the Average DiD across Cohorts and Event Times

Aside from estimating each $\text{DiD}_{g,e}$, `DiD(...)` also estimates $\text{DiD}_{e}$ for each $e$ included in the event times of interest. To do so, `DiD(...)` completes the following steps:

**Step 1.** At the end of the parallel $(g,e)$-specific estimation described above, it returns the various $(g,e)$-specific samples of the form $S_{g,e} \equiv \{G_i=g\} \cup \{i \in C_{g,e}\}$.

**Step 2.** It defines an indicator for membership in cohort $g$, then stacks all of the samples $S_{g,e}$ that have the same $e$.
Note that the same $i$ can appear multiple times due to membership in both $S_{g_1,e}$ and $S_{g_2,e}$, so the distinct observations are distinguished by the indicators for $g$.

**Step 3.** It estimates $\Delta Y_{i,g+e} = \sum_g \left( \alpha_{g,e} + \beta_{g,e} 1\{G_i = g\} \right) 1\{i \in S_{g,e}\} + \epsilon_{i,g+e}$ by OLS on the stacked sample across $g$, where the interaction with $1\{i \in S_{g,e}\}$ makes the intercept and treatment indicator specific to each stacked sample.

**Step 4.** It constructs $\text{DiD}_e = \sum_g \omega_{g,e} \beta_{g,e}$, where $\omega_{g,e} \equiv \frac{\sum_i 1\{G_i=g\}}{\sum_i 1\{G_i < \infty\}}$. Since each $\beta_{g,e}$ is an estimate of the corresponding $\text{ATT}_{g,e}$, it follows that $\text{DiD}_e$ is an estimate of the weighted average $\text{ATT}_{e} \equiv \sum_g \omega_{g,e} \text{ATT}_{g,e}$.

**Step 5.** To test the null hypothesis that $\text{ATT}_{e} = 0$, it defines $\bar\beta_e = (\beta_{g,e})_g$ and $\bar\omega_e = (\omega_{g,e})_g$, so that $\text{DiD}_e = \bar\omega_e' \bar\beta_e$. To get the standard error for $\text{DiD}_e$, it uses the fact that $\text{Var}(\text{DiD}_e) = \bar\omega_e' \text{Var}(\bar\beta_e) \bar\omega_e$, where $\text{Var}(\bar\beta_e)$ is the usual (heteroskedasticity-robust) variance-covariance matrix of the OLS coefficients. Since the same unit $i$ can appear in multiple rows of the stacked sample, we must cluster on $i$ when estimating $\text{Var}(\bar\beta_e)$. Finally, the standard error corresponding to the null hypothesis $\text{ATT}_{e} = 0$ is $\sqrt{\text{Var}(\text{DiD}_e)}$.

A similar approach is used to estimate $\text{DiD}_{E}$, the average $\text{DiD}_{e}$ across a set of event times $E$. It again uses the fact that these average DiD parameters can be represented as a linear combination of the OLS coefficients $\beta_{g,e}$, with appropriate weights, to construct the standard error for $\text{ATT}_{E}$.
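To make the single-$(g,e)$ recipe of Section 3.2 concrete, the following self-contained sketch simulates a small staggered-adoption panel and estimates $\text{DiD}_{g,e}$ by OLS with `lm()`. It is an illustration under assumed data-generating values (cohorts, trend, and a true ATT of 2.0 are all made up for the example), not the package's implementation:

```{r}
# Illustrative sketch of Section 3.2 on simulated data; all numbers
# below are assumptions for the example, not package code.
set.seed(1)
n <- 500
G <- sample(c(4, 6, Inf), n, replace = TRUE)    # cohorts: 4, 6, never-treated
panel <- expand.grid(id = 1:n, t = 1:8)
panel$G <- G[panel$id]
true_att <- 2.0                                 # assumed treatment effect
panel$Y <- rnorm(n)[panel$id] +                 # unit fixed effect
  0.5 * panel$t +                               # common trend
  true_att * (panel$t >= panel$G) +             # permanent treatment effect
  rnorm(nrow(panel), sd = 0.5)                  # noise

g <- 4; e <- 1; b <- -1                         # cohort, event time, base event
# Step 1: treated cohort plus the "all" control group
treated <- G == g
control <- G > max(g, g + e)
S <- which(treated | control)
# Step 2: within-unit differences between t = g + e and t = g + b
# (expand.grid orders each time slice by id, so the slices align by unit)
Y_post <- panel$Y[panel$t == g + e]
Y_base <- panel$Y[panel$t == g + b]
dY <- (Y_post - Y_base)[S]
# Step 3: OLS of the difference on a treated-cohort indicator
fit <- lm(dY ~ treated[S])
coef(fit)[2]  # DiD_{g,e}, close to the assumed true ATT of 2.0
```

Because the common trend differences out, the slope coefficient recovers the assumed ATT up to sampling noise; this is exactly the two-sample comparison of mean changes described above.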