---
title: "Get Started"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{DiDforBigData}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
There are only 3 functions in this package:
1. `DiDge()`: This function estimates DiD for a single cohort and a single event time.
2. `DiD()`: This function estimates DiD for all available cohorts and event times.
3. `SimDiD()`: This function simulates data.
We now demonstrate the simplest application of the 3 functions.
Detailed documentation for each of these function is available from the Reference tab above.
## 0. Installation
To install the package from CRAN:
```{r echo = TRUE, eval = FALSE, message=FALSE}
install.packages("DiDforBigData")
```
To install the package from Github:
```{r echo = TRUE, eval = FALSE, message=FALSE}
devtools::install_github("setzler/DiDforBigData")
```
To use the package after it is installed:
```{r echo = TRUE, eval = TRUE, message=FALSE}
library(DiDforBigData)
```
It is recommended to also make sure these optional packages have been installed:
```{r echo = TRUE, eval = TRUE, message=FALSE}
library(progress)
library(fixest)
library(parallel)
```
## 1. Prepare Data
I provide a simple data simulator as follows:
```{r echo=T, eval=T, message=FALSE}
sim = SimDiD(sample_size = 400, seed=123)
# true ATTs in the simulation
print(sim$true_ATT)
# simulated data
simdata = sim$simdata
print(simdata)
```
Your real data needs to have this "long" format, i.e., there need to be variables for the individual identifier (e.g. `id`), the time variable (e.g. `year`), the cohort at which treatment begins (e.g. `cohort`), and the outcome variable (e.g. `Y`). No other variables are required. These variables can have any names you prefer.
The never-treated cohort should be coded as infinity (`cohort = Inf`). If the cohort value is missing (`cohort = NA`), then the cohort will be automatically re-coded as infinity.
Before going to the estimation, we need to prepare a list of the variable names:
```{r echo=T, eval=T, message=FALSE}
varnames = list()
varnames$time_name = "year"
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"
```
## 2. Estimate DiD for a Single Cohort
We choose an event time (+3) and a cohort of treated units (2010), then estimate DiD:
```{r echo=T, eval=T, message=FALSE}
did_2010 = DiDge(inputdata = simdata, varnames = varnames,
cohort_time = 2010, event_postperiod = 3)
print(did_2010)
```
Comparing this estimate to the true ATT above, we see that the estimation performed well.
Note that it used -1 as the base year by default. This is easy to change.
## 3. Estimate DiD for All Cohorts and Event Times
Suppose we want to estimate the ATT at each event time from -3 to +5. We can do so as follows:
```{r echo=T, eval=T, message=FALSE}
did_all = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event = 5)
```
The output of DiD() is a list. One object in the list is results_average, which includes the average ATT across cohorts:
```{r echo=T, eval=T, message=FALSE}
print(did_all$results_average)
```
The other output from DiD() is results_cohort, which includes all combinations of event times and cohorts. It is too large to print here, so let's just print the results for event times 1 and 2:
```{r echo=T, eval=T, message=FALSE}
print(did_all$results_cohort[EventTime==1 | EventTime==2])
```
Note: the simulated data ends in 2013, so event time 2 is not available for treatment cohort 2012.
To take an average across multiple event times, use the `Esets` argument. It accepts a list, in which each item is a vector of event times over which to average:
```{r echo=T, eval=T, message=FALSE}
did_all = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event = 5,
Esets = list(c(1,2), c(1,2,3)))
```
```{r echo=T, eval=T, message=FALSE}
print(did_all$results_Esets)
```