Get Started

There are only 3 functions in this package:

  1. DiDge(): This function estimates DiD for a single cohort and a single event time.
  2. DiD(): This function estimates DiD for all available cohorts and event times.
  3. SimDiD(): This function simulates data.

We now demonstrate the simplest application of the 3 functions.

Detailed documentation for each of these functions is available from the Reference tab above.

0. Installation

To install the package from CRAN:

install.packages("DiDforBigData")

To install the package from GitHub:

devtools::install_github("setzler/DiDforBigData")

To use the package after it is installed:

library(DiDforBigData)

We also recommend installing these optional packages. To check that they are installed, load them:

library(progress)
library(fixest)
library(parallel)

1. Prepare Data

The package provides a simple data simulator:

sim = SimDiD(sample_size = 400, seed = 123)

# true ATTs in the simulation
print(sim$true_ATT)
#>      cohort event    ATTge
#>      <char> <num>    <num>
#>  1:    2007     0 1.000000
#>  2:    2007     1 2.000000
#>  3:    2007     2 3.000000
#>  4:    2007     3 4.000000
#>  5:    2007     4 5.000000
#>  6:    2007     5 6.000000
#>  7:    2007     6 7.000000
#>  8:    2010     0 1.500000
#>  9:    2010     1 2.500000
#> 10:    2010     2 3.500000
#> 11:    2010     3 4.500000
#> 12:    2012     0 2.000000
#> 13:    2012     1 3.000000
#> 14: Average     0 1.501672
#> 15: Average     1 2.501672
#> 16: Average     2 3.251256
#> 17: Average     3 4.251256
#> 18: Average     4 5.000000
#> 19: Average     5 6.000000
#> 20: Average     6 7.000000
#>      cohort event    ATTge

# simulated data
simdata = sim$simdata
print(simdata)
#>          id  year cohort         Y
#>       <int> <int>  <num>     <num>
#>    1:     1  2003   2010  8.773933
#>    2:     1  2004   2010  9.846116
#>    3:     1  2005   2010  9.963274
#>    4:     1  2006   2010  9.997385
#>    5:     1  2007   2010 10.060080
#>   ---                             
#> 4396:   400  2009   2007  8.035127
#> 4397:   400  2010   2007 14.438798
#> 4398:   400  2011   2007 11.973035
#> 4399:   400  2012   2007 13.033367
#> 4400:   400  2013   2007 13.552533

Your real data needs to be in this “long” format, with one row per individual and time period. It must contain variables for the individual identifier (e.g. id), the time variable (e.g. year), the cohort, i.e., the time at which treatment begins (e.g. cohort), and the outcome variable (e.g. Y). No other variables are required, and these variables can have any names you prefer.

The never-treated cohort should be coded as infinity (cohort = Inf). If the cohort value is missing (cohort = NA), then the cohort will be automatically re-coded as infinity.
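
If your data starts out in "wide" format (one column per year), it has to be reshaped first. The following is a minimal sketch using base R's reshape(); the toy data frame and its column names (Y_2009, Y_2010) are made up for illustration. The explicit recoding of NA to Inf is optional, since missing cohorts are re-coded automatically, but it makes the never-treated group visible in the data.

```r
# Hypothetical wide data: one row per individual, one outcome column per year.
wide = data.frame(
  id     = 1:3,
  cohort = c(2010, 2012, NA),   # NA marks a never-treated individual
  Y_2009 = c(1.0, 2.0, 3.0),
  Y_2010 = c(1.5, 2.1, 3.2)
)

# Reshape to long format: one row per (id, year).
long = reshape(
  wide,
  direction = "long",
  varying   = c("Y_2009", "Y_2010"),
  v.names   = "Y",
  timevar   = "year",
  times     = c(2009, 2010),
  idvar     = "id"
)

# Optional: code the never-treated cohort explicitly as infinity.
long$cohort[is.na(long$cohort)] = Inf

# Sort by individual and year, and drop the row names added by reshape().
long = long[order(long$id, long$year), ]
rownames(long) = NULL
print(long)
```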

Before going to the estimation, we need to prepare a list of the variable names:

varnames = list()
varnames$time_name = "year" 
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"
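
Because the variables can have any names, the same list works for data with different column names. The sketch below assumes a hypothetical data set with columns worker_id, quarter, treat_quarter, and earnings (none of these come from the package), and adds a small check that every name in the list actually exists in the data before estimation.

```r
# Hypothetical data with non-default column names.
mydata = data.frame(
  worker_id     = c(1, 1, 2, 2),
  quarter       = c(1, 2, 1, 2),
  treat_quarter = c(2, 2, Inf, Inf),
  earnings      = c(10.0, 12.5, 9.0, 9.4)
)

# Map each role to the corresponding column name.
varnames = list(
  time_name    = "quarter",
  outcome_name = "earnings",
  cohort_name  = "treat_quarter",
  id_name      = "worker_id"
)

# Sanity check: every listed column must be present in the data.
missing_cols = setdiff(unlist(varnames), names(mydata))
if (length(missing_cols) > 0) {
  stop("Columns not found in data: ", paste(missing_cols, collapse = ", "))
}
```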

2. Estimate DiD for a Single Cohort

We choose an event time (+3) and a cohort of treated units (2010), then estimate DiD:

did_2010 = DiDge(inputdata = simdata, varnames = varnames, 
             cohort_time = 2010, event_postperiod = 3)

print(did_2010)
#>    Cohort EventTime BaseEvent CalendarTime    ATTge  ATTge_SE Ncontrol Ntreated
#>     <num>     <num>     <num>        <num>    <num>     <num>    <int>    <int>
#> 1:   2010         3        -1         2013 4.629839 0.1962355      101      100

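The true ATT for cohort 2010 at event time 3 is 4.5 (see the table above). The estimate of 4.63 has a standard error of 0.20, so it lies within one standard error of the truth.

Note that DiDge() used event time -1 (the year before treatment begins) as the base period by default, as shown in the BaseEvent column. This is easy to change.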
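
To make the roles of the cohort, the event time, and the base period concrete, here is a hand-computed sketch of the textbook two-group, two-period DiD comparison on a tiny made-up data set. It is not the package's internal implementation: it uses only never-treated units as controls and does not compute standard errors, and all object names are illustrative.

```r
# Toy long-format data: two treated units (cohort 2010), two never-treated units.
toy = data.frame(
  id     = rep(1:4, each = 2),
  year   = rep(c(2009, 2013), times = 4),
  cohort = rep(c(2010, 2010, Inf, Inf), each = 2),
  Y      = c(10, 15, 11, 16.5, 9, 10, 12, 13.5)
)

cohort_time = 2010
event_post  = 3    # event time of interest
event_base  = -1   # base event time
base_year   = cohort_time + event_base   # 2009
post_year   = cohort_time + event_post   # 2013

# Put the base-period and post-period outcomes side by side for each unit.
both = merge(
  toy[toy$year == base_year, c("id", "cohort", "Y")],
  toy[toy$year == post_year, c("id", "Y")],
  by = "id", suffixes = c("_base", "_post")
)
both$change = both$Y_post - both$Y_base

# DiD: mean change among the treated minus mean change among the controls.
treated = both$cohort == cohort_time
did_estimate = mean(both$change[treated]) - mean(both$change[!treated])
print(did_estimate)
```

In this toy data the treated units change by 5 and 5.5 and the controls by 1 and 1.5, so the difference of the mean changes is 4.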

3. Estimate DiD for All Cohorts and Event Times

Suppose we want to estimate the ATT at each event time from -3 to +5. We can do so as follows:

did_all = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event = 5)

The output of DiD() is a list. One object in the list is results_average, which includes the average ATT across cohorts:

print(did_all$results_average)
#> Key: <EventTime>
#>    EventTime BaseEvent        ATTe    ATTe_SE Ncontrol Ntreated
#>        <num>     <num>       <num>      <num>    <int>    <int>
#> 1:        -3        -1 -0.03472821 0.10802340      603      299
#> 2:        -2        -1 -0.06416254 0.09847063      603      299
#> 3:        -1        -1  0.00000000 0.00000000      603      299
#> 4:         0        -1  1.44852075 0.10387376      603      299
#> 5:         1        -1  2.67299583 0.09964407      603      299
#> 6:         2        -1  3.17946138 0.12477922      402      199
#> 7:         3        -1  4.27349270 0.12596253      302      199
#> 8:         4        -1  4.98423853 0.17470913      201       99
#> 9:         5        -1  5.66743134 0.21029573      101       99

Another output from DiD() is results_cohort, which includes all combinations of event times and cohorts. It is too large to print here, so let’s just print the results for event times 1 and 2:

print(did_all$results_cohort[EventTime==1 | EventTime==2])
#>    Cohort EventTime BaseEvent CalendarTime    ATTge  ATTge_SE Ncontrol Ntreated
#>     <num>     <num>     <num>        <num>    <num>     <num>    <int>    <int>
#> 1:   2007         1        -1         2008 2.263430 0.1498733      301       99
#> 2:   2007         2        -1         2009 3.083096 0.1666782      301       99
#> 3:   2010         1        -1         2011 2.474058 0.1733037      201      100
#> 4:   2010         2        -1         2012 3.274863 0.1863323      101      100
#> 5:   2012         1        -1         2013 3.277404 0.2117916      101      100

Note: the simulated data ends in 2013, so event time 2 is not available for treatment cohort 2012.
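
The cohort-level and average results can be related by hand. In the output shown here, weighting each cohort's ATTge by its Ntreated reproduces the ATTe values in results_average for event times 1 and 2, with cohort 2012 contributing only to event time 1. The sketch below copies the printed numbers into a data frame; it is a check on this particular output, not a statement of how DiD() computes the average internally.

```r
# Values copied from the printed results_cohort for event times 1 and 2.
cohort_results = data.frame(
  Cohort    = c(2007, 2007, 2010, 2010, 2012),
  EventTime = c(1, 2, 1, 2, 1),
  ATTge     = c(2.263430, 3.083096, 2.474058, 3.274863, 3.277404),
  Ntreated  = c(99, 99, 100, 100, 100)
)

# Treated-count-weighted mean of ATTge within each event time.
weighted_avg = sapply(
  split(cohort_results, cohort_results$EventTime),
  function(d) weighted.mean(d$ATTge, d$Ntreated)
)

# Values copied from the printed results_average (event times 1 and 2).
reported = c(2.67299583, 3.17946138)

# Agreement up to the rounding of the printed inputs.
print(abs(weighted_avg - reported) < 1e-5)
```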

To take an average across multiple event times, use the Esets argument. It accepts a list, in which each item is a vector of event times over which to average:

did_all = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event = 5, 
              Esets = list(c(1,2), c(1,2,3)))
print(did_all$results_Esets)
#>      Eset ATT_Eset ATT_Eset_SE
#>    <char>    <num>       <num>
#> 1:    1,2 2.926229  0.08930124
#> 2:  1,2,3 3.375317  0.08822397
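
As a quick check on this output, the point estimates in results_Esets coincide with the simple mean of the ATTe values reported in results_average for the event times in each set. The sketch below copies the printed numbers; it only checks the point estimates shown here, and the standard errors cannot be reproduced this way.

```r
# ATTe values copied from the printed results_average, named by event time.
ATTe = c("1" = 2.67299583, "2" = 3.17946138, "3" = 4.27349270)

# The same sets of event times passed to the Esets argument.
Esets = list(c(1, 2), c(1, 2, 3))

# Simple mean of ATTe over each set of event times.
eset_means = sapply(Esets, function(e) mean(ATTe[as.character(e)]))

# Values copied from the printed results_Esets.
reported = c(2.926229, 3.375317)

# Agreement up to the rounding of the printed values.
print(abs(eset_means - reported) < 1e-5)
```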