Working with cohorts

Cohorts are a fundamental building block for observational health data analysis. A “cohort” is a set of persons satisfying one or more inclusion criteria for a duration of time. If you are familiar with the idea of sets in math, then a cohort can be nicely represented as a set of person-days. In the OMOP Common Data Model we represent cohorts using a table with four columns.

An example cohort table
cohort_definition_id subject_id cohort_start_date cohort_end_date
1 1000 2020-01-01 2020-05-01
1 1000 2021-06-01 2021-07-01
1 2000 2020-03-01 2020-09-01
2 1000 2020-02-01 2020-03-01

A cohort table can contain multiple cohorts and each cohort can have multiple persons. There can even be multiple records for the same person in a single cohort as long as the date ranges do not overlap. In the same way that an element is either in a set or not, a single person-day is either in a cohort or not. For a more comprehensive treatment of cohorts in OHDSI check out the Cohorts chapter in The Book of OHDSI.
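The person-day view can be made concrete with a small sketch in base R. The table below is made up for illustration, and the code assumes that both the start and the end date count as days in the cohort. Each record is expanded into one row per person-day; if no records overlap within a cohort, the expanded table contains no duplicate rows.

```r
# Toy cohort table (made-up values) with the four standard columns
cohort <- data.frame(
  cohort_definition_id = c(1L, 1L, 2L),
  subject_id           = c(1000L, 2000L, 1000L),
  cohort_start_date    = as.Date(c("2020-01-01", "2020-01-03", "2020-01-02")),
  cohort_end_date      = as.Date(c("2020-01-03", "2020-01-04", "2020-01-02"))
)

# Expand every record into one row per person-day
# (assumption: start and end dates are both included)
person_days <- do.call(rbind, lapply(seq_len(nrow(cohort)), function(i) {
  data.frame(
    cohort_definition_id = cohort$cohort_definition_id[i],
    subject_id           = cohort$subject_id[i],
    day = seq(cohort$cohort_start_date[i], cohort$cohort_end_date[i], by = "day")
  )
}))

nrow(person_days)                # 3 + 2 + 1 = 6 person-days
anyDuplicated(person_days) == 0  # TRUE when no records overlap within a cohort
```

A person-day appears at most once per cohort, which is the set property described above.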

Cohort Generation

The \(n \times 4\) cohort table is created through the process of cohort generation. To generate a cohort on a specific CDM dataset means that we combine a cohort definition with the CDM data to produce a cohort table. The standardization provided by the OMOP CDM allows researchers to generate the same cohort definition on any OMOP CDM dataset.

A cohort definition is an expression of the rules governing the inclusion/exclusion of person-days in the cohort. There are three common ways to create cohort definitions for the OMOP CDM.

  1. The Atlas cohort builder

  2. The Capr R package

  3. Custom SQL and/or R code

Atlas is a web application that provides a graphical user interface for creating cohort definitions. To get started with Atlas check out the free course on EHDEN Academy and the demo at https://atlas-demo.ohdsi.org/.

Capr is an R package that provides a code-based interface for creating cohort definitions. The options available in Capr exactly match the options available in Atlas and the resulting cohort tables should be identical.

There are times when more customization is needed and it is possible to use bespoke SQL or dplyr code to build a cohort. CDMConnector provides the generate_concept_cohort_set function for quickly building simple cohorts that can then be a starting point for further subsetting.

Atlas cohorts are represented using json text files. To “generate” one or more Atlas cohorts on a cdm object use the read_cohort_set function to first read a folder of Atlas cohort json files into R. Then create the cohort table with generate_cohort_set. There can be an optional csv file called “CohortsToCreate.csv” in the folder that specifies the cohort IDs and names to use. If this file doesn’t exist IDs will be assigned automatically using alphabetical order of the filenames.

path_to_cohort_json_files <- system.file("cohorts1", package = "CDMConnector")
list.files(path_to_cohort_json_files)
#> [1] "CohortsToCreate.csv"                     
#> [2] "cerebral_venous_sinus_thrombosis_01.json"
#> [3] "deep_vein_thrombosis_01.json"

readr::read_csv(file.path(path_to_cohort_json_files, "CohortsToCreate.csv"),
                show_col_types = FALSE)
#> # A tibble: 2 × 3
#>   cohortId cohortName                          jsonPath                         
#>      <dbl> <chr>                               <chr>                            
#> 1        1 cerebral_venous_sinus_thrombosis_01 ./cerebral_venous_sinus_thrombos…
#> 2        2 deep_vein_thrombosis_01             ./deep_vein_thrombosis_01.json
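When no “CohortsToCreate.csv” file is present, IDs follow the alphabetical order of the filenames. The sketch below only illustrates that ordering rule in base R with the two filenames shown above; it is not the package's actual implementation, and the derived cohort names are an assumption made for the example.

```r
# Filenames as they might appear in a folder without CohortsToCreate.csv
json_files <- c("deep_vein_thrombosis_01.json",
                "cerebral_venous_sinus_thrombosis_01.json")

# Sort alphabetically, then number the files in that order
sorted_files <- sort(json_files)
assigned <- data.frame(
  cohort_definition_id = seq_along(sorted_files),
  cohort_name          = tools::file_path_sans_ext(sorted_files)
)

assigned
```

Here the cerebral venous sinus thrombosis file sorts first and would receive ID 1.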

Atlas cohort definitions

First we need to create our CDM object. Note that we will need to specify a write_schema when creating the object. Cohort tables will go into the CDM’s write_schema.

library(CDMConnector)
library(dplyr, warn.conflicts = FALSE) # provides mutate() used below
path_to_cohort_json_files <- system.file("example_cohorts", 
                                         package = "CDMConnector")
list.files(path_to_cohort_json_files)
#> [1] "GIBleed_male.json"    "GiBleed_default.json"

con <- DBI::dbConnect(duckdb::duckdb(), eunomia_dir("GiBleed"))
cdm <- cdm_from_con(con, cdm_name = "eunomia", cdm_schema = "main", write_schema = "main")

cohort_details <- read_cohort_set(path_to_cohort_json_files) |>
  mutate(cohort_name = snakecase::to_snake_case(cohort_name))

cohort_details
#> # A tibble: 2 × 5
#>   cohort_definition_id cohort_name     cohort       json   cohort_name_snakecase
#>                  <int> <chr>           <list>       <list> <chr>                
#> 1                    1 gibleed_male    <named list> <chr>  gibleed_male         
#> 2                    2 gibleed_default <named list> <chr>  gibleed_default

cdm <- generate_cohort_set(
  cdm = cdm, 
  cohort_set = cohort_details,
  name = "study_cohorts"
)
#> ℹ Generating 2 cohorts
#> ℹ Generating cohort (1/2) - gibleed_male)
#> ✔ Generating cohort (1/2) - gibleed_male) [167ms]
#> 
#> ℹ Generating cohort (2/2) - gibleed_default)
#> ✔ Generating cohort (2/2) - gibleed_default) [61ms]
#> 

cdm$study_cohorts
#> # Source:   table<study_cohorts> [?? x 4]
#> # Database: DuckDB v0.10.0 [root@Darwin 23.0.0:R 4.3.1//var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T//RtmpQWxFWw/file14d177f2ba701.duckdb]
#>    cohort_definition_id subject_id cohort_start_date cohort_end_date
#>                   <int>      <dbl> <date>            <date>         
#>  1                    1       1737 1950-09-18        2019-05-24     
#>  2                    1       1201 2008-01-19        2018-10-31     
#>  3                    1       3483 1989-01-07        2018-09-22     
#>  4                    1       3064 1975-06-26        2018-10-19     
#>  5                    1       1965 2010-05-27        2019-02-17     
#>  6                    1        936 2002-05-22        2018-11-16     
#>  7                    1       5111 1980-02-02        1984-12-10     
#>  8                    1        962 1995-07-09        2019-06-14     
#>  9                    1        935 1972-07-04        1994-06-01     
#> 10                    1       3547 2014-07-17        2018-08-06     
#> # ℹ more rows

The generated cohort has associated metadata tables. We can access these with utility functions.

cohort_count(cdm$study_cohorts)
#> # A tibble: 2 × 3
#>   cohort_definition_id number_records number_subjects
#>                  <int>          <int>           <int>
#> 1                    1            237             237
#> 2                    2            479             479
cohort_set(cdm$study_cohorts)
#> Warning: `cohort_set()` was deprecated in CDMConnector 1.3.
#> ℹ Please use `settings()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> # A tibble: 2 × 2
#>   cohort_definition_id cohort_name    
#>                  <int> <chr>          
#> 1                    1 gibleed_male   
#> 2                    2 gibleed_default
attrition(cdm$study_cohorts)
#> # A tibble: 4 × 7
#>   cohort_definition_id number_records number_subjects reason_id reason          
#>                  <int>          <int>           <int>     <int> <chr>           
#> 1                    1            479             479         1 Qualifying init…
#> 2                    1            237             237         2 Male            
#> 3                    1            237             237         3 30 days prior o…
#> 4                    2            479             479         1 Qualifying init…
#> # ℹ 2 more variables: excluded_records <int>, excluded_subjects <int>
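To see what the record and subject counts represent, the following base R sketch computes the same two quantities by hand on a small made-up local table. It is only an illustration of the arithmetic, not how the package computes its metadata.

```r
# Small made-up cohort table: subject 1000 has two records in cohort 1
cohort <- data.frame(
  cohort_definition_id = c(1L, 1L, 1L, 2L),
  subject_id           = c(1000L, 1000L, 2000L, 1000L),
  cohort_start_date    = as.Date(c("2020-01-01", "2021-06-01", "2020-03-01", "2020-02-01")),
  cohort_end_date      = as.Date(c("2020-05-01", "2021-07-01", "2020-09-01", "2020-03-01"))
)

# Records: number of rows per cohort; subjects: number of distinct persons per cohort
number_records  <- tapply(cohort$subject_id, cohort$cohort_definition_id, length)
number_subjects <- tapply(cohort$subject_id, cohort$cohort_definition_id,
                          function(x) length(unique(x)))

counts <- data.frame(
  cohort_definition_id = as.integer(names(number_records)),
  number_records       = as.integer(number_records),
  number_subjects      = as.integer(number_subjects)
)

counts
```

Cohort 1 has three records but only two subjects, because one person contributes two non-overlapping records.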

Note that this cohort table is still in the database, so it can be quite large. We can also join it to other CDM tables or subset the entire cdm to just the persons in the cohort.

cdm_gibleed <- cdm %>% 
  cdm_subset_cohort(cohort_table = "study_cohorts")
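The two operations mentioned above, joining a cohort to another table and restricting a table to cohort members, can be sketched on local data frames with base R. The tables and values here are made up; on a real cdm the same idea would be expressed with dplyr joins on the database tables.

```r
# Made-up cohort and person tables
cohort <- data.frame(
  cohort_definition_id = 1L,
  subject_id           = c(1L, 3L),
  cohort_start_date    = as.Date(c("2010-01-01", "2012-05-01")),
  cohort_end_date      = as.Date(c("2010-02-01", "2012-06-01"))
)
person <- data.frame(
  person_id     = 1:4,
  year_of_birth = c(1950L, 1960L, 1970L, 1980L)
)

# Join: add person-level information to each cohort record
cohort_with_birth <- merge(cohort, person, by.x = "subject_id", by.y = "person_id")

# Subset: keep only the persons who appear in the cohort
person_subset <- person[person$person_id %in% cohort$subject_id, ]

cohort_with_birth$year_of_birth  # 1950 1970
person_subset$person_id          # 1 3
```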

Capr cohort definitions

Capr allows us to use R code to create the same cohorts that can be created in Atlas. This is helpful when you need to create a large number of similar cohort definitions. Below we create two cohort definitions based on the same concept set: one for all persons and one restricted to males.

generate_cohort_set will accept a named list of Capr cohort definitions.

library(Capr)
#> Loading required package: CirceR
#> 
#> Attaching package: 'Capr'
#> The following object is masked from 'package:ggplot2':
#> 
#>     unit
#> The following object is masked from 'package:CDMConnector':
#> 
#>     attrition

gibleed_concept_set <- cs(192671, name = "gibleed")

gibleed_definition <- cohort(
  entry = conditionOccurrence(gibleed_concept_set)
)

gibleed_male_definition <- cohort(
  entry = conditionOccurrence(gibleed_concept_set, male())
)

# create a named list of Capr cohort definitions
cohort_details <- list(gibleed = gibleed_definition,
                       gibleed_male = gibleed_male_definition)

# generate cohorts
cdm <- generate_cohort_set(
  cdm,
  cohort_set = cohort_details,
  name = "gibleed" # name for the cohort table in the cdm
)
#> ℹ Generating 2 cohorts
#> ℹ Generating cohort (1/2) - gibleed)
#> ✔ Generating cohort (1/2) - gibleed) [41ms]
#> 
#> ℹ Generating cohort (2/2) - gibleed_male)
#> ✔ Generating cohort (2/2) - gibleed_male) [42ms]
#> 

cdm$gibleed
#> # Source:   table<gibleed> [?? x 4]
#> # Database: DuckDB v0.10.0 [root@Darwin 23.0.0:R 4.3.1//var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T//RtmpQWxFWw/file14d177f2ba701.duckdb]
#>    cohort_definition_id subject_id cohort_start_date cohort_end_date
#>                   <int>      <dbl> <date>            <date>         
#>  1                    1       2836 2019-03-21        2019-03-22     
#>  2                    1       3547 2014-07-17        2018-08-06     
#>  3                    1       4337 2006-04-26        2018-07-27     
#>  4                    1        935 1972-07-04        1994-06-01     
#>  5                    1       3561 1944-01-20        2018-11-05     
#>  6                    1        962 1995-07-09        2019-06-14     
#>  7                    1       1691 1985-10-20        2018-08-23     
#>  8                    1       1737 1950-09-18        2019-05-24     
#>  9                    1       1844 2002-12-16        2018-11-21     
#> 10                    1       1137 1992-01-13        2013-10-26     
#> # ℹ more rows

We should get the exact same result from Capr and Atlas if the definitions are equivalent.
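One way to check such a claim is to compare the two cohort tables after putting their rows in a canonical order. The helpers below, normalise_cohort and same_cohort, are hypothetical names introduced for this sketch and are not part of CDMConnector or Capr. The example runs on made-up local data frames; with real cohort tables you would first bring both into R, for example with dplyr::collect().

```r
# Hypothetical helper: keep the four cohort columns and sort the rows
normalise_cohort <- function(x) {
  cols <- c("cohort_definition_id", "subject_id",
            "cohort_start_date", "cohort_end_date")
  x <- as.data.frame(x)[, cols]
  x <- x[do.call(order, unname(as.list(x))), ]
  rownames(x) <- NULL
  x
}

# Hypothetical helper: TRUE when two cohort tables hold the same records
same_cohort <- function(a, b) {
  isTRUE(all.equal(normalise_cohort(a), normalise_cohort(b)))
}

# Made-up results standing in for an Atlas-based and a Capr-based cohort
atlas_result <- data.frame(
  cohort_definition_id = c(1L, 1L),
  subject_id           = c(2L, 1L),
  cohort_start_date    = as.Date(c("2001-01-01", "2000-01-01")),
  cohort_end_date      = as.Date(c("2001-02-01", "2000-02-01"))
)
capr_result <- atlas_result[2:1, ]  # same records, different row order

capr_changed <- capr_result
capr_changed$cohort_end_date[1] <- as.Date("2001-03-01")

same_cohort(atlas_result, capr_result)   # TRUE
same_cohort(atlas_result, capr_changed)  # FALSE
```

Sorting before comparing matters because a database does not guarantee row order.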

Learn more about Capr at the package website https://ohdsi.github.io/Capr/.

DBI::dbDisconnect(con, shutdown = TRUE)

Subset a cohort

Suppose you have a generated cohort and you would like to create a new cohort that is a subset of the first. This can be done using dplyr together with the new_generated_cohort_set function.

First we will generate an example cohort set and then create a new cohort based on filtering the Atlas cohort.

library(CDMConnector)
con <- DBI::dbConnect(duckdb::duckdb(), eunomia_dir())
cdm <- cdm_from_con(con, cdm_schema = "main", write_schema = "main")

cohort_set <- read_cohort_set(system.file("cohorts3", package = "CDMConnector"))


cdm <- generate_cohort_set(cdm, cohort_set, name = "cohort") 
#> ℹ Generating 5 cohorts
#> ℹ Generating cohort (1/5) - gibleed_all)
#> ✔ Generating cohort (1/5) - gibleed_all) [43ms]
#> 
#> ℹ Generating cohort (2/5) - gibleed_all_end_10)
#> ✔ Generating cohort (2/5) - gibleed_all_end_10) [47ms]
#> 
#> ℹ Generating cohort (3/5) - gibleed_default)
#> ✔ Generating cohort (3/5) - gibleed_default) [43ms]
#> 
#> ℹ Generating cohort (4/5) - gibleed_default_with_descendants)
#> ✔ Generating cohort (4/5) - gibleed_default_with_descendants) [44ms]
#> 
#> ℹ Generating cohort (5/5) - gibleed_end_10)
#> ✔ Generating cohort (5/5) - gibleed_end_10) [44ms]
#> 

cdm$cohort
#> # Source:   table<cohort> [?? x 4]
#> # Database: DuckDB v0.10.0 [root@Darwin 23.0.0:R 4.3.1//var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T//RtmpQWxFWw/file14d17282faeb9.duckdb]
#>    cohort_definition_id subject_id cohort_start_date cohort_end_date
#>                   <int>      <dbl> <date>            <date>         
#>  1                    1       2206 2008-04-09        2018-03-06     
#>  2                    1       3232 2004-07-15        2018-10-15     
#>  3                    1       1934 2000-09-08        2019-03-09     
#>  4                    1       2138 1990-07-14        2018-08-05     
#>  5                    1        951 1985-11-07        2019-01-22     
#>  6                    1       4829 2012-07-04        2019-04-07     
#>  7                    1       2328 1999-06-28        2019-01-21     
#>  8                    1       4253 2018-10-25        2018-10-26     
#>  9                    1       5245 2016-01-31        2018-01-29     
#> 10                    1       4320 2010-03-23        2019-03-03     
#> # ℹ more rows


cohort_count(cdm$cohort)
#> # A tibble: 5 × 3
#>   cohort_definition_id number_records number_subjects
#>                  <int>          <int>           <int>
#> 1                    1            479             479
#> 2                    2            479             479
#> 3                    3            479             479
#> 4                    4            479             479
#> 5                    5            479             479

As an example we will keep only the records with a cohort duration of at least 4 weeks (28 days). Using dplyr we can write this query and save the result in a new table in the cdm.

library(dplyr)

cdm$cohort_subset <- cdm$cohort %>% 
  # only keep persons who are in the cohort at least 28 days
  filter(!!datediff("cohort_start_date", "cohort_end_date") >= 28) %>% 
  # optionally you can modify the cohort_id
  mutate(cohort_definition_id = 100 + cohort_definition_id) %>% 
  compute(name = "cohort_subset", temporary = FALSE, overwrite = TRUE) %>% 
  new_generated_cohort_set()
#> Warning: `new_generated_cohort_set()` was deprecated in CDMConnector 1.3.
#> ℹ Please use `newCohortTable()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

cdm$cohort_subset

cohort_count(cdm$cohort_subset)
#> # A tibble: 3 × 3
#>   cohort_definition_id number_records number_subjects
#>                  <int>          <int>           <int>
#> 1                  101            466             466
#> 2                  103            466             466
#> 3                  104            466             466

In this case we can see that cohorts 2 and 5 were dropped completely and some patients were dropped from cohorts 1, 3, and 4.

Let’s confirm that everyone in cohorts 2 and 5 was in the cohort for less than 28 days.

days_in_cohort <- cdm$cohort %>% 
  filter(cohort_definition_id %in% c(2, 5)) %>% 
  mutate(days_in_cohort = !!datediff("cohort_start_date", "cohort_end_date")) %>% 
  count(cohort_definition_id, days_in_cohort) %>% 
  collect()

days_in_cohort

Cohorts 2 and 5 (gibleed_all_end_10 and gibleed_end_10) are the ones that the 28-day filter removed entirely, so every duration returned by this query should be below 28 days.
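The filter-and-renumber logic can be tried without a database. The sketch below uses base R on a made-up local table and assumes that the duration is simply the end date minus the start date in days, which is how datediff is used above.

```r
# Made-up cohort table: cohort 2 has fixed 10-day records
cohort <- data.frame(
  cohort_definition_id = c(1L, 1L, 2L, 2L),
  subject_id           = c(1L, 2L, 1L, 2L),
  cohort_start_date    = as.Date(rep("2020-01-01", 4)),
  cohort_end_date      = as.Date(c("2020-03-01", "2020-01-15",
                                   "2020-01-11", "2020-01-11"))
)

# Duration in days (assumption: end date minus start date)
days <- as.integer(cohort$cohort_end_date - cohort$cohort_start_date)

# Keep records lasting at least 28 days and shift the cohort IDs by 100
subset_cohort <- cohort[days >= 28, ]
subset_cohort$cohort_definition_id <- 100L + subset_cohort$cohort_definition_id

days           # 60 14 10 10
subset_cohort  # a single record, now in cohort 101
```

Cohort 2 disappears entirely because none of its records reaches 28 days, mirroring what happens to the fixed-duration cohorts in the example.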

Now suppose we would like to create a new cohort table with three different versions of the cohorts in the original cohort table. We will keep persons who are in the cohort for at least 2 weeks, 3 weeks, and 4 weeks. We can simply write some custom dplyr to create the table and then call new_generated_cohort_set just like in the previous example.


cdm$cohort_subset <- cdm$cohort %>% 
  filter(!!datediff("cohort_start_date", "cohort_end_date") >= 14) %>% 
  mutate(cohort_definition_id = 10 + cohort_definition_id) %>% 
  union_all(
    cdm$cohort %>%
    filter(!!datediff("cohort_start_date", "cohort_end_date") >= 21) %>% 
    mutate(cohort_definition_id = 100 + cohort_definition_id)
  ) %>% 
  union_all(
    cdm$cohort %>% 
    filter(!!datediff("cohort_start_date", "cohort_end_date") >= 28) %>% 
    mutate(cohort_definition_id = 1000 + cohort_definition_id)
  ) %>% 
  compute(name = "cohort_subset", temporary = FALSE, overwrite = TRUE) %>% 
  new_generated_cohort_set() # this function creates the cohort object and metadata

cdm$cohort_subset %>% 
  mutate(days_in_cohort = !!datediff("cohort_start_date", "cohort_end_date")) %>% 
  group_by(cohort_definition_id) %>% 
  summarize(mean_days_in_cohort = mean(days_in_cohort, na.rm = TRUE)) %>% 
  collect() %>% 
  arrange(mean_days_in_cohort)
#> # A tibble: 9 × 2
#>   cohort_definition_id mean_days_in_cohort
#>                  <dbl>               <dbl>
#> 1                   11               7586.
#> 2                   13               7586.
#> 3                   14               7586.
#> 4                 1001               7602.
#> 5                  103               7602.
#> 6                  104               7602.
#> 7                 1003               7602.
#> 8                  101               7602.
#> 9                 1004               7602.

This is an example of creating new cohorts from existing cohorts using CDMConnector. There is a lot of flexibility with this approach. Next we will look at completely custom cohort creation which is quite similar.
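When the number of variants grows, the repeated filter/mutate/union pattern can be written as a loop over thresholds. The following base R sketch shows the idea on a made-up local table; the thresholds and ID offsets match the ones used above, while the data and variable names are assumptions for illustration.

```r
# Made-up cohort: four persons with durations of 10, 15, 25 and 40 days
cohort <- data.frame(
  cohort_definition_id = 1L,
  subject_id           = 1:4,
  cohort_start_date    = as.Date("2020-01-01"),
  cohort_end_date      = as.Date("2020-01-01") + c(10, 15, 25, 40)
)
days <- as.integer(cohort$cohort_end_date - cohort$cohort_start_date)

min_days <- c(14L, 21L, 28L)     # minimum duration for each version
offsets  <- c(10L, 100L, 1000L)  # added to the original cohort ID

# Build one filtered, renumbered copy per threshold and stack them
pieces <- Map(function(threshold, offset) {
  kept <- cohort[days >= threshold, ]
  kept$cohort_definition_id <- offset + kept$cohort_definition_id
  kept
}, min_days, offsets)
cohort_versions <- do.call(rbind, pieces)

table(cohort_versions$cohort_definition_id)  # 11: 3 records, 101: 2, 1001: 1
```

Stricter thresholds keep fewer records, so the record count drops from version to version.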

Custom Cohort Creation

Sometimes you may want to create cohorts that cannot be easily expressed using Atlas or Capr. In these situations you can implement cohort creation using SQL or R. See the chapter in The Book of OHDSI for details on using SQL to create cohorts. CDMConnector provides a helper function to build simple cohorts from a list of OMOP concepts. generate_concept_cohort_set accepts a named list of concept sets and will create cohorts based on those concept sets. While this function does not allow for inclusion/exclusion criteria in the initial definition, additional criteria can be applied “manually” after the initial generation.


library(dplyr, warn.conflicts = FALSE)

cdm <- generate_concept_cohort_set(
  cdm, 
  concept_set = list(gibleed = 192671), 
  name = "gibleed2", # name of the cohort table
  limit = "all", # use all occurrences of the concept instead of just the first
  end = 10 # set explicit cohort end date 10 days after start
)

cdm$gibleed2 <- cdm$gibleed2 %>% 
  semi_join(
    filter(cdm$person, gender_concept_id == 8507), 
    by = c("subject_id" = "person_id")
  ) %>% 
  record_cohort_attrition(reason = "Male")
  
# Capr also exports attrition() and masks the CDMConnector version once it is
# loaded, so call the CDMConnector function explicitly
CDMConnector::attrition(cdm$gibleed2)

We could visualise attrition using a package like visR.

library(visR)
gibleed2_attrition <- CDMConnector::attrition(cdm$gibleed2)  %>% 
    dplyr::select(Criteria = "reason", `Remaining N` = "number_subjects")
class(gibleed2_attrition) <- c("attrition", class(gibleed2_attrition))
visr(gibleed2_attrition)

In the above example we built a cohort table from a concept set. The cohort essentially captures patient-time based on the presence or absence of OMOP standard concept IDs. We then manually applied an inclusion criterion and recorded a new attrition record in the cohort. To learn more about this approach to building cohorts check out the PatientProfiles R package.
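The three steps (build records from a concept, apply a criterion, record attrition) can be sketched in base R on made-up local tables. This is only an illustration of the logic under stated assumptions, not the package's implementation: the tables are tiny stand-ins for condition_occurrence and person, the first attrition label is invented for the example, and 8532 is used as the gender concept for the non-male person.

```r
# Made-up stand-ins for two CDM tables
condition_occurrence <- data.frame(
  person_id            = c(1L, 1L, 2L, 3L),
  condition_concept_id = c(192671L, 192671L, 192671L, 999L),
  condition_start_date = as.Date(c("2001-01-01", "2005-06-01",
                                   "2010-03-15", "2012-01-01"))
)
person <- data.frame(
  person_id         = 1:3,
  gender_concept_id = c(8507L, 8532L, 8507L)  # 8507 = male
)

# Step 1: every occurrence of the concept becomes a record ending 10 days later
hits <- condition_occurrence[condition_occurrence$condition_concept_id == 192671L, ]
cohort <- data.frame(
  cohort_definition_id = 1L,
  subject_id           = hits$person_id,
  cohort_start_date    = hits$condition_start_date,
  cohort_end_date      = hits$condition_start_date + 10
)

# Step 2: keep only records of male persons
males <- person$person_id[person$gender_concept_id == 8507L]
cohort_male <- cohort[cohort$subject_id %in% males, ]

# Step 3: record how many records and subjects remain after each step
count_step <- function(x, reason_id, reason) {
  data.frame(
    reason_id       = reason_id,
    reason          = reason,
    number_records  = nrow(x),
    number_subjects = length(unique(x$subject_id))
  )
}
attrition_table <- rbind(
  count_step(cohort, 1L, "All concept occurrences"),  # label made up for this sketch
  count_step(cohort_male, 2L, "Male")
)

attrition_table
```

The attrition table goes from 3 records in 2 subjects to 2 records in 1 subject once the male criterion is applied.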

You can also create a generated cohort set using any method you choose. As long as the table is in the CDM database and has the four required columns it can be added to the CDM object as a generated cohort set.
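Before registering a hand-built table it can be worth checking its structure. The function below, check_cohort_table, is a hypothetical helper written for this sketch and is not part of CDMConnector or omopgenerics. It checks a local data frame for the four required columns, for Date-typed date columns, and for end dates that precede start dates.

```r
# Hypothetical helper: basic structural checks for a local cohort table
check_cohort_table <- function(x) {
  required <- c("cohort_definition_id", "subject_id",
                "cohort_start_date", "cohort_end_date")
  missing_cols <- setdiff(required, names(x))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  if (!inherits(x$cohort_start_date, "Date") ||
      !inherits(x$cohort_end_date, "Date")) {
    stop("cohort_start_date and cohort_end_date must be Date columns")
  }
  if (any(x$cohort_end_date < x$cohort_start_date)) {
    stop("cohort_end_date is before cohort_start_date in at least one row")
  }
  invisible(TRUE)
}

good <- data.frame(
  cohort_definition_id = 1L,
  subject_id           = 1L,
  cohort_start_date    = as.Date("1999-01-01"),
  cohort_end_date      = as.Date("2001-01-01")
)
check_cohort_table(good)  # passes silently

# A table whose end date precedes its start date is rejected
bad <- good
bad$cohort_end_date <- as.Date("1998-01-01")
result <- tryCatch(check_cohort_table(bad), error = function(e) conditionMessage(e))
result
```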

Suppose for example our cohort table is

cohort <- dplyr::tibble(
  cohort_definition_id = 1L,
  subject_id = 1L,
  cohort_start_date = as.Date("1999-01-01"),
  cohort_end_date = as.Date("2001-01-01")
)

cohort
#> # A tibble: 1 × 4
#>   cohort_definition_id subject_id cohort_start_date cohort_end_date
#>                  <int>      <int> <date>            <date>         
#> 1                    1          1 1999-01-01        2001-01-01

First make sure the table is in the database, then create a dplyr table reference to it and add it to the CDM object.

library(omopgenerics)
#> 
#> Attaching package: 'omopgenerics'
#> The following object is masked from 'package:Capr':
#> 
#>     attrition
#> The following objects are masked from 'package:CDMConnector':
#> 
#>     cdmName, recordCohortAttrition, uniqueTableName
#> The following object is masked from 'package:stats':
#> 
#>     filter
cdm <- insertTable(cdm = cdm, name = "cohort", table = cohort, overwrite = TRUE)

cdm$cohort
#> # Source:   table<cohort> [1 x 4]
#> # Database: DuckDB v0.10.0 [root@Darwin 23.0.0:R 4.3.1//var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T//RtmpQWxFWw/file14d17282faeb9.duckdb]
#>   cohort_definition_id subject_id cohort_start_date cohort_end_date
#>                  <int>      <int> <date>            <date>         
#> 1                    1          1 1999-01-01        2001-01-01

To make this a true generated cohort object use the newCohortTable function.

cdm$cohort <- newCohortTable(cdm$cohort)

We can see that this cohort now has the class “cohort_table” as well as the various metadata tables.

cohort_count(cdm$cohort)
#> # A tibble: 1 × 3
#>   cohort_definition_id number_records number_subjects
#>                  <int>          <int>           <int>
#> 1                    1              1               1
cohort_set(cdm$cohort)
#> # A tibble: 1 × 2
#>   cohort_definition_id cohort_name
#>                  <int> <chr>      
#> 1                    1 cohort_1
attrition(cdm$cohort)
#> # A tibble: 1 × 7
#>   cohort_definition_id number_records number_subjects reason_id reason          
#>                  <int>          <int>           <int>     <int> <chr>           
#> 1                    1              1               1         1 Initial qualify…
#> # ℹ 2 more variables: excluded_records <int>, excluded_subjects <int>

If you would like to override the attribute tables then pass additional data frames to newCohortTable.

cdm <- insertTable(cdm = cdm, name = "cohort2", table = cohort, overwrite = TRUE)
cdm$cohort2 <- newCohortTable(cdm$cohort2)
settings(cdm$cohort2)
#> # A tibble: 1 × 2
#>   cohort_definition_id cohort_name
#>                  <int> <chr>      
#> 1                    1 cohort_1

cohort_set <- data.frame(cohort_definition_id = 1L,
                         cohort_name = "made_up_cohort")
cdm$cohort2 <- newCohortTable(cdm$cohort2, cohortSetRef = cohort_set)

settings(cdm$cohort2)
#> # A tibble: 1 × 2
#>   cohort_definition_id cohort_name   
#>                  <int> <chr>         
#> 1                    1 made_up_cohort
DBI::dbDisconnect(con, shutdown = TRUE)

Cohort building is a fundamental building block of observational health analysis and CDMConnector supports different ways of creating cohorts. As long as your cohort table has the required structure and columns you can add it to the cdm with the newCohortTable function (new_generated_cohort_set in older versions of CDMConnector) and use it in any downstream OHDSI analytic packages.