Get clinical information from GDC — gdc_clinical • GenomicDataCommons

The NCI GDC has a complex data model that allows each study to supply numerous clinical and demographic data elements. However, all projects that enter the GDC share a common core of these elements. This function returns that core as a list of data.frames associated with the given case_ids.

Usage

gdc_clinical(case_ids, include_list_cols = FALSE)

Arguments

case_ids: a character() vector of case_ids, typically from a "cases" query.
include_list_cols: logical(1), whether to include list columns in the "main" data.frame. These list columns hold values for aliquots, samples, etc. They may help in some situations, but they are rarely useful as clinical annotations.
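As a rough illustration of what a list column is and what leaving one out looks like, the sketch below builds a toy data.frame in base R. The `sample_ids` column and all values are hypothetical, and the filtering shown is only an assumption about the general idea, not a description of how gdc_clinical() is implemented.

```r
# Toy stand-in for the "main" data.frame (hypothetical values, not GDC data).
main <- data.frame(case_id = c("case-1", "case-2"), stringsAsFactors = FALSE)

# A list column: each cell can hold a vector of any length.
main$sample_ids <- list(c("s1", "s2"), "s3")

# Identify list columns and keep only the atomic ones.
is_list_col <- vapply(main, is.list, logical(1))
flat_main <- main[, !is_list_col, drop = FALSE]

names(flat_main)
#> [1] "case_id"
```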

Value

A list of data.frames, of class "GDCClinicalList":

main, representing basic case identification and metadata (update date, etc.)
diagnoses
exposures
demographic
follow_ups

Details

Note that these data.frames can, in general, have different numbers of rows (or even no rows at all). To combine them into a single data.frame, left join the others to the "main" data.frame on case_id. The function does not do this itself because a case can have several rows in another data.frame (a 1:many relationship), in which case the join repeats that case's "main" row. It is up to the user to determine the best approach for any given dataset.
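A minimal sketch of such a left join, using base R merge() on toy data.frames rather than a live GDC query. The values are hypothetical, and dplyr::left_join() would work equally well if dplyr is available.

```r
# Toy stand-ins for clinical_data$main and clinical_data$diagnoses
# (hypothetical values, not real GDC records).
main <- data.frame(
  case_id = c("case-1", "case-2", "case-3"),
  primary_site = c("Colon", "Lung", "Breast"),
  stringsAsFactors = FALSE
)
diagnoses <- data.frame(
  case_id = c("case-1", "case-1", "case-2"),
  ajcc_pathologic_stage = c("Stage I", "Stage III", "Stage II"),
  stringsAsFactors = FALSE
)

# Left join: keep every row of `main`; a case with several diagnoses is
# repeated, and a case with none gets NA in the diagnosis columns.
combined <- merge(main, diagnoses, by = "case_id", all.x = TRUE)

nrow(combined)
#> [1] 4

# Cases that appear more than once after the join (the 1:many cases).
names(which(table(combined$case_id) > 1))
#> [1] "case-1"
```

With real results, columns other than case_id that appear in both data.frames (for example state or submitter_id) are not used as join keys here, so merge() keeps both copies and appends the suffixes ".x" and ".y".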

Examples

case_ids = cases() |> results(size=10) |> ids()
clinical_data = gdc_clinical(case_ids)

# overview of clinical results
class(clinical_data)
#> [1] "GDCClinicalList" "list"           
names(clinical_data)
#> [1] "demographic" "diagnoses"   "exposures"   "follow_ups"  "main"       
sapply(clinical_data, class)
#>      demographic  diagnoses    exposures    follow_ups   main        
#> [1,] "tbl_df"     "tbl_df"     "tbl_df"     "tbl_df"     "tbl_df"    
#> [2,] "tbl"        "tbl"        "tbl"        "tbl"        "tbl"       
#> [3,] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
sapply(clinical_data, nrow)
#> demographic   diagnoses   exposures  follow_ups        main 
#>          10          11          10          16          10 

# available data
head(clinical_data$main)
#> # A tibble: 6 × 13
#>   id                    lost_to_followup days_to_lost_to_foll…¹ created_datetime
#>   <chr>                 <chr>            <lgl>                  <chr>           
#> 1 69eced5b-1e76-45c9-b… NA               NA                     2018-10-02T15:5…
#> 2 e3b32485-b204-43a7-9… NA               NA                     2019-02-19T09:2…
#> 3 4829dd8c-5445-41b3-a… NA               NA                     2020-07-31T09:2…
#> 4 d420e653-3fb2-432b-9… NA               NA                     2019-10-14T10:4…
#> 5 bfe15f44-e1dd-46ed-b… NA               NA                     2019-08-14T15:1…
#> 6 8b3b1f24-419e-4043-8… NA               NA                     2018-10-02T15:5…
#> # ℹ abbreviated name: ¹days_to_lost_to_followup
#> # ℹ 9 more variables: updated_datetime <chr>, case_id <chr>, state <chr>,
#> #   disease_type <chr>, submitter_id <chr>, primary_site <chr>,
#> #   index_date <chr>, days_to_consent <lgl>, consent_type <lgl>
head(clinical_data$demographic)
#> # A tibble: 6 × 22
#>   cause_of_death race    gender ethnicity vital_status age_at_index submitter_id
#>   <chr>          <chr>   <chr>  <chr>     <chr>        <lgl>        <chr>       
#> 1 Cancer Related white   male   not hisp… Dead         NA           HCM-BROD-00…
#> 2 NA             not re… male   not repo… Alive        NA           HCM-CSHL-00…
#> 3 Cancer Related white   female Unknown   Dead         NA           HCM-BROD-00…
#> 4 Cancer Related white   female not hisp… Dead         NA           HCM-BROD-02…
#> 5 NA             white   male   not hisp… Alive        NA           HCM-CSHL-00…
#> 6 NA             white   female not hisp… Alive        NA           HCM-CSHL-00…
#> # ℹ 15 more variables: days_to_birth <int>, created_datetime <chr>,
#> #   year_of_birth <int>, premature_at_birth <lgl>,
#> #   weeks_gestation_at_birth <lgl>, demographic_id <chr>,
#> #   updated_datetime <chr>, age_is_obfuscated <lgl>, days_to_death <int>,
#> #   state <chr>, year_of_death <lgl>, cause_of_death_source <lgl>,
#> #   occupation_duration_years <lgl>, country_of_residence_at_enrollment <lgl>,
#> #   case_id <chr>
head(clinical_data$diagnoses)
#> # A tibble: 6 × 108
#>   case_id       irs_stage iss_stage ajcc_pathologic_stage ann_arbor_clinical_s…¹
#>   <chr>         <lgl>     <lgl>     <chr>                 <lgl>                 
#> 1 69eced5b-1e7… NA        NA        NA                    NA                    
#> 2 e3b32485-b20… NA        NA        Stage III             NA                    
#> 3 4829dd8c-544… NA        NA        NA                    NA                    
#> 4 d420e653-3fb… NA        NA        NA                    NA                    
#> 5 bfe15f44-e1d… NA        NA        Stage I               NA                    
#> 6 8b3b1f24-419… NA        NA        Stage I               NA                    
#> # ℹ abbreviated name: ¹ann_arbor_clinical_stage
#> # ℹ 103 more variables: created_datetime <dttm>, enneking_msts_stage <lgl>,
#> #   inrg_stage <lgl>, enneking_msts_metastasis <lgl>,
#> #   tissue_or_organ_of_origin <chr>, age_at_diagnosis <int>,
#> #   esophageal_columnar_dysplasia_degree <lgl>, cog_liver_stage <lgl>,
#> #   child_pugh_classification <lgl>, metastasis_at_diagnosis_site <lgl>,
#> #   state <chr>, prior_treatment <chr>, …
head(clinical_data$exposures)
#> # A tibble: 6 × 30
#>   case_id              alcohol_days_per_week submitter_id alcohol_drinks_per_day
#>   <chr>                <lgl>                 <chr>                         <int>
#> 1 69eced5b-1e76-45c9-… NA                    HCM-BROD-00…                     NA
#> 2 e3b32485-b204-43a7-… NA                    HCM-CSHL-00…                     NA
#> 3 4829dd8c-5445-41b3-… NA                    HCM-BROD-00…                     NA
#> 4 d420e653-3fb2-432b-… NA                    HCM-BROD-02…                     NA
#> 5 bfe15f44-e1dd-46ed-… NA                    HCM-CSHL-00…                     NA
#> 6 8b3b1f24-419e-4043-… NA                    HCM-CSHL-00…                     NA
#> # ℹ 26 more variables: radon_exposure <lgl>, created_datetime <chr>,
#> #   alcohol_intensity <chr>, pack_years_smoked <lgl>, asbestos_exposure <lgl>,
#> #   cigarettes_per_day <lgl>, tobacco_smoking_quit_year <lgl>,
#> #   tobacco_smoking_status <chr>, alcohol_history <lgl>,
#> #   updated_datetime <chr>, exposure_id <chr>,
#> #   tobacco_smoking_onset_year <lgl>, years_smoked <lgl>, state <chr>,
#> #   secondhand_smoke_as_child <lgl>, age_at_onset <lgl>, …