The NCI GDC has a complex data model that allows various studies to supply numerous clinical and demographic data elements. However, all projects that enter the GDC have some of these elements in common. This function returns a list of data.frames associated with the supplied case_ids from the GDC.
Arguments
- case_ids
a character() vector of case_ids, typically obtained from a cases() query.
- include_list_cols
logical(1), whether to include list columns in the "main" data.frame. These list columns hold values for aliquots, samples, etc. They may be useful in some situations but are generally of limited use as clinical annotations.
Value
A list of data.frames. Four are described here; the example output below also includes a follow_ups element:
main, representing basic case identification and metadata (update date, etc.)
diagnoses
exposures
demographic
Details
Note that these data.frames can, in general, have different numbers of rows (or even no rows at all). To combine them into a single data.frame, left joining each one onto the "main" data.frame yields a useful combined result. The function does not do this itself because of the potential for 1:many relationships: a case with more than one diagnosis, for example, occupies more than one row after the join. It is up to the user to determine the best approach for any given dataset.
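The left join described above can be sketched with base R's merge() and all.x = TRUE. This is a minimal illustration, not part of the package: the two data.frames are small hand-made stand-ins for the "main" and "diagnoses" elements, and every value in them is invented. Only the column names case_id, primary_site and ajcc_pathologic_stage are taken from the example output on this page.

```r
# Toy stand-ins for clinical_data$main and clinical_data$diagnoses.
# All values are made up for illustration.
main <- data.frame(
  case_id = c("case-a", "case-b", "case-c"),
  primary_site = c("Colon", "Lung", "Breast"),
  stringsAsFactors = FALSE
)
diagnoses <- data.frame(
  case_id = c("case-a", "case-a", "case-b"),
  ajcc_pathologic_stage = c("Stage I", "Stage III", "Stage II"),
  stringsAsFactors = FALSE
)

# Left join: keep every row of main and attach matching diagnoses rows.
combined <- merge(main, diagnoses, by = "case_id", all.x = TRUE)

# Compare row counts to spot 1:many expansion before using the result.
nrow(main)
nrow(combined)

# Cases that occupy more than one row after the join.
dup_cases <- unique(combined$case_id[duplicated(combined$case_id)])
dup_cases
```

Here "case-a" has two diagnoses and so appears twice in the combined data.frame, while "case-c" has none and is kept with NA in the diagnosis column. With tibbles, dplyr::left_join(main, diagnoses, by = "case_id") would be an equivalent choice.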
Examples
case_ids = cases() |> results(size=10) |> ids()
clinical_data = gdc_clinical(case_ids)
# overview of clinical results
class(clinical_data)
#> [1] "GDCClinicalList" "list"
names(clinical_data)
#> [1] "demographic" "diagnoses" "exposures" "follow_ups" "main"
sapply(clinical_data, class)
#> demographic diagnoses exposures follow_ups main
#> [1,] "tbl_df" "tbl_df" "tbl_df" "tbl_df" "tbl_df"
#> [2,] "tbl" "tbl" "tbl" "tbl" "tbl"
#> [3,] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
sapply(clinical_data, nrow)
#> demographic diagnoses exposures follow_ups main
#> 10 11 10 16 10
# available data
head(clinical_data$main)
#> # A tibble: 6 × 13
#> id lost_to_followup days_to_lost_to_foll…¹ created_datetime
#> <chr> <chr> <lgl> <chr>
#> 1 69eced5b-1e76-45c9-b… NA NA 2018-10-02T15:5…
#> 2 e3b32485-b204-43a7-9… NA NA 2019-02-19T09:2…
#> 3 4829dd8c-5445-41b3-a… NA NA 2020-07-31T09:2…
#> 4 d420e653-3fb2-432b-9… NA NA 2019-10-14T10:4…
#> 5 bfe15f44-e1dd-46ed-b… NA NA 2019-08-14T15:1…
#> 6 8b3b1f24-419e-4043-8… NA NA 2018-10-02T15:5…
#> # ℹ abbreviated name: ¹days_to_lost_to_followup
#> # ℹ 9 more variables: updated_datetime <chr>, case_id <chr>, state <chr>,
#> # disease_type <chr>, submitter_id <chr>, primary_site <chr>,
#> # index_date <chr>, days_to_consent <lgl>, consent_type <lgl>
head(clinical_data$demographic)
#> # A tibble: 6 × 22
#> cause_of_death race gender ethnicity vital_status age_at_index submitter_id
#> <chr> <chr> <chr> <chr> <chr> <lgl> <chr>
#> 1 Cancer Related white male not hisp… Dead NA HCM-BROD-00…
#> 2 NA not re… male not repo… Alive NA HCM-CSHL-00…
#> 3 Cancer Related white female Unknown Dead NA HCM-BROD-00…
#> 4 Cancer Related white female not hisp… Dead NA HCM-BROD-02…
#> 5 NA white male not hisp… Alive NA HCM-CSHL-00…
#> 6 NA white female not hisp… Alive NA HCM-CSHL-00…
#> # ℹ 15 more variables: days_to_birth <int>, created_datetime <chr>,
#> # year_of_birth <int>, premature_at_birth <lgl>,
#> # weeks_gestation_at_birth <lgl>, demographic_id <chr>,
#> # updated_datetime <chr>, age_is_obfuscated <lgl>, days_to_death <int>,
#> # state <chr>, year_of_death <lgl>, cause_of_death_source <lgl>,
#> # occupation_duration_years <lgl>, country_of_residence_at_enrollment <lgl>,
#> # case_id <chr>
head(clinical_data$diagnoses)
#> # A tibble: 6 × 108
#> case_id irs_stage iss_stage ajcc_pathologic_stage ann_arbor_clinical_s…¹
#> <chr> <lgl> <lgl> <chr> <lgl>
#> 1 69eced5b-1e7… NA NA NA NA
#> 2 e3b32485-b20… NA NA Stage III NA
#> 3 4829dd8c-544… NA NA NA NA
#> 4 d420e653-3fb… NA NA NA NA
#> 5 bfe15f44-e1d… NA NA Stage I NA
#> 6 8b3b1f24-419… NA NA Stage I NA
#> # ℹ abbreviated name: ¹ann_arbor_clinical_stage
#> # ℹ 103 more variables: created_datetime <dttm>, enneking_msts_stage <lgl>,
#> # inrg_stage <lgl>, enneking_msts_metastasis <lgl>,
#> # tissue_or_organ_of_origin <chr>, age_at_diagnosis <int>,
#> # esophageal_columnar_dysplasia_degree <lgl>, cog_liver_stage <lgl>,
#> # child_pugh_classification <lgl>, metastasis_at_diagnosis_site <lgl>,
#> # state <chr>, prior_treatment <chr>, …
head(clinical_data$exposures)
#> # A tibble: 6 × 30
#> case_id alcohol_days_per_week submitter_id alcohol_drinks_per_day
#> <chr> <lgl> <chr> <int>
#> 1 69eced5b-1e76-45c9-… NA HCM-BROD-00… NA
#> 2 e3b32485-b204-43a7-… NA HCM-CSHL-00… NA
#> 3 4829dd8c-5445-41b3-… NA HCM-BROD-00… NA
#> 4 d420e653-3fb2-432b-… NA HCM-BROD-02… NA
#> 5 bfe15f44-e1dd-46ed-… NA HCM-CSHL-00… NA
#> 6 8b3b1f24-419e-4043-… NA HCM-CSHL-00… NA
#> # ℹ 26 more variables: radon_exposure <lgl>, created_datetime <chr>,
#> # alcohol_intensity <chr>, pack_years_smoked <lgl>, asbestos_exposure <lgl>,
#> # cigarettes_per_day <lgl>, tobacco_smoking_quit_year <lgl>,
#> # tobacco_smoking_status <chr>, alcohol_history <lgl>,
#> # updated_datetime <chr>, exposure_id <chr>,
#> # tobacco_smoking_onset_year <lgl>, years_smoked <lgl>, state <chr>,
#> # secondhand_smoke_as_child <lgl>, age_at_onset <lgl>, …