Last modified: 6 May, 2019
R / Bioconductor for Everyone
Overview
Description
This workshop is intended for those with little or no experience using R or Bioconductor. In the first portion of the workshop, we will explore the basics of using RStudio, essential R data types, composing short scripts and using functions, and installing and using packages that extend base R functionality. The second portion of the workshop orients participants to the Bioconductor collection of R packages for analysis and comprehension of high-throughput genomic data. We will describe how to discover, install, and learn to use Bioconductor packages, and will explore some of the unique ways in which Bioconductor represents genomic data. The workshop will primarily be instructor-led live demos, with participants following along in their own RStudio sessions.
Pre-requisites
This workshop is meant to be introductory, and has no pre-requisites.
Participants will maximize benefit from the workshop by pursuing elementary instructions for using R, such as the introductory course from DataCamp.
Participation
Participants will use RStudio interactively as the instructor moves carefully through short examples of R and Bioconductor code.
R / Bioconductor packages used
- Base R packages, e.g.,
stats
,graphics
; ggplot2 - Essential Bioconductor packages, e.g., rtracklayer, GenomicRanges, SummarizedExperiment
Time outline
Activity | Time |
---|---|
Introduction to R | 45m |
- Using RStudio | |
- R vectors and data frames | |
- Data input and manipulation | |
- Scripts and functions | |
- Using R packages | |
Introduction to Bioconductor | 60m |
- Project history | |
- Discovering and using packages | |
- Working with objects |
Workshop goals and objectives
Learning goals
Part 1: R
- Import text (e.g., ‘comma-separate value’) files, into R.
- Perform manipulations of R data frames.
- Apply R functions for statistical analysis and visualization.
- Use R packages to extend basic functionality.
Part 2: Bioconductor
- Find, install, and learn how to use Bioconductor packages.
- Import and manipulate genomic files and Bioconductor data objects.
- Start an RNA-seq differential expression work flow.
Learning objectives
Part 1: R
- Import the ‘pData.csv’ file describing samples from an experiment.
- ‘Clean’ the pData to include only a subset of values.
- Perform a t-test assessing the relationship between ‘age’ and ‘mol.biol’ (presence of BCR/ABL) amongst samples.
- Visualize the relationship between ‘age’ and ‘mol.biol’ using base
R’s
boxplot()
function. - Load the ggplot2 package.
- Visualize the relationship between two variables using
ggplot()
.
Part 2: Bioconductor
- Discover, install, and read the vignette of the DESeq2 package.
- Discover the ‘single cell sequencing’ vignette
- Import BED and GTF files into Bioconductor
- Find regions of overlap between the BED and GTF files.
- Import a matrix and data.frame into Bioconductor’s
SummarizedExperiment
object for RNA-Seq differential expression.
Introduction to R
RStudio orientation
Very basics
R works by typing into the console. A very simple example is to add the numbers 1 and 2.
1 + 2
## [1] 3
R works best on vectors, and we can construct vectors ‘by hand’ using the function c()
(perhaps for _c_oncatenate). Here is a vector of the numbers 1, 3, and 5.
c(1, 3, 5)
## [1] 1 3 5
R has many shortcuts. A very commonly used shortcut is to create a vector representing a sequence of numbers using :
. For instance, here is a vector representing the numbers 1 through 5.
1:5
## [1] 1 2 3 4 5
It would be pretty tedious to continually have to enter vectors by hand. R allows objects such as the vector returned by c()
to be assigned to variables. The variables can then be referenced at subsequent points in the script.
x <- c(1, 3, 5)
x
## [1] 1 3 5
Since R knows about vectors, it can be very easy to transform all elements of the vector, e.g., taking the log()
of x
;
log(x)
## [1] 0.000000 1.098612 1.609438
The return value of log(x)
is itself a vector, and can be assigned to another variable.
y <- log(x)
y
## [1] 0.000000 1.098612 1.609438
Data input and manipulation
Read data files into R data.frame
objects. We start by reading a simple file into R. The file is a ‘csv’ (comma-separated value) file that could have been exported from a spreadsheet. The first few lines of the file include:
"Sample","sex","age","mol.biol"
"01005","M",53,"BCR/ABL"
"01010","M",19,"NEG"
"03002","F",52,"BCR/ABL"
"04006","M",38,"ALL1/AF4"
"04007","M",57,"NEG"
The file describes samples used in a classic microarray experiment. It consists of four columns:
Sample
: a unique identifiersex
: the sex of each patientage
: the age of each patientmol.biol
: cytological characterization of each patient, e.g.,"BCR/ABL"
indicates presence of the classic BCR/ABL translocation, while"NEG"
indicates no special cytological information.
We start by asking are to find the path to the file, using a function file.choose()
. This will open a dialog box, and we will navigate to the location of a file named "ALL-phenoData.csv"
. Print the value of fname
to view the path that you choose.
fname <- file.choose()
fname
Confirm that the file exists using the file.exists()
command.
file.exists(fname)
## [1] TRUE
Read the data from the csv file using read.csv()
; assign the input to a variable called pdata
.
pdata <- read.csv(fname)
Viewing the data.frame
We could print the entire data frame to the screen using
pdata
but a smarter way to explore the data is to ask about its dimensions (rows and columns) using dim()
, to look at the head()
and tail()
of the data, and to ask R for a brief summary()
.
dim(pdata) # 128 rows x 4 columns
## [1] 128 4
head(pdata) # first six rows
## Sample sex age mol.biol
## 1 01005 M 53 BCR/ABL
## 2 01010 M 19 NEG
## 3 03002 F 52 BCR/ABL
## 4 04006 M 38 ALL1/AF4
## 5 04007 M 57 NEG
## 6 04008 M 17 NEG
tail(pdata) # last six rows
## Sample sex age mol.biol
## 123 49004 M 24 NEG
## 124 56007 M 37 NEG
## 125 64005 M 19 NEG
## 126 65003 M 30 NEG
## 127 83001 M 29 NEG
## 128 LAL4 <NA> NA NEG
summary(pdata)
## Sample sex age mol.biol
## 01003 : 1 F :42 Min. : 5.00 ALL1/AF4:10
## 01005 : 1 M :83 1st Qu.:19.00 BCR/ABL :37
## 01007 : 1 NA's: 3 Median :29.00 E2A/PBX1: 5
## 01010 : 1 Mean :32.37 NEG :74
## 02020 : 1 3rd Qu.:45.50 NUP-98 : 1
## 03002 : 1 Max. :58.00 p15/p16 : 1
## (Other):122 NA's :5
The age
element of pdata
is a vector of integers and the summary provides a quantitative description of the data. Some values are missing, and these have the special value NA
.
The sex
and mol.biol
vectors are factors. The sex
variable has two levels (M
, F
) while the mol.biol
vector has 6 levels. A factor is a statistical concept central to describing data; the levels describe the universe of possible values that the variable can take.
Sample
is also a factor
, but it should probably be considered a character
vector used to identify each sample; we clean this up in just a minute.
In addition to summary()
, it can be helpful to use class()
to look at the class of each object or column, e.g.,
class(fname)
## [1] "character"
class(pdata)
## [1] "data.frame"
Reading data, round two
We noted that Sample
has been read by R as a factor. Consult the (complicated!) help page for read.csv()
and try to understand how to use colClasses
to specify how each column of data should be input.
To navigate to the help page, use
?read.csv
Focus on the colClasses
argument. It is a vector that describes the class each column should be interpretted as. We’d like to read the columns as character, factor, integer, factor. Start by creating a vector of desired values
c("character", "factor", "integer", "factor")
## [1] "character" "factor" "integer" "factor"
Then add another argument to our use of read.csv()
, specifying the desired column classes as this vector.
pdata <- read.csv(
fname,
colClasses = c("character", "factor", "integer", "factor")
)
summary(pdata)
## Sample sex age mol.biol
## Length:128 F :42 Min. : 5.00 ALL1/AF4:10
## Class :character M :83 1st Qu.:19.00 BCR/ABL :37
## Mode :character NA's: 3 Median :29.00 E2A/PBX1: 5
## Mean :32.37 NEG :74
## 3rd Qu.:45.50 NUP-98 : 1
## Max. :58.00 p15/p16 : 1
## NA's :5
Subsetting
A basic operation in R is to subset data. A data.frame
is a two-dimensional object, so it is subset by specifying the desired rows and columns. Several different ways of specifying rows and colums are possible. Row selection often involves numeric vectors, logical vectors, and character vectors (selecting the rows with the corresponding rownames()
). Column selection is most often with a character vector, but can also be numeric.
pdata[1:5, c("sex", "mol.biol")]
## sex mol.biol
## 1 M BCR/ABL
## 2 M NEG
## 3 F BCR/ABL
## 4 M ALL1/AF4
## 5 M NEG
pdata[1:5, c(2, 3)]
## sex age
## 1 M 53
## 2 M 19
## 3 F 52
## 4 M 38
## 5 M 57
Omitting either the row or column index selects all rows or columns
pdata[1:5, ]
## Sample sex age mol.biol
## 1 01005 M 53 BCR/ABL
## 2 01010 M 19 NEG
## 3 03002 F 52 BCR/ABL
## 4 04006 M 38 ALL1/AF4
## 5 04007 M 57 NEG
The subset operator [
generally returns the same type of object as being subset, e.g., above all return values are data.frame
objects. It is sometimes desirable to select a single column as a vector. This can be done using $
or [[
; note that some values are NA
, indicating that the patient’s age is “not available”.
pdata$age
## [1] 53 19 52 38 57 17 18 16 15 40 33 55 5 18 41 27 27 46 37 36 53 39 53
## [24] 20 44 28 58 43 48 58 19 26 19 32 17 45 20 16 51 57 29 16 32 15 NA 21
## [47] 49 38 17 26 48 16 18 17 22 47 21 54 26 19 47 18 52 27 52 18 18 23 16
## [70] NA 54 25 31 19 24 23 NA 41 37 54 18 19 43 53 50 54 53 49 20 26 22 36
## [93] 27 50 NA 31 16 48 17 40 22 30 18 22 50 41 40 28 25 16 31 14 24 19 37
## [116] 23 30 48 22 41 52 32 24 37 19 30 29 NA
pdata[["age"]]
## [1] 53 19 52 38 57 17 18 16 15 40 33 55 5 18 41 27 27 46 37 36 53 39 53
## [24] 20 44 28 58 43 48 58 19 26 19 32 17 45 20 16 51 57 29 16 32 15 NA 21
## [47] 49 38 17 26 48 16 18 17 22 47 21 54 26 19 47 18 52 27 52 18 18 23 16
## [70] NA 54 25 31 19 24 23 NA 41 37 54 18 19 43 53 50 54 53 49 20 26 22 36
## [93] 27 50 NA 31 16 48 17 40 22 30 18 22 50 41 40 28 25 16 31 14 24 19 37
## [116] 23 30 48 22 41 52 32 24 37 19 30 29 NA
We can use class()
to determine the type of each column
class(pdata$age)
## [1] "integer"
as well as other functions to manipulate or summarize the data. What do each of the following lines do?
table(pdata$mol.biol)
##
## ALL1/AF4 BCR/ABL E2A/PBX1 NEG NUP-98 p15/p16
## 10 37 5 74 1 1
table(is.na(pdata$age))
##
## FALSE TRUE
## 123 5
levels(pdata$sex)
## [1] "F" "M"
Basic data manipulations
We’ve seen (e.g., log()
, above) that numeric vectors can be transformed by mathematical functions. Similar functions are available for other data types. A very common scenario is to create logical vectors that are then used to subset objects. Here we compare each element of the sex
column to "F"
. The result indicates which rows in the pdata
correspond to indivdudals with sex "F"
.
pdata$sex == "F"
## [1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [34] FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE
## [45] NA FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## [56] TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## [78] FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
## [89] FALSE TRUE FALSE FALSE TRUE FALSE NA FALSE TRUE TRUE FALSE
## [100] TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE
## [111] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE NA
A more elaborate example identifies rows corresponding to females greater than 50 years of age. Note that some comparisons are NA
; why is that?
(pdata$sex == "F") & (pdata$age > 50)
## [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [45] NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE NA FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE NA
Another elaborate comparison is %in%
, which asks whether each element in the vector on the left-hand side of %in%
is contained in the set of values on the right-hand side, for example, which elements of the pdata$mol.biol
column are in the set "BCR/ABL"
or "NEG"
?
table( pdata$mol.biol )
##
## ALL1/AF4 BCR/ABL E2A/PBX1 NEG NUP-98 p15/p16
## 10 37 5 74 1 1
pdata$mol.biol %in% c("BCR/ABL", "NEG")
## [1] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
## [34] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [45] TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE
## [56] TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE
## [67] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
## [78] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [89] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
## [100] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [111] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [122] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The subset()
function can be used to perform the subset, e.g., selecting all records of females older than 50
subset(pdata, sex == "F" & age > 50)
## Sample sex age mol.biol
## 3 03002 F 52 BCR/ABL
## 27 16004 F 58 ALL1/AF4
## 30 20002 F 58 BCR/ABL
## 39 24011 F 51 BCR/ABL
## 58 28021 F 54 BCR/ABL
## 63 28032 F 52 ALL1/AF4
## 71 30001 F 54 BCR/ABL
## 84 57001 F 53 NEG
The return value of subset()
can be assigned to a variable, so the below creates a subset of pdata
corresponding to indivdiuals whose mol.biol
is either "BCR/ABL"
or "NEG"
; the result is assigned to a variable bcrabl
.
bcrabl <- subset(pdata, mol.biol %in% c("BCR/ABL", "NEG"))
dim( bcrabl )
## [1] 111 4
Here’s a tabular summary of the levels of the mol.biol
factor in bcrabl
table(bcrabl$mol.biol)
##
## ALL1/AF4 BCR/ABL E2A/PBX1 NEG NUP-98 p15/p16
## 0 37 0 74 0 0
Note that the factor includes levels that are not actually present in the subset. For our work below, we’d like to remove the unused levels. This can be done by calling the factor()
function
factor(bcrabl$mol.biol)
## [1] BCR/ABL NEG BCR/ABL NEG NEG NEG NEG NEG
## [9] BCR/ABL BCR/ABL NEG NEG BCR/ABL NEG BCR/ABL BCR/ABL
## [17] BCR/ABL BCR/ABL NEG BCR/ABL BCR/ABL NEG BCR/ABL NEG
## [25] BCR/ABL NEG BCR/ABL NEG BCR/ABL BCR/ABL NEG BCR/ABL
## [33] BCR/ABL BCR/ABL NEG BCR/ABL NEG NEG NEG BCR/ABL
## [41] BCR/ABL BCR/ABL NEG NEG NEG NEG BCR/ABL BCR/ABL
## [49] NEG NEG NEG NEG BCR/ABL NEG NEG NEG
## [57] NEG NEG BCR/ABL BCR/ABL NEG NEG BCR/ABL BCR/ABL
## [65] NEG NEG NEG NEG BCR/ABL NEG BCR/ABL BCR/ABL
## [73] BCR/ABL NEG NEG BCR/ABL NEG BCR/ABL BCR/ABL NEG
## [81] NEG NEG NEG NEG NEG NEG NEG NEG
## [89] NEG NEG NEG NEG NEG NEG NEG NEG
## [97] NEG NEG NEG NEG NEG NEG NEG NEG
## [105] NEG NEG NEG NEG NEG NEG NEG
## Levels: BCR/ABL NEG
We can then update the mol.biol
column of bcrabl
by assigning new values to it with <-
.
bcrabl$mol.biol <- factor(bcrabl$mol.biol)
table(bcrabl$mol.biol)
##
## BCR/ABL NEG
## 37 74
Exploratory data analysis
A very useful concept in R is the formula
. This is specified in a form like lhs ~ rhs
. The left-hand side is typically a response (dependent) variable, and the right-hand side a description of the independent variables that one is interested in. I ‘say’ a formula like age ~ mol.biol
as “age as a function of mol.biol”.
To get a sense of the use of formulas in R, and the ease with which one can explore data, use the boxplot()
function to visualize the relationship between age as a function of molecular biology.
boxplot(age ~ mol.biol, bcrabl)
Consult the help page ?boxplot
to figure out how to add “Age” as the y-axis label.
The boxplot suggests that in our sample individuals with BCR/ABL are on average older than individuals classified as NEG. Confirm this using t.test()
.
t.test(age ~ mol.biol, bcrabl)
##
## Welch Two Sample t-test
##
## data: age by mol.biol
## t = 4.8172, df = 68.529, p-value = 8.401e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 7.13507 17.22408
## sample estimates:
## mean in group BCR/ABL mean in group NEG
## 40.25000 28.07042
It might be surprising that the df
(degrees of freedom) are not a round number, but 68.592. This is because by default R performs a t-test that allows for variances to differ between groups. Consult the ?t.test
help page to figure out to perform a simpler form of the t-test, where variances are equal.
Writing R scripts
Using packages
R functions are made available in packages. All the functions that we have used so far are implemented in packages that are distributed with R. There are actually more than 15,000 R packages available for a wide variety of purposes. The general challenge we explore here is how to discover, install, and use one of these packages.
Discover
The most common place to find R packages in on the comprehensive R archive network, CRAN.
Visit CRAN at https://cran.r-project.org, click on the ‘Packages’ link, and the table of available packages sorted by name. Find the ggplot2 package and peruse the information you are presented with. What does this package do?
The large number of R packages can make finding the ‘best’ packages for a particular purpose difficult to identify. From https://cran.r-project.org, click on the ‘Task Views’ link and explore a topic interesting to you.
Install (once only per R installation)
One useful packages have been identified, they need to be installed into R. This only needs to be done once per R installation. The following installs the ggplot2 and tibble packages; you DO NOT need to do this command because ggplot2 is already installed on your AMI.
### No need to do this...
install.packages(c("ggplot2", "tibble"))
Use
Packages that are installed on your system are not available for use until they are attached to the current R session. Use library()
to attach the ggplot2 package.
library("ggplot2")
Look up help on ?ggplot
. Peruse especially the examples at the bottom of the help page. The idea is that one identifies the data set to be used, and the variables with the data to be used as the x and y ‘aesthetics’
ggplot(brcabl, aes(x = mol.biol, y = age))
One then adds ‘geoms’ to the plot, for instance the boxplot geom
ggplot(bcrabl, aes(x = mol.biol, y = age)) + geom_boxplot()
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
Use xlab()
and ylab()
to enhance the plot with more meaning x and y labels; see the help page ?xlab
, especially the example section, for hints on how to modify the graph.
The grammar of graphics implemented by ggplot2 is a very powerful and expressive system for producing appealing visualizations, and is widely used in the R community.
Introduction to Bioconductor
Bioconductor is a collection of more than 1,500 packages for the statistical analysis and comprehension of high-throughput genomic data. Originally developed for microarrays, Bioconductor packages are now used in a wide range of analyses, including bulk and single-cell RNA-seq, ChIP seq, copy number analysis, microarray methylation and classic expression analysis, flow cytometry, and many other domains.
This section of the workshop introduces the essential of Bioconductor package discovery, installation, and use.
Discovering, installing, and learning how to use Bioconductor packages
Discovery
The web site at https://bioconductor.org contains descriptions of all Bioconductor packages, as well as essential reference material for all levels of user.
Packages available in Bioconductor are summarized at https://bioconductor.org/packages, also linked from the front page of the web site. The widget on the left summarizes four distinct types of Bioconductor packages
‘Software’ packages implement particular analyses or other functionality, e.g., querying web-based resources or importing common file formats to Bioconductor objects.
‘Annotation’ packages contain data that can be helpful in placing analysis results in context, for example: mapping between gene symbols such as “BRCA1” and Ensembl or Entrez gene identifiers; classifying genes in terms of gene ontology; describing the genomic coordinates of exons, transcripts, and genes; and representing whole genome sequences of common organisms.
‘Experiment data’ packages contain highly curated data sets that can be useful for learning and teaching (e.g., the airway package and data set used in the DESeq2 package for the analysis of bulk RNA-seq differential expression) or placing results in context (e.g., the curatedTCGAData package for conveniently accessing TCGA data in a way that allows very smooth integeration into Bioconductor analyses).
‘Workflow’ packages that summarize common work flows, e.g., simpleSingleCell for single-cell RNA-seq expression analysis.
Installation
Like CRAN packages, Bioconductor packages need to be installed only once per R installation, and then attached to each session where they are going to be used.
Bioconductor packages are installed slightly differently from CRAN packages. The first step is to install the BiocManager package from CRAN.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager", repos="https://cran.r-project.org")
The next step is to install the desired Bioconductor packages. The syntax to install the rtracklayer and GenomicRanges packages is
BiocManager::install(c("rtracklayer", "GenomicRanges"))
Bioconductor packages tend to depend on one another quite alot, so it is important that the correct versions of all packages are installed. Validate your installation (not necessary during the course) with
BiocManager::valid()
A convenient function in BiocManager is available()
, which accepts a regular expression to find matching packages. The following finds all ‘TxDb’ packages (describing exon, transcript, and gene coordinates) for Homo sapiens.
BiocManager::available("TxDb.Hsapiens")
## [1] "TxDb.Hsapiens.BioMart.igis"
## [2] "TxDb.Hsapiens.UCSC.hg18.knownGene"
## [3] "TxDb.Hsapiens.UCSC.hg19.knownGene"
## [4] "TxDb.Hsapiens.UCSC.hg19.lincRNAsTranscripts"
## [5] "TxDb.Hsapiens.UCSC.hg38.knownGene"
Learning and support
Each package is linked to a ‘landing page’, e.g., DESeq2 that contains a description of the package, authors, perhaps literature citations where the software is described, and installation instructions.
An important part of Bioconductor packages are ‘vignettes’ which describe how the package is to be used. Vignettes are linked from package landing pages, and are available from within R using
browseVignettes("simpleSingleCell")
Users can get support on using packages at https://support.bioconductor.org, a question-and-answer style forum where response usually come quickly and often from very knowledgable users or the package developer. There are many additional sources of support, including course material linked from the home page.
Working with Genomic Ranges
This section introduces two useful packages for general-purpose work on genomic coordinates. The rtracklayer package provides the import()
function to read many types of genomic files (e.g., BED, GTF, VCF, FASTA) into Bioconductor objects. The GenomicRanges package provides functions for manipulating genomic ranges, i.e., descriptions of exons, genes, ChIP peaks, called variants, … as coordinates in genome space.
Start by attaching the rtracklayer and GenomicRanges packages to our session.
library("rtracklayer")
library("GenomicRanges")
Importing data
We’ll read in a BED file derived from the UCSC genome browser. The file contains the coordinates of all CpG islands in the human genome, and is described at the UCSC table browser. Here are the first few lines of the file, giving the chromosme, start and end coordinates, and identifier of each CpG island.
chr1 155188536 155192004 CpG:_361
chr1 2226773 2229734 CpG:_366
chr1 36306229 36307408 CpG:_110
chr1 47708822 47710847 CpG:_164
chr1 53737729 53739637 CpG:_221
chr1 144179071 144179313 CpG:_20
Use file.choose()
to find the file
fname <- file.choose() # CpGislands.Hsapiens.hg38.UCSC.bed
fname
## [1] "/usr/local/lib/R/site-library/BiocIntro/vignettes/CpGislands.Hsapiens.hg38.UCSC.bed"
file.exists(fname)
## [1] TRUE
Then use import()
from rtracklayer to read the data into R. The end result is a GenomicRanges
object describing each CpG island.
cpg <- import(fname)
cpg
## GRanges object with 30477 ranges and 1 metadata column:
## seqnames ranges strand | name
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 155188537-155192004 * | CpG:_361
## [2] chr1 2226774-2229734 * | CpG:_366
## [3] chr1 36306230-36307408 * | CpG:_110
## [4] chr1 47708823-47710847 * | CpG:_164
## [5] chr1 53737730-53739637 * | CpG:_221
## ... ... ... ... . ...
## [30473] chr22_KI270734v1_random 131010-132049 * | CpG:_102
## [30474] chr22_KI270734v1_random 161257-161626 * | CpG:_55
## [30475] chr22_KI270735v1_random 17221-18098 * | CpG:_100
## [30476] chr22_KI270738v1_random 4413-5280 * | CpG:_80
## [30477] chr22_KI270738v1_random 6226-6467 * | CpG:_34
## -------
## seqinfo: 254 sequences from an unspecified genome; no seqlengths
Closely compare the coordinates of the first few ranges from the file with the first few ranges in the Bioconductor representation. The BED format specification says that coordinates are 0-based, and intervals are half-open (the ‘start’ coordinate is in the range, the ‘end’ coordinate is immediately after the range; this makes some computations easy). Bioconductor’s convention is that coordinates are 1-based and closed (i.e., both start and end coordinates are included in the range). rtracklayer’s import()
function has adjusted coordinates to follow Bioconductor conventions.
Working with genomic ranges
For convenience and to illustrate functionality, let’s work only with the ‘standard’ chromosomes 1 - 22 autosomal, X, and Y chromosomes. Look up the help page ?keepStandardChromosomes
for an explanation of pruning.mode=
.
cpg <- keepStandardChromosomes(cpg, pruning.mode = "coarse")
cpg
## GRanges object with 27949 ranges and 1 metadata column:
## seqnames ranges strand | name
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 155188537-155192004 * | CpG:_361
## [2] chr1 2226774-2229734 * | CpG:_366
## [3] chr1 36306230-36307408 * | CpG:_110
## [4] chr1 47708823-47710847 * | CpG:_164
## [5] chr1 53737730-53739637 * | CpG:_221
## ... ... ... ... . ...
## [27945] chr22 50704375-50704880 * | CpG:_38
## [27946] chr22 50710878-50711294 * | CpG:_41
## [27947] chr22 50719959-50721632 * | CpG:_180
## [27948] chr22 50730600-50731304 * | CpG:_65
## [27949] chr22 50783345-50783889 * | CpG:_63
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
There are two parts to a GenomicRanges
object. The seqnames
(chromosomes, in the present case), start and end coordinates, and strand are required. Additional elements such as name
in the current example are optional.
Required components are accessed by functions such as start()
, end()
and width()
. Optional components can be accessed using the $
notation.
head( start(cpg) )
## [1] 155188537 2226774 36306230 47708823 53737730 144179072
head( cpg$name )
## [1] "CpG:_361" "CpG:_366" "CpG:_110" "CpG:_164" "CpG:_221" "CpG:_20"
Use the width()
accessor function to extract a vector of widths of each CpG island. Transform the values using log10()
, and visualize the distribution using hist()
.
hist(log10(width(cpg)))
Use subset()
to select the CpG islands on chromosomes 1 and 2.
subset(cpg, seqnames %in% c("chr1", "chr2"))
## GRanges object with 4217 ranges and 1 metadata column:
## seqnames ranges strand | name
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 155188537-155192004 * | CpG:_361
## [2] chr1 2226774-2229734 * | CpG:_366
## [3] chr1 36306230-36307408 * | CpG:_110
## [4] chr1 47708823-47710847 * | CpG:_164
## [5] chr1 53737730-53739637 * | CpG:_221
## ... ... ... ... . ...
## [4213] chr2 242003256-242004412 * | CpG:_79
## [4214] chr2 242006590-242010686 * | CpG:_263
## [4215] chr2 242045491-242045723 * | CpG:_16
## [4216] chr2 242046615-242047706 * | CpG:_170
## [4217] chr2 242088150-242089411 * | CpG:_149
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
Genomic annotations
Earlier we mentioned ‘Annotation data’ packages. An example is the TxDb.* family of packages. These packages contain information on the genomic coordinates of exons, genes, transcripts, etc. Attach the TxDb package corresponding to the Homo sapiens hg38 genome build using the UCSC ‘knownGene’ track.
library("TxDb.Hsapiens.UCSC.hg38.knownGene")
Extract the coordinates of all transcripts
tx <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
tx
## GRanges object with 226811 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr1 11869-14409 + | 1
## [2] chr1 12010-13670 + | 2
## [3] chr1 29554-31097 + | 3
## [4] chr1 30267-31109 + | 4
## [5] chr1 30366-30503 + | 5
## ... ... ... ... . ...
## [226807] chrUn_GL000220v1 155997-156149 + | 226807
## [226808] chrUn_KI270442v1 380608-380726 + | 226808
## [226809] chrUn_KI270442v1 217250-217401 - | 226809
## [226810] chrUn_KI270744v1 51009-51114 - | 226810
## [226811] chrUn_KI270750v1 148668-148843 + | 226811
## tx_name
## <character>
## [1] ENST00000456328.2
## [2] ENST00000450305.2
## [3] ENST00000473358.1
## [4] ENST00000469289.1
## [5] ENST00000607096.1
## ... ...
## [226807] ENST00000619779.1
## [226808] ENST00000620265.1
## [226809] ENST00000611690.1
## [226810] ENST00000616830.1
## [226811] ENST00000612925.1
## -------
## seqinfo: 595 sequences (1 circular) from hg38 genome
Keep only the standard chromosomes, to work smoothly with our cpg
object.
tx <- keepStandardChromosomes(tx, pruning.mode="coarse")
tx
## GRanges object with 206694 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr1 11869-14409 + | 1 ENST00000456328.2
## [2] chr1 12010-13670 + | 2 ENST00000450305.2
## [3] chr1 29554-31097 + | 3 ENST00000473358.1
## [4] chr1 30267-31109 + | 4 ENST00000469289.1
## [5] chr1 30366-30503 + | 5 ENST00000607096.1
## ... ... ... ... . ... ...
## [206690] chrM 5826-5891 - | 206690 ENST00000387409.1
## [206691] chrM 7446-7514 - | 206691 ENST00000387416.2
## [206692] chrM 14149-14673 - | 206692 ENST00000361681.2
## [206693] chrM 14674-14742 - | 206693 ENST00000387459.1
## [206694] chrM 15956-16023 - | 206694 ENST00000387461.2
## -------
## seqinfo: 25 sequences (1 circular) from hg38 genome
Overlaps
A very useful operation is to count overlaps in two distinct genomic ranges objects. The following counts the number of CpG islands that overlap each transcript. Related functions include findOverlaps()
, nearest()
, precede()
, and follow()
.
olaps <- countOverlaps(tx, cpg)
length(olaps) # 1 count per transcript
## [1] 206694
table(olaps)
## olaps
## 0 1 2 3 4 5 6 7 8 9
## 111952 76053 11809 3494 1458 684 394 253 169 92
## 10 11 12 13 14 15 16 17 18 19
## 66 40 43 30 22 20 20 7 14 7
## 20 21 22 23 24 25 26 27 28 29
## 7 3 7 6 3 1 4 1 3 1
## 30 31 32 33 34 35 36 37 38 63
## 6 5 3 3 1 2 3 1 1 4
## 65 71
## 1 1
Calculations such as countOverlaps()
can be added to the GRanges
object, tightly coupling derived data with the original annotation.
tx$cpgOverlaps <- countOverlaps(tx, cpg)
tx
## GRanges object with 206694 ranges and 3 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr1 11869-14409 + | 1 ENST00000456328.2
## [2] chr1 12010-13670 + | 2 ENST00000450305.2
## [3] chr1 29554-31097 + | 3 ENST00000473358.1
## [4] chr1 30267-31109 + | 4 ENST00000469289.1
## [5] chr1 30366-30503 + | 5 ENST00000607096.1
## ... ... ... ... . ... ...
## [206690] chrM 5826-5891 - | 206690 ENST00000387409.1
## [206691] chrM 7446-7514 - | 206691 ENST00000387416.2
## [206692] chrM 14149-14673 - | 206692 ENST00000361681.2
## [206693] chrM 14674-14742 - | 206693 ENST00000387459.1
## [206694] chrM 15956-16023 - | 206694 ENST00000387461.2
## cpgOverlaps
## <integer>
## [1] 0
## [2] 0
## [3] 1
## [4] 0
## [5] 0
## ... ...
## [206690] 0
## [206691] 0
## [206692] 0
## [206693] 0
## [206694] 0
## -------
## seqinfo: 25 sequences (1 circular) from hg38 genome
It is then possible to perform coordinated actions, e.g., subsetting the GRanges
objects for transcripts satisfying particular conditions, in a coordinated fashion where the software does all the book-keeping to makes sure the correct ranges are selected.
subset(tx, cpgOverlaps > 10)
## GRanges object with 270 ranges and 3 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr1 2050470-2185395 + | 263 ENST00000378567.7
## [2] chr1 2050485-2146108 + | 264 ENST00000468310.5
## [3] chr1 2073462-2185390 + | 269 ENST00000400921.6
## [4] chr1 2073986-2185190 + | 271 ENST00000461106.6
## [5] chr1 3069168-3434342 + | 352 ENST00000511072.5
## ... ... ... ... . ... ...
## [266] chrX 40051246-40177329 - | 203175 ENST00000378455.8
## [267] chrX 40051248-40177329 - | 203177 ENST00000342274.8
## [268] chrX 40062955-40177320 - | 203180 ENST00000406200.2
## [269] chrY 333963-386710 - | 206258 ENST00000390665.8
## [270] chrY 344896-386955 - | 206264 ENST00000445792.7
## cpgOverlaps
## <integer>
## [1] 15
## [2] 11
## [3] 13
## [4] 13
## [5] 32
## ... ...
## [266] 11
## [267] 11
## [268] 11
## [269] 14
## [270] 12
## -------
## seqinfo: 25 sequences (1 circular) from hg38 genome
Can you think of other situations where one might calculate derived values and couple these with GRanges
or similar objects?
Working with summarized experimental data
This section introduces another broadly useful package and data structure, the SummarizedExperiment package and SummarizedExperiment
object.
The SummarizedExperiment
object has matrix-like properties – it has two dimensions and can be subset by ‘rows’ and ‘columns’. The assay()
data of a SummarizedExperiment
experiment contains one or more matrix-like objects where rows represent features of interest (e.g., genes), columns represent samples, and elements of the matrix represent results of a genomic assay (e.g., counts of reads overlaps genes in each sample of an bulk RNA-seq differential expression assay.
Object construction
The SummarizedExperiment
coordinates assays with (optional) descriptions of rows and columns. We start by reading in a simple data.frame
describing 8 samples from an RNASeq experiment looking at dexamethasone treatment across 4 human smooth muscle cell lines; use browseVignettes("airway")
for a more complete description of the experiment and data processing. Read the column data in using file.choose()
and read.csv()
.
fname <- file.choose() # airway_colData.csv
fname
We want the first column the the data to be treated as row names (sample identifiers) in the data.frame
, so read.csv()
has an extra argument to indicate this.
colData <- read.csv(fname, row.names = 1)
colData
## SampleName cell dex albut Run avgLength Experiment
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101 SRX384357
## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98 SRX384358
## Sample BioSample
## SRR1039508 SRS508568 SAMN02422669
## SRR1039509 SRS508567 SAMN02422675
## SRR1039512 SRS508571 SAMN02422678
## SRR1039513 SRS508572 SAMN02422670
## SRR1039516 SRS508575 SAMN02422682
## SRR1039517 SRS508576 SAMN02422673
## SRR1039520 SRS508579 SAMN02422683
## SRR1039521 SRS508580 SAMN02422677
The data are from the Short Read Archive, and the row names, SampleName
, Run
, Experiment
, Sampel
, and BioSample
columns are classifications from the archive. Additional columns include:
cell
: the cell line used. There are four cell lines.dex
: whether the sample was untreated, or treated with dexamethasone.albut
: a second treatment, which we ignoreavgLength
: the sample-specific average length of the RNAseq reads estimated in the experiment.
Assay data
Now import the assay data from the file “airway_counts.csv”
fname <- file.choose() # airway_counts.csv
fname
counts <- read.csv(fname, row.names=1)
Although the data are read as a data.frame
, all columns are of the same type (integer-valued) and represent the same attribute; the data is really a matrix
rather than data.frame
, so we coerce to matrix using as.matrix()
.
counts <- as.matrix(counts)
We see the dimensions and first few rows of the counts matrix
dim(counts)
## [1] 33469 8
head(counts)
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 679 448 873 408 1138
## ENSG00000000419 467 515 621 365 587
## ENSG00000000457 260 211 263 164 245
## ENSG00000000460 60 55 40 35 78
## ENSG00000000938 0 0 2 0 1
## ENSG00000000971 3251 3679 6177 4252 6721
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 1047 770 572
## ENSG00000000419 799 417 508
## ENSG00000000457 331 233 229
## ENSG00000000460 63 76 60
## ENSG00000000938 0 0 0
## ENSG00000000971 11027 5176 7995
It’s interesting to think about what the counts mean – for ENSG00000000003, sample SRR1039508 had 679 reads that overlapped this gene, sample SRR1039509 had 448 reads, etc. Notice that for this gene there seems to be a consistent pattern – within a cell line, the read counts in the untreated group are always larger than the read counts for the treated group. This and other basic observations from ‘looking at’ the data motivate many steps in a rigorous RNASeq differential expression analysis.
Creating a SummarizedExperiment
object
We saw earlier that there was considerable value in tightly coupling the count of CpG islands overlapping each transcript with the GRanges
describing the transcripts. We can anticipate that close coupling of the column data with the assay data will have similar benefits, e.g., reducing the chances of bookkeeping errors as we work with our data.
Attach the SummarizedExperiment library to our R session.
library("SummarizedExperiment")
Use the SummarizedExperiment()
function to coordinate the assay and column data; this function uses row and column names to make sure the correct assay columns are desribed by the correct column data rows.
se <- SummarizedExperiment(assay = list(count=counts), colData = colData)
se
## class: SummarizedExperiment
## dim: 33469 8
## metadata(0):
## assays(1): count
## rownames(33469): ENSG00000000003 ENSG00000000419 ...
## ENSG00000273492 ENSG00000273493
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
It is straight-forward to use subset()
on SummarizedExperiment
to create subsets of the data in a coordinated way. Remember that a SummarizedExperiment
is conceptually two-dimensional (matrix-like), and in the example below we are subsetting on the second dimension.
subset(se, , dex == "trt")
## class: SummarizedExperiment
## dim: 33469 4
## metadata(0):
## assays(1): count
## rownames(33469): ENSG00000000003 ENSG00000000419 ...
## ENSG00000273492 ENSG00000273493
## rowData names(0):
## colnames(4): SRR1039509 SRR1039513 SRR1039517 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
As with GRanges
, there are accessors that extract data from the SummarizedExperiment
. For instance, we can use assay()
to extract the count matrix, and colSums()
to calculate the library size (total number of reads overlapping genes in each sample).
colSums(assay(se))
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
## 20637971 18809481 25348649 15163415 24448408 30818215
## SRR1039520 SRR1039521
## 19126151 21164133
Note that library sizes differ by a factor of 2 from largest to smallest; how would this influence the interpretation of counts in individual cells of the assay data?
As with GRanges
, it might be useful to remember important computations in a way that is robust, e.g.,
se$lib.size <- colSums(assay(se))
colData(se)
## DataFrame with 8 rows and 10 columns
## SampleName cell dex albut Run avgLength
## <factor> <factor> <factor> <factor> <factor> <integer>
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101
## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98
## Experiment Sample BioSample lib.size
## <factor> <factor> <factor> <numeric>
## SRR1039508 SRX384345 SRS508568 SAMN02422669 20637971
## SRR1039509 SRX384346 SRS508567 SAMN02422675 18809481
## SRR1039512 SRX384349 SRS508571 SAMN02422678 25348649
## SRR1039513 SRX384350 SRS508572 SAMN02422670 15163415
## SRR1039516 SRX384353 SRS508575 SAMN02422682 24448408
## SRR1039517 SRX384354 SRS508576 SAMN02422673 30818215
## SRR1039520 SRX384357 SRS508579 SAMN02422683 19126151
## SRR1039521 SRX384358 SRS508580 SAMN02422677 21164133
Down-stream analysis
In this final section we quickly hint at down-stream analysis, and the way in which skills learned in working with Bioconductor objects in one package translate to working with objects in other packages.
We start by loading the DESeq2 package, a very popular facility for analysing bulk RNAseq differential experssion data.
library("DESeq2")
The package requires count data like that in the SummarizedExperiment
we have been working with, in addition to a formula
describing the experimental design. Some of the observations above suggest that we should include cell line as a covariate, and dexamethazone treatment as the main factor that we are interested in.
dds <- DESeqDataSet(se, design = ~ cell + dex)
## renaming the first element in assays to 'counts'
dds
## class: DESeqDataSet
## dim: 33469 8
## metadata(1): version
## assays(1): counts
## rownames(33469): ENSG00000000003 ENSG00000000419 ...
## ENSG00000273492 ENSG00000273493
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(10): SampleName cell ... BioSample lib.size
The dds
object can be manipulated very much like a SummarizedExperiment
.
The essention DESeq work flow is summarized by a single function call, which performs advanced statistical analysis on the data in the dds
object.
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
A table summarizing measures of differential expression can be extracted from the object, and visualized or manipulated using commands we learned earlier today.
results(dds)
## log2 fold change (MLE): dex untrt vs trt
## Wald test p-value: dex untrt vs trt
## DataFrame with 33469 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## ENSG00000000003 708.602169691234 0.381253982523054 0.100654411865816
## ENSG00000000419 520.297900552084 -0.20681260108505 0.112218646507255
## ENSG00000000457 237.163036796015 -0.0379204312050838 0.143444654934085
## ENSG00000000460 57.9326331250967 0.0881681758701309 0.287141822937772
## ENSG00000000938 0.318098378392895 1.37822703433213 3.49987280259267
## ... ... ... ...
## ENSG00000273487 8.1632349843654 -1.04640654119647 0.698966822033151
## ENSG00000273488 8.58447903624707 -0.107830154768606 0.638099034888812
## ENSG00000273489 0.275899382507797 -1.48373838425916 3.51394583815493
## ENSG00000273492 0.105978355992386 0.463679009985687 3.52308267589304
## ENSG00000273493 0.106141666408122 0.521354507611989 3.53139024554344
## stat pvalue padj
## <numeric> <numeric> <numeric>
## ENSG00000000003 3.78775232457084 0.000152016273044796 0.0012742310175221
## ENSG00000000419 -1.84294328546975 0.0653372915097624 0.195433088560707
## ENSG00000000457 -0.264355832725233 0.791505741608341 0.910602138773159
## ENSG00000000460 0.307054454722321 0.758801923982939 0.893977566267194
## ENSG00000000938 0.393793463954221 0.693733530342183 NA
## ... ... ... ...
## ENSG00000273487 -1.49707612466166 0.134373451371486 0.327756206316493
## ENSG00000273488 -0.168986550477067 0.865807220420438 0.944617232242561
## ENSG00000273489 -0.422242815511986 0.672847792938571 NA
## ENSG00000273492 0.131611731157615 0.895291406125436 NA
## ENSG00000273493 0.147634351165219 0.882631343839791 NA