Introduction
This tutorial will show you how to think about and perform common genomic data wrangling tasks using the using the plyranges and GenomicRanges Bioconductor packages. These packages represent genomic data using the GRanges data structure, which provides a powerful abstraction for working with measurements or annotations along genomic regions. In this tutorial you will use GRanges object to construct workflows to
- read data from common formats like BED
- find overlaps between different sets of genomic regions
- compute coverage and related summaries
Who is this tutorial is for?
If you are an R beginner or new to the Bioconductor ecosystem, this tutorial provides a comprehensive overview of one of the most important data structures in Bioconductor, GRanges. We show you how to think of this data structure as a special kind of table and perform all kinds of operations on it.
If you are familiar with the tidyverse suite of packages, then the operations and workflows we will develop will look somewhat familiar to you. However, due to the complexity of genomics data, plyranges develops a new grammar based on the tidyverse to meet the needs of the computational biologist.
We assume you know what a data.frame is and have some knowledge of vector types in R.
The examples and and exercises are based on the HelloRanges package tutorial. If you are familiar with bedtools, this is an excellent resource for learning how similar operations are performed with Bioconductor packages.
Setup
You will need R (version 3.5 or greater) with the following packages installed:
# Bioconductor
library(GenomicRanges)
library(plyranges)
library(HelloRangesData)
If you do not have these packages you can install them as follows:
# -> Bioconductor packages are installed with this package
install.packages("BiocManager")
# then use the install function from BiocManager
BiocManager::install(c("GenomicRanges", "plyranges", "HelloRangesData"))
Introducing GRanges
The GRanges is a fundamental data structure used in the Bioconductor ecosystem. It is very similar to the base R data.frame but with appropriate semantics for a genomics experiment: it has fixed columns for the chromosome, start and end coordinates, and the strand, along with an arbitrary set of additional columns, consisting of measurements or metadata specific to the data type or experiment.
GRanges balances flexibility with formal constraints, so that it is applicable to virtually any genomic workflow, while also being semantically rich enough to support high-level operations on genomic ranges.
We can create a GRanges from an ordinary data.frame with plyranges using
the as_granges()
function:
library(plyranges, quietly = TRUE)
genes <- data.frame(seqnames = "VI",
start = c(3322, 3030, 1437, 5066, 6426, 836),
end = c(3846, 3338, 2615, 5521, 7565, 1363),
strand = c("-", "-", "-", "+", "+", "+"),
gene_id=c("YFL064C", "YFL065C", "YFL066C",
"YFL063W", "YFL062W", "YFL067W"),
gc = runif(6),
score = rbinom(6, size = 3, p = 0.1),
stringsAsFactors = FALSE)
genes <- as_granges(genes)
genes
## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id gc score
## <Rle> <IRanges> <Rle> | <character> <numeric> <integer>
## [1] VI 3322-3846 - | YFL064C 0.672622985672206 0
## [2] VI 3030-3338 - | YFL065C 0.168389319442213 0
## [3] VI 1437-2615 - | YFL066C 0.081997754285112 1
## [4] VI 5066-5521 + | YFL063W 0.358257586602122 0
## [5] VI 6426-7565 + | YFL062W 0.557314032688737 1
## [6] VI 836-1363 + | YFL067W 0.826518925838172 0
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
You might notice a few differences between how the genes
GRanges
object is printed compared to an ordinary data.frame or tibble.
The measurements/annotation columns like gene_id, gc and score are placed on the right hand side of the “|”. These are referred to as metadata columns; they are not strictly required to construct a GRanges object but are useful for providing biological context.
The required part of a GRanges object are left of the “|” character. These are the seqnames column, the ranges (start, end, width) column, and the strand column.
You may have also noticed that underneath each column name is the type of
that column. The metadata columnns are of some usual base R types like
character
, numeric
, and integer
but the core columns may be
unfamiliar. The seqnames and strand column are (factor) Rle
columns.
Rle
stands for run length encoding, and refers to how a column is stored.
For now, there is no harm in thinking about these as ordinary factors.
The ranges column is an IRanges
which really consists
of three columns: the start, end and width coordinates of the range.
This becomes apparent if you look back to our original data.frame - the start
and end columns from there are just the start and end intervals of the ranges
column.
Do not be concerned if this a bit confusing for now, once we start playing with our GRanges object, everything will become clearer.
GRanges transformation and import
Once you have obtained data from an experiment or a public repository, you often want to compute summaries, create new columns, modify the genomic regions or just rearrange the data. Here you will learn how to achieve these tasks using the plyranges:
- Modify, subset, and aggregate genomic data with the
mutate()
,filter()
, andsummarise()
. - Modify genomic regions with the
mutate()
… - And while fixing the start/end/center coordinates with the
anchor_*
family of functions. - Merge all overlapping and adjacent genomic regions with
reduce_ranges()
. - Merge the end points of all genomic regions with
disjoin_ranges()
. - Sort genomic ranges with
arrange()
.
Any of the above operations can be performed on subsets of the
data with group_by()
.
Getting started
To begin we will test out some of these operations using a
GRanges object obtained from a BED file
containing annotated exons from RefSeq from the hg19 genome build.
To load the file as GRanges we can simply use the read_bed
function from
plyranges.
exons_bed <- system.file("extdata", "exons.bed", package="HelloRangesData")
# our genome build, requires internet connection
# build <- genome_info("hg19")
build <- system.file("extdata", "hg19.genome", package="HelloRangesData") %>%
read.delim(header = FALSE, col.names = c("seqnames", "width")) %>%
as_granges(start = 1) %>%
set_genome_info(genome = "hg19",
seqlengths = width(.))
exons <- read_bed(exons_bed,
genome_info = build)
exons
## GRanges object with 459752 ranges and 2 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chr1 11874-12227 + |
## [2] chr1 12613-12721 + |
## [3] chr1 13221-14409 + |
## [4] chr1 14362-14829 - |
## [5] chr1 14970-15038 - |
## ... ... ... ... .
## [459748] chrY 59338754-59338859 + |
## [459749] chrY 59338754-59338859 + |
## [459750] chrY 59340194-59340278 + |
## [459751] chrY 59342487-59343488 + |
## [459752] chrY 59342487-59343488 + |
## name score
## <character> <numeric>
## [1] NR_046018_exon_0_0_chr1_11874_f 0
## [2] NR_046018_exon_1_0_chr1_12613_f 0
## [3] NR_046018_exon_2_0_chr1_13221_f 0
## [4] NR_024540_exon_0_0_chr1_14362_r 0
## [5] NR_024540_exon_1_0_chr1_14970_r 0
## ... ... ...
## [459748] NM_002186_exon_6_0_chrY_59338754_f 0
## [459749] NM_176786_exon_7_0_chrY_59338754_f 0
## [459750] NM_002186_exon_7_0_chrY_59340194_f 0
## [459751] NM_002186_exon_8_0_chrY_59342487_f 0
## [459752] NM_176786_exon_8_0_chrY_59342487_f 0
## -------
## seqinfo: 93 sequences from hg19 genome
The genome_info()
function automatically propagates the reference
annotations from the hg19 reference as a GRanges. This is important for
ensuring the integrity of any operations involving modifications to
genomic regions, like overlaps or nearest neighbours.
Filter ranges with filter()
We can filter()
ranges based on column values being TRUE
. This
requires passing the ranges, along with one or more logical expressions.
For example, we can restrict exons to those that are on the the X chromosome:
filter(exons, seqnames == "chrX")
## GRanges object with 18173 ranges and 2 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chrX 192991-193061 + |
## [2] chrX 198061-198351 + |
## [3] chrX 198149-198351 + |
## [4] chrX 200834-200981 + |
## [5] chrX 200834-200981 + |
## ... ... ... ... .
## [18169] chrX 155235748-155235853 + |
## [18170] chrX 155235748-155235853 + |
## [18171] chrX 155237188-155237272 + |
## [18172] chrX 155239481-155240482 + |
## [18173] chrX 155239481-155240482 + |
## name score
## <character> <numeric>
## [1] NR_028057_exon_0_0_chrX_192991_f 0
## [2] NM_018390_exon_0_0_chrX_198061_f 0
## [3] NR_028057_exon_1_0_chrX_198149_f 0
## [4] NM_018390_exon_1_0_chrX_200834_f 0
## [5] NR_028057_exon_2_0_chrX_200834_f 0
## ... ... ...
## [18169] NM_002186_exon_6_0_chrX_155235748_f 0
## [18170] NM_176786_exon_7_0_chrX_155235748_f 0
## [18171] NM_002186_exon_7_0_chrX_155237188_f 0
## [18172] NM_002186_exon_8_0_chrX_155239481_f 0
## [18173] NM_176786_exon_8_0_chrX_155239481_f 0
## -------
## seqinfo: 93 sequences from hg19 genome
Multiple conditions can be composed together within a call to filter by separating the expressions with a comma. To restrict exons to those that are on the X chromosome AND on the positive strand:
filter(exons, seqnames == "chrX", strand == "+")
## GRanges object with 9176 ranges and 2 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chrX 192991-193061 + |
## [2] chrX 198061-198351 + |
## [3] chrX 198149-198351 + |
## [4] chrX 200834-200981 + |
## [5] chrX 200834-200981 + |
## ... ... ... ... .
## [9172] chrX 155235748-155235853 + |
## [9173] chrX 155235748-155235853 + |
## [9174] chrX 155237188-155237272 + |
## [9175] chrX 155239481-155240482 + |
## [9176] chrX 155239481-155240482 + |
## name score
## <character> <numeric>
## [1] NR_028057_exon_0_0_chrX_192991_f 0
## [2] NM_018390_exon_0_0_chrX_198061_f 0
## [3] NR_028057_exon_1_0_chrX_198149_f 0
## [4] NM_018390_exon_1_0_chrX_200834_f 0
## [5] NR_028057_exon_2_0_chrX_200834_f 0
## ... ... ...
## [9172] NM_002186_exon_6_0_chrX_155235748_f 0
## [9173] NM_176786_exon_7_0_chrX_155235748_f 0
## [9174] NM_002186_exon_7_0_chrX_155237188_f 0
## [9175] NM_002186_exon_8_0_chrX_155239481_f 0
## [9176] NM_176786_exon_8_0_chrX_155239481_f 0
## -------
## seqinfo: 93 sequences from hg19 genome
This is equivalent to
filter(exons, seqnames == "chrX" & strand == "+")
## GRanges object with 9176 ranges and 2 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chrX 192991-193061 + |
## [2] chrX 198061-198351 + |
## [3] chrX 198149-198351 + |
## [4] chrX 200834-200981 + |
## [5] chrX 200834-200981 + |
## ... ... ... ... .
## [9172] chrX 155235748-155235853 + |
## [9173] chrX 155235748-155235853 + |
## [9174] chrX 155237188-155237272 + |
## [9175] chrX 155239481-155240482 + |
## [9176] chrX 155239481-155240482 + |
## name score
## <character> <numeric>
## [1] NR_028057_exon_0_0_chrX_192991_f 0
## [2] NM_018390_exon_0_0_chrX_198061_f 0
## [3] NR_028057_exon_1_0_chrX_198149_f 0
## [4] NM_018390_exon_1_0_chrX_200834_f 0
## [5] NR_028057_exon_2_0_chrX_200834_f 0
## ... ... ...
## [9172] NM_002186_exon_6_0_chrX_155235748_f 0
## [9173] NM_176786_exon_7_0_chrX_155235748_f 0
## [9174] NM_002186_exon_7_0_chrX_155237188_f 0
## [9175] NM_002186_exon_8_0_chrX_155239481_f 0
## [9176] NM_176786_exon_8_0_chrX_155239481_f 0
## -------
## seqinfo: 93 sequences from hg19 genome
But not the same as filtering exons to those on the X chromosome or positively stranded.
filter(exons, seqnames == "chrX" | strand == "+")
## GRanges object with 241162 ranges and 2 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chr1 11874-12227 + |
## [2] chr1 12613-12721 + |
## [3] chr1 13221-14409 + |
## [4] chr1 69091-70008 + |
## [5] chr1 323892-324060 + |
## ... ... ... ... .
## [241158] chrY 59338754-59338859 + |
## [241159] chrY 59338754-59338859 + |
## [241160] chrY 59340194-59340278 + |
## [241161] chrY 59342487-59343488 + |
## [241162] chrY 59342487-59343488 + |
## name score
## <character> <numeric>
## [1] NR_046018_exon_0_0_chr1_11874_f 0
## [2] NR_046018_exon_1_0_chr1_12613_f 0
## [3] NR_046018_exon_2_0_chr1_13221_f 0
## [4] NM_001005484_exon_0_0_chr1_69091_f 0
## [5] NR_028322_exon_0_0_chr1_323892_f 0
## ... ... ...
## [241158] NM_002186_exon_6_0_chrY_59338754_f 0
## [241159] NM_176786_exon_7_0_chrY_59338754_f 0
## [241160] NM_002186_exon_7_0_chrY_59340194_f 0
## [241161] NM_002186_exon_8_0_chrY_59342487_f 0
## [241162] NM_176786_exon_8_0_chrY_59342487_f 0
## -------
## seqinfo: 93 sequences from hg19 genome
For a comprehensive overview of all R’s logical operators see section 5.2.2 of R for Data Science.
Overlap based filtering
We can also restrict ranges by overlaps using filter_by_overlaps()
,
or exclude non overlapping features using filter_by_non_overlaps()
.
To see how these work it’ll be useful to load another GRanges object, this time from a BED file containing CpG islands.
cpg_bed <- system.file("extdata", "cpg.bed", package = "HelloRangesData")
cpg <- read_bed(cpg_bed,
genome_info = build)
cpg
## GRanges object with 28691 ranges and 1 metadata column:
## seqnames ranges strand | name
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 28736-29810 * | CpG:_116
## [2] chr1 135125-135563 * | CpG:_30
## [3] chr1 327791-328229 * | CpG:_29
## [4] chr1 437152-438164 * | CpG:_84
## [5] chr1 449274-450544 * | CpG:_99
## ... ... ... ... . ...
## [28687] chrY 27610116-27611088 * | CpG:_76
## [28688] chrY 28555536-28555932 * | CpG:_32
## [28689] chrY 28773316-28773544 * | CpG:_25
## [28690] chrY 59213795-59214183 * | CpG:_36
## [28691] chrY 59349267-59349574 * | CpG:_29
## -------
## seqinfo: 93 sequences from hg19 genome
To find exons that do not overlap any cpg islands we
filter_by_non_overlaps(exons, cpg)
## GRanges object with 414647 ranges and 2 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chr1 11874-12227 + |
## [2] chr1 12613-12721 + |
## [3] chr1 13221-14409 + |
## [4] chr1 14362-14829 - |
## [5] chr1 14970-15038 - |
## ... ... ... ... .
## [414643] chrY 59338754-59338859 + |
## [414644] chrY 59338754-59338859 + |
## [414645] chrY 59340194-59340278 + |
## [414646] chrY 59342487-59343488 + |
## [414647] chrY 59342487-59343488 + |
## name score
## <character> <numeric>
## [1] NR_046018_exon_0_0_chr1_11874_f 0
## [2] NR_046018_exon_1_0_chr1_12613_f 0
## [3] NR_046018_exon_2_0_chr1_13221_f 0
## [4] NR_024540_exon_0_0_chr1_14362_r 0
## [5] NR_024540_exon_1_0_chr1_14970_r 0
## ... ... ...
## [414643] NM_002186_exon_6_0_chrY_59338754_f 0
## [414644] NM_176786_exon_7_0_chrY_59338754_f 0
## [414645] NM_002186_exon_7_0_chrY_59340194_f 0
## [414646] NM_002186_exon_8_0_chrY_59342487_f 0
## [414647] NM_176786_exon_8_0_chrY_59342487_f 0
## -------
## seqinfo: 93 sequences from hg19 genome
Or we could find the cpg islands that are overlapped by exons
filter_by_overlaps(cpg, exons)
## GRanges object with 18845 ranges and 1 metadata column:
## seqnames ranges strand | name
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 28736-29810 * | CpG:_116
## [2] chr1 135125-135563 * | CpG:_30
## [3] chr1 327791-328229 * | CpG:_29
## [4] chr1 713985-714547 * | CpG:_60
## [5] chr1 762417-763445 * | CpG:_115
## ... ... ... ... . ...
## [18841] chrY 25345105-25345331 * | CpG:_21
## [18842] chrY 25365504-25365730 * | CpG:_21
## [18843] chrY 26959490-26959716 * | CpG:_21
## [18844] chrY 26979890-26980116 * | CpG:_21
## [18845] chrY 59213795-59214183 * | CpG:_36
## -------
## seqinfo: 93 sequences from hg19 genome
Modify metadata with mutate()
New metadata columns can be created using the mutate()
function. The first
argument is the GRanges object we are mutating and the remaining arguments
are name-value pairs, where name refers to name of the new column being
created and value refers to the value the new column will take.
As an example, we make a new column called total_length which contains the number of bases of in each CpG island. This is just a copy of the width column in Ranges object.
mutate(cpg, total_length = width)
## GRanges object with 28691 ranges and 2 metadata columns:
## seqnames ranges strand | name total_length
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr1 28736-29810 * | CpG:_116 1075
## [2] chr1 135125-135563 * | CpG:_30 439
## [3] chr1 327791-328229 * | CpG:_29 439
## [4] chr1 437152-438164 * | CpG:_84 1013
## [5] chr1 449274-450544 * | CpG:_99 1271
## ... ... ... ... . ... ...
## [28687] chrY 27610116-27611088 * | CpG:_76 973
## [28688] chrY 28555536-28555932 * | CpG:_32 397
## [28689] chrY 28773316-28773544 * | CpG:_25 229
## [28690] chrY 59213795-59214183 * | CpG:_36 389
## [28691] chrY 59349267-59349574 * | CpG:_29 308
## -------
## seqinfo: 93 sequences from hg19 genome
Multiple columns can be computed inside of a call to mutate()
by separating
each pair of name-value expressions with a comma, we could add an additional
column called mid, which is the midpoint of the CpG island.
mutate(cpg,
total_length = width,
mid = start + (width - 1) %/% 2
)
## GRanges object with 28691 ranges and 3 metadata columns:
## seqnames ranges strand | name total_length
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr1 28736-29810 * | CpG:_116 1075
## [2] chr1 135125-135563 * | CpG:_30 439
## [3] chr1 327791-328229 * | CpG:_29 439
## [4] chr1 437152-438164 * | CpG:_84 1013
## [5] chr1 449274-450544 * | CpG:_99 1271
## ... ... ... ... . ... ...
## [28687] chrY 27610116-27611088 * | CpG:_76 973
## [28688] chrY 28555536-28555932 * | CpG:_32 397
## [28689] chrY 28773316-28773544 * | CpG:_25 229
## [28690] chrY 59213795-59214183 * | CpG:_36 389
## [28691] chrY 59349267-59349574 * | CpG:_29 308
## mid
## <numeric>
## [1] 29273
## [2] 135344
## [3] 328010
## [4] 437658
## [5] 449909
## ... ...
## [28687] 27610602
## [28688] 28555734
## [28689] 28773430
## [28690] 59213989
## [28691] 59349420
## -------
## seqinfo: 93 sequences from hg19 genome
Moreover, mutate()
keeps track of which columnns have been created, so
you can create new columns using pre-existing ones.
mutate(cpg,
total_length = width,
mid = start + (width - 1) %/% 2,
mid2 = sqrt(mid)
)
## GRanges object with 28691 ranges and 4 metadata columns:
## seqnames ranges strand | name total_length
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr1 28736-29810 * | CpG:_116 1075
## [2] chr1 135125-135563 * | CpG:_30 439
## [3] chr1 327791-328229 * | CpG:_29 439
## [4] chr1 437152-438164 * | CpG:_84 1013
## [5] chr1 449274-450544 * | CpG:_99 1271
## ... ... ... ... . ... ...
## [28687] chrY 27610116-27611088 * | CpG:_76 973
## [28688] chrY 28555536-28555932 * | CpG:_32 397
## [28689] chrY 28773316-28773544 * | CpG:_25 229
## [28690] chrY 59213795-59214183 * | CpG:_36 389
## [28691] chrY 59349267-59349574 * | CpG:_29 308
## mid mid2
## <numeric> <numeric>
## [1] 29273 171.09354166654
## [2] 135344 367.891288290441
## [3] 328010 572.72157284321
## [4] 437658 661.557253758131
## [5] 449909 670.752562425221
## ... ... ...
## [28687] 27610602 5254.57914584984
## [28688] 28555734 5343.75654385564
## [28689] 28773430 5364.08706118758
## [28690] 59213989 7695.06263782173
## [28691] 59349420 7703.85747531715
## -------
## seqinfo: 93 sequences from hg19 genome
In plyranges and GenomicRanges there are additional helpers that are useful for column creation during data analysis. As an example, we can add the count of the number of times each cpg island overlaps an exon and add a logical column asking if a cpg island is contained entirely within an exon.
cpg2 <- mutate(cpg,
n_olap = count_overlaps(cpg, exons),
is_contained = cpg %within% exons
)
cpg2
## GRanges object with 28691 ranges and 3 metadata columns:
## seqnames ranges strand | name n_olap
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr1 28736-29810 * | CpG:_116 1
## [2] chr1 135125-135563 * | CpG:_30 1
## [3] chr1 327791-328229 * | CpG:_29 3
## [4] chr1 437152-438164 * | CpG:_84 0
## [5] chr1 449274-450544 * | CpG:_99 0
## ... ... ... ... . ... ...
## [28687] chrY 27610116-27611088 * | CpG:_76 0
## [28688] chrY 28555536-28555932 * | CpG:_32 0
## [28689] chrY 28773316-28773544 * | CpG:_25 0
## [28690] chrY 59213795-59214183 * | CpG:_36 5
## [28691] chrY 59349267-59349574 * | CpG:_29 0
## is_contained
## <logical>
## [1] FALSE
## [2] TRUE
## [3] TRUE
## [4] FALSE
## [5] FALSE
## ... ...
## [28687] FALSE
## [28688] FALSE
## [28689] FALSE
## [28690] FALSE
## [28691] FALSE
## -------
## seqinfo: 93 sequences from hg19 genome
Composing workflows with the pipe %>%
Now that you can use filter()
and mutate()
, you might like to start
thinking about how you can combine them to form workflows. One way we can do
this is with the pipe operator, %>%
.
The pipe operator is read as the word then:
# take cpg then filter
cpg %>%
filter(seqnames == "chrX")
## GRanges object with 896 ranges and 1 metadata column:
## seqnames ranges strand | name
## <Rle> <IRanges> <Rle> | <character>
## [1] chrX 64182-64793 * | CpG:_62
## [2] chrX 69134-70029 * | CpG:_100
## [3] chrX 148686-149461 * | CpG:_85
## [4] chrX 166505-167721 * | CpG:_96
## [5] chrX 170415-170686 * | CpG:_27
## ... ... ... ... . ...
## [892] chrX 154687388-154688519 * | CpG:_131
## [893] chrX 154689347-154689788 * | CpG:_41
## [894] chrX 154842113-154842719 * | CpG:_44
## [895] chrX 155110789-155111177 * | CpG:_36
## [896] chrX 155246261-155246568 * | CpG:_29
## -------
## seqinfo: 93 sequences from hg19 genome
The argument on the left hand side of the pipe is forwarded to the first
argument of the function on the right hand side. This can be made explicit
with a .
cpg %>%
filter(., seqnames == "chrX")
## GRanges object with 896 ranges and 1 metadata column:
## seqnames ranges strand | name
## <Rle> <IRanges> <Rle> | <character>
## [1] chrX 64182-64793 * | CpG:_62
## [2] chrX 69134-70029 * | CpG:_100
## [3] chrX 148686-149461 * | CpG:_85
## [4] chrX 166505-167721 * | CpG:_96
## [5] chrX 170415-170686 * | CpG:_27
## ... ... ... ... . ...
## [892] chrX 154687388-154688519 * | CpG:_131
## [893] chrX 154689347-154689788 * | CpG:_41
## [894] chrX 154842113-154842719 * | CpG:_44
## [895] chrX 155110789-155111177 * | CpG:_36
## [896] chrX 155246261-155246568 * | CpG:_29
## -------
## seqinfo: 93 sequences from hg19 genome
As long as the the result of applying the pipe returns a GRanges object, multiple functions can be strung together:
cpg %>%
filter(seqnames == "chrX") %>%
filter(strand == "+") %>%
mutate(new_col = runif(n()))
## GRanges object with 0 ranges and 2 metadata columns:
## seqnames ranges strand | name new_col
## <Rle> <IRanges> <Rle> | <character> <numeric>
## -------
## seqinfo: 93 sequences from hg19 genome
The use of the .
can also make pipelines more succinct, as the argument
on the left hand side can be forwarded to multiple parts of the function
on the right hand side. We can rewrite, the creation of cpg2 above more
compactly as
cpg %>%
mutate(n_olap = count_overlaps(., exons),
is_contained = . %within% exons)
## GRanges object with 28691 ranges and 3 metadata columns:
## seqnames ranges strand | name n_olap
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr1 28736-29810 * | CpG:_116 1
## [2] chr1 135125-135563 * | CpG:_30 1
## [3] chr1 327791-328229 * | CpG:_29 3
## [4] chr1 437152-438164 * | CpG:_84 0
## [5] chr1 449274-450544 * | CpG:_99 0
## ... ... ... ... . ... ...
## [28687] chrY 27610116-27611088 * | CpG:_76 0
## [28688] chrY 28555536-28555932 * | CpG:_32 0
## [28689] chrY 28773316-28773544 * | CpG:_25 0
## [28690] chrY 59213795-59214183 * | CpG:_36 5
## [28691] chrY 59349267-59349574 * | CpG:_29 0
## is_contained
## <logical>
## [1] FALSE
## [2] TRUE
## [3] TRUE
## [4] FALSE
## [5] FALSE
## ... ...
## [28687] FALSE
## [28688] FALSE
## [28689] FALSE
## [28690] FALSE
## [28691] FALSE
## -------
## seqinfo: 93 sequences from hg19 genome
Exercises
- Modify the above example so only CpG islands that are completely within exons are included in the result.
- Create a new column in exons called tx_id (the transcript id is everything before _exon in name).
Modifying genomic regions with mutate()
We have learnt in the above sections how to add new metadata columns with
mutate()
but we can also use `mutate() to modify the genomic regions.
Recall that there are three components that make up a genomic region; the
start, end, and the width.
Let’s go back to our original GRanges of yeast genes. What happens if we add 10 bases to the width?
genes
## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id gc score
## <Rle> <IRanges> <Rle> | <character> <numeric> <integer>
## [1] VI 3322-3846 - | YFL064C 0.672622985672206 0
## [2] VI 3030-3338 - | YFL065C 0.168389319442213 0
## [3] VI 1437-2615 - | YFL066C 0.081997754285112 1
## [4] VI 5066-5521 + | YFL063W 0.358257586602122 0
## [5] VI 6426-7565 + | YFL062W 0.557314032688737 1
## [6] VI 836-1363 + | YFL067W 0.826518925838172 0
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
genes %>%
mutate(width = width + 10L)
## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id gc score
## <Rle> <IRanges> <Rle> | <character> <numeric> <integer>
## [1] VI 3322-3856 - | YFL064C 0.672622985672206 0
## [2] VI 3030-3348 - | YFL065C 0.168389319442213 0
## [3] VI 1437-2625 - | YFL066C 0.081997754285112 1
## [4] VI 5066-5531 + | YFL063W 0.358257586602122 0
## [5] VI 6426-7575 + | YFL062W 0.557314032688737 1
## [6] VI 836-1373 + | YFL067W 0.826518925838172 0
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
All three components are mutually dependent, so in order to preserve the integrity of the GRanges structure, either the start or the end positions must change. In the case of adding to the width position by default the start is fixed and the end is increased. But what if we would like to modify the width but leave the end of fixed? Or the midpoint fixed?
The solution is to anchor those positions (i.e. leave them fixed),
using the anchor_*()
family of functions:
# fix the end
genes %>%
anchor_end() %>%
mutate(width = width + 10L)
## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id gc score
## <Rle> <IRanges> <Rle> | <character> <numeric> <integer>
## [1] VI 3312-3846 - | YFL064C 0.672622985672206 0
## [2] VI 3020-3338 - | YFL065C 0.168389319442213 0
## [3] VI 1427-2615 - | YFL066C 0.081997754285112 1
## [4] VI 5056-5521 + | YFL063W 0.358257586602122 0
## [5] VI 6416-7565 + | YFL062W 0.557314032688737 1
## [6] VI 826-1363 + | YFL067W 0.826518925838172 0
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
# fix the centre
genes %>%
anchor_centre() %>%
mutate(width = width + 10L)
## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id gc score
## <Rle> <IRanges> <Rle> | <character> <numeric> <integer>
## [1] VI 3317-3851 - | YFL064C 0.672622985672206 0
## [2] VI 3025-3343 - | YFL065C 0.168389319442213 0
## [3] VI 1432-2620 - | YFL066C 0.081997754285112 1
## [4] VI 5061-5526 + | YFL063W 0.358257586602122 0
## [5] VI 6421-7570 + | YFL062W 0.557314032688737 1
## [6] VI 831-1368 + | YFL067W 0.826518925838172 0
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
There are additional anchors that account for the strandedness of features,
we can fix the 3’ or 5’ parts of a GRanges object using anchor_3p()
or anchor_5p()
, respectively.
The plyranges and GenomicRanges API also have utilities to modify
genomic regions via standard set operations like taking the complement
(complement_ranges()
) which will find gaps, or finding flanking ranges
(see flank_left()
and friends) or promoters (GenomicRanges::promoters()
).
Exercises
- Create a new GRanges object from CpG islands that stretches the intervals on either side by their width while leaving the centre of the interval fixed.
- Create a new GRanges object from exons that has only non-exonic regions.
- Create a new GRanges object that represent 2bp canonical splice sites on either side of exon (hint: you can do this with anchoring and mutate or with flank).
Summarise annotations and measurements with summarise()
We can compute summarise columns GRanges object using summarise()
. Using
summarise()
is very similiar to mutate()
, we construct new columns
that are summaries of the columns in the input the GRanges.
exons %>%
summarise(
count = n(),
median_width = median(width)
)
## DataFrame with 1 row and 2 columns
## count median_width
## <integer> <numeric>
## 1 459752 133
The main difference is that by summarising our data, we lose the necessary
information required to form a valid GRanges. Instead, summarise()
returns
a DataFrame
(see S4Vectors::DataFrame()
for more information). The
DataFrame
is the Bioconductor version of the base R data.frame. It works
mostly the same way as the base R data.frame
but can also hold data structures
that are common in Bioconductor (like Rle
mentioned above).
Exercises
- What is the average length of CpG islands accross the genome?
- How many distinct transcript identifiers are there in exons?
Summarise over the genome with reduce_ranges()
and disjoin_ranges()
To preserve the GRanges data structure while performing summaries, it is often useful to either merge overlapping ranges (called reducing) or expand ranges by finding the union of the end points or ranges (called disjoining).
To make this more concrete let’s reduce_ranges()
and disjoin_ranges()
on the genes
object we defined above.
# to make things easier to compare
# sort by the start and end coordinates
genes %>%
arrange(start, end)
## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id gc score
## <Rle> <IRanges> <Rle> | <character> <numeric> <integer>
## [1] VI 836-1363 + | YFL067W 0.826518925838172 0
## [2] VI 1437-2615 - | YFL066C 0.081997754285112 1
## [3] VI 3030-3338 - | YFL065C 0.168389319442213 0
## [4] VI 3322-3846 - | YFL064C 0.672622985672206 0
## [5] VI 5066-5521 + | YFL063W 0.358257586602122 0
## [6] VI 6426-7565 + | YFL062W 0.557314032688737 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
# reduce_ranges will merge the 3rd and 4th ranges above
genes %>%
reduce_ranges()
## GRanges object with 5 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] VI 836-1363 *
## [2] VI 1437-2615 *
## [3] VI 3030-3846 *
## [4] VI 5066-5521 *
## [5] VI 6426-7565 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
# disjoin ranges will split the 3rd and 4th ranges into
# 3 disjoint ranges.
genes %>%
disjoin_ranges()
## GRanges object with 7 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] VI 836-1363 *
## [2] VI 1437-2615 *
## [3] VI 3030-3321 *
## [4] VI 3322-3338 *
## [5] VI 3339-3846 *
## [6] VI 5066-5521 *
## [7] VI 6426-7565 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Note that by default, reduce_ranges()
and disjoin_ranges()
will discard
any strand information. To preserve them use reduce_ranges_directed()
and
disjoin_ranges_directed()
.
The reduce_ranges()
and disjoin_ranges()
can perform aggregation too.
For example, we can reduce exons and count the number of exons
that were merged together, and concatenate the names of the exons that were
merged.
exons %>%
reduce_ranges(
count = n(),
name = paste(name, collapse = ",")
)
## GRanges object with 229241 ranges and 2 metadata columns:
## seqnames ranges strand | count
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr1 11874-12227 * | 1
## [2] chr1 12613-12721 * | 1
## [3] chr1 13221-14829 * | 2
## [4] chr1 14970-15038 * | 1
## [5] chr1 15796-15947 * | 1
## ... ... ... ... . ...
## [229237] chrY 59337091-59337236 * | 2
## [229238] chrY 59337949-59338150 * | 2
## [229239] chrY 59338754-59338859 * | 2
## [229240] chrY 59340194-59340278 * | 1
## [229241] chrY 59342487-59343488 * | 2
## name
## <character>
## [1] NR_046018_exon_0_0_chr1_11874_f
## [2] NR_046018_exon_1_0_chr1_12613_f
## [3] NR_046018_exon_2_0_chr1_13221_f,NR_024540_exon_0_0_chr1_14362_r
## [4] NR_024540_exon_1_0_chr1_14970_r
## [5] NR_024540_exon_2_0_chr1_15796_r
## ... ...
## [229237] NM_002186_exon_4_0_chrY_59337091_f,NM_176786_exon_5_0_chrY_59337120_f
## [229238] NM_002186_exon_5_0_chrY_59337949_f,NM_176786_exon_6_0_chrY_59337949_f
## [229239] NM_002186_exon_6_0_chrY_59338754_f,NM_176786_exon_7_0_chrY_59338754_f
## [229240] NM_002186_exon_7_0_chrY_59340194_f
## [229241] NM_002186_exon_8_0_chrY_59342487_f,NM_176786_exon_8_0_chrY_59342487_f
## -------
## seqinfo: 93 sequences from hg19 genome
Exercises
- Reduce CpG islands, what is the range that has the maximum number of nearby features?
- Confirm your results in 1. using
filter_by_overlaps()
.
Analyses over groups with group_by()
Any of the above operations can be performed over subsets of the
GRanges
object with group_by()
, this often useful if you want to
filter()
/mutate()
/summarise()
/reduce_ranges()
/disjoin_ranges()
over annotation identifiers or chromosomes.
exons %>%
group_by(seqnames) %>%
summarise(
count = n(),
median_width = median(width)
)
## DataFrame with 49 rows and 3 columns
## seqnames count median_width
## <Rle> <integer> <numeric>
## 1 chr1 43366 137
## 2 chr1_gl000191_random 42 163
## 3 chr1_gl000192_random 46 164
## 4 chr10 19420 131
## 5 chr11 24476 134
## ... ... ... ...
## 45 chrUn_gl000222 20 160
## 46 chrUn_gl000223 22 179.5
## 47 chrUn_gl000228 85 1380
## 48 chrX 18173 133
## 49 chrY 4128 118
Exercises
Let’s read in another BED file that has functional annotations output by chromHMM from ENCODE human embryonic stem cell ChIP-seq data.
hesc_bed <- system.file("extdata", "hesc.chromHmm.bed",
package="HelloRangesData")
hesc <- read_bed(hesc_bed, genome_info = build)
hesc
## GRanges object with 619061 ranges and 1 metadata column:
## seqnames ranges strand | name
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 10001-10600 * | 15_Repetitive/CNV
## [2] chr1 10601-11137 * | 13_Heterochrom/lo
## [3] chr1 11138-11537 * | 8_Insulator
## [4] chr1 11538-11937 * | 11_Weak_Txn
## [5] chr1 11938-12137 * | 14_Repetitive/CNV
## ... ... ... ... . ...
## [619057] chrX 155256607-155258006 * | 11_Weak_Txn
## [619058] chrX 155258007-155258606 * | 8_Insulator
## [619059] chrX 155258607-155259606 * | 11_Weak_Txn
## [619060] chrX 155259607-155259806 * | 14_Repetitive/CNV
## [619061] chrX 155259807-155260406 * | 15_Repetitive/CNV
## -------
## seqinfo: 93 sequences from hg19 genome
The name column is the predicted state of a range from chromHMM.
Use the hesc GRanges object to answer the following:
- How many ranges are there for each predicted state?
- How many base pairs are represented in the genome for each predicted state? Which state has the maximum number of bases?
Finding overlaps with the join operator
A common operation on two GRanges is to find overlaps and to keep metadata associated with both GRanges objects when an overlap has occurred. We refer to this operation as an overlap join. There are three types of overlap join: inner (A), intersect (B), and left (C) that are shown on the figure below.
knitr::include_graphics(file.path(fig_path, "olaps.png"))
The first type of overlap join is called join_overlap_inner()
, we could
use this to find which CpG islands overlap exons
cpg2 <- join_overlap_inner(cpg, exons)
cpg2
## GRanges object with 45500 ranges and 3 metadata columns:
## seqnames ranges strand | name.x
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 28736-29810 * | CpG:_116
## [2] chr1 135125-135563 * | CpG:_30
## [3] chr1 327791-328229 * | CpG:_29
## [4] chr1 327791-328229 * | CpG:_29
## [5] chr1 327791-328229 * | CpG:_29
## ... ... ... ... . ...
## [45496] chrY 59213795-59214183 * | CpG:_36
## [45497] chrY 59213795-59214183 * | CpG:_36
## [45498] chrY 59213795-59214183 * | CpG:_36
## [45499] chrY 59213795-59214183 * | CpG:_36
## [45500] chrY 59213795-59214183 * | CpG:_36
## name.y score
## <character> <numeric>
## [1] NR_024540_exon_10_0_chr1_29321_r 0
## [2] NR_039983_exon_0_0_chr1_134773_r 0
## [3] NR_028322_exon_2_0_chr1_324439_f 0
## [4] NR_028325_exon_2_0_chr1_324439_f 0
## [5] NR_028327_exon_3_0_chr1_327036_f 0
## ... ... ...
## [45496] NM_001145149_exon_0_0_chrY_59213949_f 0
## [45497] NM_001185183_exon_0_0_chrY_59213949_f 0
## [45498] NM_005638_exon_0_0_chrY_59213949_f 0
## [45499] NR_033714_exon_0_0_chrY_59213949_f 0
## [45500] NR_033715_exon_0_0_chrY_59213949_f 0
## -------
## seqinfo: 93 sequences from hg19 genome
Often we would are also interested in the amount of overlap, we can use
the second type of overlap join join_overlap_intersect()
to do this.
We can modify the above example, to find which CpG islands overlap exons
and how much the CpG island ranges overlap the exon ranges.
cpg3 <- join_overlap_intersect(cpg, exons)
cpg3
## GRanges object with 45500 ranges and 3 metadata columns:
## seqnames ranges strand | name.x
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 29321-29370 * | CpG:_116
## [2] chr1 135125-135563 * | CpG:_30
## [3] chr1 327791-328229 * | CpG:_29
## [4] chr1 327791-328229 * | CpG:_29
## [5] chr1 327791-328229 * | CpG:_29
## ... ... ... ... . ...
## [45496] chrY 59213949-59214117 * | CpG:_36
## [45497] chrY 59213949-59214117 * | CpG:_36
## [45498] chrY 59213949-59214117 * | CpG:_36
## [45499] chrY 59213949-59214117 * | CpG:_36
## [45500] chrY 59213949-59214117 * | CpG:_36
## name.y score
## <character> <numeric>
## [1] NR_024540_exon_10_0_chr1_29321_r 0
## [2] NR_039983_exon_0_0_chr1_134773_r 0
## [3] NR_028322_exon_2_0_chr1_324439_f 0
## [4] NR_028325_exon_2_0_chr1_324439_f 0
## [5] NR_028327_exon_3_0_chr1_327036_f 0
## ... ... ...
## [45496] NM_001145149_exon_0_0_chrY_59213949_f 0
## [45497] NM_001185183_exon_0_0_chrY_59213949_f 0
## [45498] NM_005638_exon_0_0_chrY_59213949_f 0
## [45499] NR_033714_exon_0_0_chrY_59213949_f 0
## [45500] NR_033715_exon_0_0_chrY_59213949_f 0
## -------
## seqinfo: 93 sequences from hg19 genome
Finally, we sometimes would like to know where overlap occurs but also return
any other ranges that are not overlapping. This is this the final type
of overlap join join_overlap_left()
.
cpg4 <- join_overlap_left(cpg, exons)
cpg4
## GRanges object with 55346 ranges and 3 metadata columns:
## seqnames ranges strand | name.x
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 28736-29810 * | CpG:_116
## [2] chr1 135125-135563 * | CpG:_30
## [3] chr1 327791-328229 * | CpG:_29
## [4] chr1 327791-328229 * | CpG:_29
## [5] chr1 327791-328229 * | CpG:_29
## ... ... ... ... . ...
## [55342] chrY 59213795-59214183 * | CpG:_36
## [55343] chrY 59213795-59214183 * | CpG:_36
## [55344] chrY 59213795-59214183 * | CpG:_36
## [55345] chrY 59213795-59214183 * | CpG:_36
## [55346] chrY 59349267-59349574 * | CpG:_29
## name.y score
## <character> <numeric>
## [1] NR_024540_exon_10_0_chr1_29321_r 0
## [2] NR_039983_exon_0_0_chr1_134773_r 0
## [3] NR_028322_exon_2_0_chr1_324439_f 0
## [4] NR_028325_exon_2_0_chr1_324439_f 0
## [5] NR_028327_exon_3_0_chr1_327036_f 0
## ... ... ...
## [55342] NM_001185183_exon_0_0_chrY_59213949_f 0
## [55343] NM_005638_exon_0_0_chrY_59213949_f 0
## [55344] NR_033714_exon_0_0_chrY_59213949_f 0
## [55345] NR_033715_exon_0_0_chrY_59213949_f 0
## [55346] <NA> <NA>
## -------
## seqinfo: 94 sequences from 2 genomes (NA, hg19)
The type of overlap can be modified, by adding suffixes to the function call
for example, to find overlaps that are within we could use
join_overlap_inner_within()
.
Exercises
- Create a new GRanges object, that has exons that are completely within Enhancers elements of hesc. How many exons are there?
- Use
join_overlap_intersect()
to filter exons if at least 50 per cent of their bases are overlapped by enhancer elements. - Count the number of each enhancer element type that are exonic.
There are several ways of doing this, but see if you can come up with a
solution using
join_overlap_left()
+disjoin_ranges()
.
Summarising coverage with compute_coverage()
Often we are interested in counting the number of features over
the genome that overlap each other. In plyranges
this is
done with the compute_coverage()
function in combination with other functions
to manipulate the results. This function will always return a new GRanges
with a column called score which contains the coverage value over a range.
Coverage values
By default, compute_coverage()
returns all coverage scores even if there is
no coverage over a genomic region.
cvg <- exons %>%
compute_coverage()
cvg
## GRanges object with 369095 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr1 1-11873 * | 0
## [2] chr1 11874-12227 * | 1
## [3] chr1 12228-12612 * | 0
## [4] chr1 12613-12721 * | 1
## [5] chr1 12722-13220 * | 0
## ... ... ... ... . ...
## [369091] chrY 59338860-59340193 * | 0
## [369092] chrY 59340194-59340278 * | 1
## [369093] chrY 59340279-59342486 * | 0
## [369094] chrY 59342487-59343488 * | 2
## [369095] chrY 59343489-59373566 * | 0
## -------
## seqinfo: 93 sequences from an unspecified genome
cvg %>% filter(score > 0)
## GRanges object with 190925 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr1 11874-12227 * | 1
## [2] chr1 12613-12721 * | 1
## [3] chr1 13221-14361 * | 1
## [4] chr1 14362-14409 * | 2
## [5] chr1 14410-14829 * | 1
## ... ... ... ... . ...
## [190921] chrY 59337120-59337236 * | 2
## [190922] chrY 59337949-59338150 * | 2
## [190923] chrY 59338754-59338859 * | 2
## [190924] chrY 59340194-59340278 * | 1
## [190925] chrY 59342487-59343488 * | 2
## -------
## seqinfo: 93 sequences from an unspecified genome
Also note that computing coverage will result in a GRanges that is disjoint,
which you can check with isDisjoint()
from GenomicRanges.
Coverage histogram
We can summarise()
the coverage scores as a histogram over all chromosomes
in a genome using group_by()
.
We count the total number of bases (that is, the width of the range) over each chromosome over each coverage score.
cvg_hist <- cvg %>%
group_by(seqnames, score) %>%
summarise(count = sum(width))
We can also compute the coverage histogram over the entire genome by grouping over the scores. Note that we coerce the width to a numeric to avoid integer overflow, when summing.
cvg_hist_all <- cvg %>%
mutate(total_length = as.numeric(width)) %>%
group_by(score) %>%
summarise(
seqnames = "genome",
count = sum(total_length)
)
Then the results can be put together using rbind()
or plotted using
your favourite plotting package.
A more complicated example
Here we perform another example where we find exons with no coverage that overlap cpg islands. To compose a chain first we compute the coverage then use a filter to restrict to zero coverage ranges. Then we perform an inner overlap join to find which cpg islands overlap the exonic regions with zero coverage.
overlaps <- exons %>%
compute_coverage() %>%
filter(score == 0L) %>%
join_overlap_inner(., cpg)
overlaps
## GRanges object with 29354 ranges and 2 metadata columns:
## seqnames ranges strand | score name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr1 24892-29320 * | 0 CpG:_116
## [2] chr1 29371-34610 * | 0 CpG:_116
## [3] chr1 368598-621095 * | 0 CpG:_84
## [4] chr1 368598-621095 * | 0 CpG:_99
## [5] chr1 368598-621095 * | 0 CpG:_94
## ... ... ... ... . ... ...
## [29350] chrY 27879536-59100456 * | 0 CpG:_32
## [29351] chrY 27879536-59100456 * | 0 CpG:_25
## [29352] chrY 59115124-59213948 * | 0 CpG:_36
## [29353] chrY 59214118-59222126 * | 0 CpG:_36
## [29354] chrY 59343489-59373566 * | 0 CpG:_29
## -------
## seqinfo: 93 sequences from an unspecified genome
We can also compute the coverage histogram of exons over cpg islands and then plot results as an ecdf.
cvg_over_exons <- exons %>%
compute_coverage() %>%
join_overlap_inner(., cpg) %>%
mutate(
total_length = as.numeric(width),
genome_length = sum(total_length)
) %>%
group_by(score, genome_length) %>%
summarise(
count = sum(total_length),
fraction = sum(total_length) / unique(genome_length)
)
cvg_over_exons
## DataFrame with 24 rows and 4 columns
## score genome_length count fraction
## <integer> <numeric> <numeric> <NumericList>
## 1 0 3410479622 3401251108 0.997294071502299
## 2 1 3410479622 6145937 0.00180207410135348
## 3 2 3410479622 1802697 0.000528575801588531
## 4 3 3410479622 649594 0.000190469984283049
## 5 4 3410479622 267055 7.83042356498209e-05
## ... ... ... ... ...
## 20 20 3410479622 132 3.87042336064719e-08
## 21 23 3410479622 7676 2.25070982699453e-06
## 22 24 3410479622 169 4.95531475719224e-08
## 23 25 3410479622 161 4.72074364442574e-08
## 24 77 3410479622 1157 3.39248471838545e-07
plot(1 - cumsum(fraction) ~ score,
data = cvg_over_exons,
xlim = c(0,30),
type= "s",
xlab = "coverage",
ylab = "fraction of bp > coverage")
Learning more
We have touched the surface of what’s possible with plyranges and GenomicRanges. A good starting point to learn more about GenomicRanges is here and plyranges is here.
If you have any questions about the use of these packages, please ask a question on support.bioconductor.org or post an issue on github.
The solutions to the exorcises can be found here.