R is a very popular open-source statistical programming language, with lots of interesting features and challenging quirks. R is used in many aspects of data analysis, by people ranging from students in small academic groups to professional engineers in the largest social media companies. This tutorial takes us from the basics of R to advanced features that are particularly interesting to scientists and engineers. We’ll talk about ‘atomic’ vectors, functional and vectorized computations, R’s unique class systems, visualization, extending R to process large data, and literate programming. A lot of material for a couple of hours, but it’ll be fun.

1 Introduction: R and Bioconductor

Today

Using RStudio and Amazon Machine Instances
Quick tour, emphasizing technical rather than practical / domain-specific aspects

https://r-project.org
Statistical (domain-specific) programming language
Interpretted
Extensible via user-contributed modules

Bioconductor

https://bioconductor.org; https://support.bioconductor.org
Collection (> 1000) of R packages for the analysis and comprehension of high-throughput genomic data

Basics of R

‘Atomic’ vectors (raw, logical, character, integer, numeric, complex, list)

1 + 2

## [1] 3

rnorm(10)

##  [1] -0.44434257  2.25297871 -1.91564662 -0.01458099  0.37155174
##  [6] -0.60815538  1.81692434  0.15781449 -1.23618066 -1.74224901

x <- rnorm(1000)
y <- x + rnorm(1000, sd=.5)
plot(y ~ x)

Vector operations are efficient; no need for loops!
In-memory data model, making scalability a challenge

More complicated data structures

## data.frame: rectangular, homogeneous within but not between columns
df <- data.frame(X=x, Y=y)
head(df)

##             X          Y
## 1  1.81963041  1.6082495
## 2 -0.90109341 -1.2308712
## 3  0.80206844  1.0635343
## 4  0.04485260  0.1397213
## 5  0.65105337  0.2969279
## 6 -0.07158132  1.0479738

## 'S3' class; complicated structure managed via API
fit <- lm(Y ~ X, df)
anova(fit)

## Analysis of Variance Table
## 
## Response: Y
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## X           1 1006.78 1006.78  3930.7 < 2.2e-16 ***
## Residuals 998  255.62    0.26                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(Y ~ X, df, asp=1)
abline(fit, lwd=2, col="cyan")

Statistical concepts, e.g.,
- NA: missing values
- factor(): vector of finite levels, e.g., ‘Male’, ‘Female’, ‘Hermaphrodite’
- ~: Formula

Packages: Core and user contributed; usually domain-specific rather than ‘infrastructure’

library(ggplot2)    # part of the 'Hadleyverse'
ggplot(df, aes(x=X, y=Y)) + geom_point() + geom_smooth()

## Warning in grid.Call.graphics(L_polygon, x$x, x$y, index): semi-
## transparency is not supported on this device: reported only once per page

Help
- ?rnorm: man pages with technical details
- Vignettes: literate programming documents illustrating overall use
Simple and advanced statistical analysis intrinsic to R: descriptive statistics, simple and advanced linear models, machine learning, …
- stats::kmeans(): a basic implementation, in a core package
- kernlab::kkmeans(): a kernel-based implementation, in a contributed package – check out the vignette, vignette("kernlab")

2 Literate Programming

R-flavored markdown, .Rmd files

A great way to integrate analysis and presentation
Standard markdown
R code chunks
```
```{r}
x <- rnorm(100)
```
```
‘knit’ from Rmd to md: replace R code chunks with evaluated equivalent, as markdown
Process from md to HTML, PDF, …
Like this document!
Cheat sheet

3 Class systems

Very different from C++ / Java

Informal classes – a list() with a class attribute declaring the list to be a member of a class hierarchy. Single inheritance.
```
me = structure(list(first="Martin", last="Morgan"),
               class="people")
```

‘Generic’ and associated ‘methods’

print         # generic

## function (x, ...) 
## UseMethod("print")
## <bytecode: 0xfef9318>
## <environment: namespace:base>

print.people = function(x, ...)
    cat("who:\n ", paste(x$first, x$last, "\n "))
me

## who:
##   Martin Morgan 
##

R works on vectors, so model columns rather than rows!

team = function(name, first, last)
    structure(list(name=name, first=first, last=last),
              class=c("team", "people"))

print.team = function(x) {
    cat("team name:", x$name,
        "\nteam size:", length(x$first),
        "\n")
    NextMethod()
}

bioc = team("Bioconductor Core Team",
            c("Martin", "Valerie", "Dan", "Herve"),
            c("Morgan", "Obenchain", "Tenenbaum", "Pages"))
bioc

## team name: Bioconductor Core Team 
## team size: 4 
## who:
##   Martin Morgan 
##   Valerie Obenchain 
##   Dan Tenenbaum 
##   Herve Pages 
##

Formal class definition; introspection
Multiple inheritance, multiple dispatch

Much more complicated data structures possible

.People = setClass("People",
    slots=c(first="character", last="character"),
    validity=function(object) {
        tst = character()
        if (length(object@first) != length(object@last))
            tst = c(tst, "'first' and 'last' must have same length")
        ## other validity checks...
        if (length(tst)) tst else NULL
    })

setMethod("show", "People", function(object) {
    cat("who:\n ", paste(object@first, object@last, "\n "))
})

## [1] "show"

.Team = setClass("Team",
    contains="People",
    slots=c(name="character"))

Team = function(name, people) {
    ## constructor, then generate class
    members = strsplit(people, "[[:space:]]+")
    .Team(name=name,
          first=sapply(members, `[[`, 1),
          last=sapply(members, `[[`, 2))
}

setMethod("show", "Team", function(object) {
    cat("team name:", object@name,
        "\nteam size:", length(object@first),
        "\n")
    callNextMethod()
})

## [1] "show"

Team("Bioconductor Core Team",
     c("Martin Morgan", "Valerie Obenchain", "Dan Tenenbaum",
       "Herve Pages"))

## team name: Bioconductor Core Team 
## team size: 4 
## who:
##   Martin Morgan 
##   Valerie Obenchain 
##   Dan Tenenbaum 
##   Herve Pages 
##

Advanced: what about R itself?

Implemented in C

All accessible to the advanced R developer

R.home("include/Rinternals.h")

## [1] "/home/mtmorgan/bin/R-3-2-branch/include/Rinternals.h"

Vector types are S-expressions that wrap metadata around base types

Different types

#define NILSXP       0    /* nil = NULL */
#define SYMSXP       1    /* symbols */
...
#define CLOSXP       3    /* closures */
...
#define LGLSXP      10    /* logical vectors */
#define INTSXP      13    /* integer vectors */
#define REALSXP     14    /* real variables */
#define CPLXSXP     15    /* complex variables */
...

Basic data structure

struct sxpinfo_struct {
    SEXPTYPE type      :  TYPE_BITS;
    ...
    unsigned int named :  2;  /* reference counter - 0, 1, >1
    ...
    unsigned int gcgen :  1;  /* old generation number */
    ...
}; /*           Tot: 32 */

struct vecsxp_struct {
    R_len_t length;
    R_len_t truelength;
};

typedef struct VECTOR_SEXPREC {
    SEXPREC_HEADER;
    struct vecsxp_struct vecsxp;
} VECTOR_SEXPREC, *VECSEXP;

4 Large Data

4.1 Basics

Efficient R code

R is fast…

x <- runif(1000000)
system.time(log(x))

##    user  system elapsed 
##   0.046   0.000   0.046

…but it’s easy to fall off the wagon

library(microbenchmark)
f = function(n) {
    y <- numeric()
    for (i in seq_len(n))
        y[i] = log(x[i])
    y
}
microbenchmark(f(1000), f(10000), f(20000), times=10)

## Unit: milliseconds
##      expr        min         lq       mean     median         uq
##   f(1000)   2.639494   2.716821   3.310221   2.771704   2.972985
##  f(10000) 151.727198 152.250440 227.279112 153.954549 158.203007
##  f(20000) 577.658969 584.645653 693.815064 588.491838 945.794270
##         max neval cld
##    5.559355    10 a  
##  523.111300    10  b 
##  948.209794    10   c

Primary sins
- Iteration, instead of vectorization
- Row-wise data.frame operations instead of column-wise and matrix operations
- ‘Copy-and-append’ (like above) rather than pre-allocated and fill

Algorithms

Naive algorithms are typically polynomial or worse
More sophisticated algorithms are N log N or similar
Example: overlap of ‘query’ ranges with ‘subject’ ranges
- Naive: nested for loops test all combinations
- Nuanced: overlaps between nested containment lists, implemented in IRanges::findOverlaps()
Case Study

4.2 Scalability

Need to consider R’s memory and computational models
Solution:
- Chunk-wise (e.g., millions of rows per chunk) processing
- Manage memory use for in-memory model
- Enable vectorization for efficient computation
- Often requires re-implementation (or use of packages that re-implement, e.g., biglm) of algorithms

4.3 Parallel evaluation

R is single-threaded (!)
Some aspects are easy to parallelize, e.g., parallel BLAS library for linear algebra

Many R programs have map-reduce-like semantics

FUN <- function(i) { Sys.sleep(1); i }
lapply(1:5, FUN)

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 5

system.time(lapply(1:5, FUN))

##    user  system elapsed 
##   0.000   0.000   5.005

Very easy to parallelize

library(parallel)
system.time({
    mclapply(1:5, FUN, mc.cores=5)
})

##    user  system elapsed 
##   0.009   0.044   1.025

Clusters & clouds

Packages such as BatchJobs and SparkR provide solutions
Difficult for domain experts (i.e., not software engineers) to use
- Docker and other containers
- Pre-configured AMI instances

4.4 ‘Native’ implementation

Rcpp

C++ interface to R data structures
Example: index of top N elements (StackOverflow, Rcpp Gallery)

z <- rnorm(1e6)
top_n <- function(x, n)
    tail(order(x), n)
system.time(top_n(z, 100))

##    user  system elapsed 
##    0.79    0.00    0.79

Rcpp: priority queue (file “top_i_pq.cpp”)

#include <Rcpp.h>
#include <queue>

using namespace Rcpp;
using namespace std;

// [[Rcpp::export]]
IntegerVector top_i_pq(NumericVector v, int n)
{
    typedef pair<double, int> Elt;
    priority_queue< Elt, vector<Elt>, greater<Elt> > pq;
    vector<int> result;

    for (int i = 0; i != v.size(); ++i) {
        if (pq.size() < n)
            pq.push(Elt(v[i], i));
        else {
            Elt elt = Elt(v[i], i);
            if (pq.top() < elt) {
                pq.pop();
                pq.push(elt);
            }
        }
    }

    result.reserve(pq.size());
    while (!pq.empty()) {
        result.push_back(pq.top().second + 1);
        pq.pop();
    }

    return wrap(result);
}

Compile (sourceCpp() is a party trick; instead build library and load in R)

sourceCpp("top_i_pq.cpp")
top_i_pq

## function (v, n) 
## .Primitive(".Call")(<pointer: 0x7f501f3e13b0>, v, n)

Run

z <- rnorm(1e6)
res1 <- top_n(z, 100)
res2 <- top_i_pq(z, 100)
identical(res1, res2)

## [1] TRUE

microbenchmark(top_n(z, 100), top_i_pq(z, 100), times=10)

## Unit: milliseconds
##              expr        min         lq       mean     median         uq
##     top_n(z, 100) 683.298091 698.011092 732.242213 714.348213 772.386669
##  top_i_pq(z, 100)   4.712946   5.012883   5.140092   5.083448   5.245147
##         max neval cld
##  823.853972    10   b
##    5.593705    10  a

6 Case Study: Cancer Genome Atlas Gene Expression

Background

Very large federally funded initiative to gain insight into genetics of many different cancers
- One part: levels of gene expression
Recall: genes –> transcribed to mRNA –> translated to protiens
- More mRNA implies higher gene expression
Approach: high-throughput sequencing
- Isolate mRNA
- Reverse transcribed to DNA
- Fragment, and sequence each fragment
- Align fragments to genome
- Count number of fragments overlapping each gene
- Higher count –> higher expression

Upstream

Terabytes of short sequences, aligned to reference genome

Reduction

Parallel processing (in or out of R) to reduce to simple matrix of counts: genes x samples
Represent as a SummarizedExperiment
S4 class

Coordinates counts with row and column annotations, e.g., genomic ranges of each gene

## Coordinates of each known gene
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
genes <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)

## experiment data (subset) -- aligned reads; 1 file per indivdiual
library(RNAseqData.HNRNPC.bam.chr14)
library(Rsamtools)
fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
bfl <- BamFileList(fls, yieldSize=1000000)

## summarize reads overlapping each gene and individual
library(GenomicAlignments)
se <- summarizeOverlaps(genes, bfl)
head(assay(se)[rowSums(assay(se)) > 0,])

##           ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303
## 10001           114       156       129       144       175       213
## 100128927       287       315       300       243       190       160
## 100129075        36        38        35        36        20        32
## 100129794        21        34        33        23         9        16
## 100288846        10        10        11         8         2         3
## 100289511       150       125       174       114        34        23
##           ERR127304 ERR127305
## 10001           210       165
## 100128927       194       218
## 100129075        45        45
## 100129794        23        13
## 100288846         5         1
## 100289511        31        61

Analysis

A more complete sample data set

library(airway)
data(airway)

## the count matrix
head(assay(airway))

##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003        679        448        873        408       1138
## ENSG00000000005          0          0          0          0          0
## ENSG00000000419        467        515        621        365        587
## ENSG00000000457        260        211        263        164        245
## ENSG00000000460         60         55         40         35         78
## ENSG00000000938          0          0          2          0          1
##                 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003       1047        770        572
## ENSG00000000005          0          0          0
## ENSG00000000419        799        417        508
## ENSG00000000457        331        233        229
## ENSG00000000460         63         76         60
## ENSG00000000938          0          0          0

## description of the experiment (columns of the count matrix)
colData(airway)[, 1:3]

## DataFrame with 8 rows and 3 columns
##            SampleName     cell      dex
##              <factor> <factor> <factor>
## SRR1039508 GSM1275862   N61311    untrt
## SRR1039509 GSM1275863   N61311      trt
## SRR1039512 GSM1275866  N052611    untrt
## SRR1039513 GSM1275867  N052611      trt
## SRR1039516 GSM1275870  N080611    untrt
## SRR1039517 GSM1275871  N080611      trt
## SRR1039520 GSM1275874  N061011    untrt
## SRR1039521 GSM1275875  N061011      trt

Exploratory analysis, e.g., clustering to ask whether there are distinct groups, outliers, etc. Multidimensional scaling using euclidean distance between asinh-transformed counts

## transpose assay data (t), asinh transform, euclidean distance
## between samples (dist). Use classical multidimensional scaling for
## dimension reduction
cmd <- cmdscale(dist(asinh(t(assay(airway)))))

## solid circles (pch=20) at 4 times usual size (cex=4); col points based
## on dex treatment (col=airway$dex)
plot(cmd, col=airway$dex, pch=20, cex=4)

Assess each gene for differential expression. Advanced statistical model. Efficient implementation in C.

library(DESeq2)
airway <- airway[rowSums(assay(airway)) > 0,]
## specifiy experimental design
dds <- DESeqDataSet(airway, ~ cell + dex)

## 'normalize' columns for techincal artifacts; fit negative
## binomial model to each gene
dds <- DESeq(dds)

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

## summarize results
res <- results(dds)
res[head(order(res$padj)),]

## log2 fold change (MAP): dex untrt vs trt 
## Wald test p-value: dex untrt vs trt 
## DataFrame with 6 rows and 6 columns
##                   baseMean log2FoldChange     lfcSE      stat
##                  <numeric>      <numeric> <numeric> <numeric>
## ENSG00000152583   997.4398      -4.316100 0.1724125 -25.03357
## ENSG00000165995   495.0929      -3.188698 0.1277441 -24.96160
## ENSG00000101347 12703.3871      -3.618232 0.1499441 -24.13054
## ENSG00000120129  3409.0294      -2.871326 0.1190334 -24.12202
## ENSG00000189221  2341.7673      -3.230629 0.1373643 -23.51870
## ENSG00000211445 12285.6151      -3.552999 0.1589971 -22.34631
##                        pvalue          padj
##                     <numeric>     <numeric>
## ENSG00000152583 2.636198e-138 4.888829e-134
## ENSG00000165995 1.598068e-137 1.481809e-133
## ENSG00000101347 1.195444e-128 6.808712e-125
## ENSG00000120129 1.468582e-128 6.808712e-125
## ENSG00000189221 2.625986e-122 9.739783e-119
## ENSG00000211445 1.311375e-110 4.053243e-107

plot(log2FoldChange ~ asinh(baseMean), res, pch=".",
     cex=ifelse(res$padj < .001, 3, 1))

Summarize results

Vignettes and workflows created using literate programming principles

Making complicated configurations accessible

R for High-Throuhhput (Genomic) Analysis

Martin Morgan (martin.morgan@roswellpark.org)

29 March 2016

Package version: CDSE2016 0.0.3

Contents