1 Abstract

Large-scale single-cell transcriptome profiling has advanced rapidly in recent years, enabling unprecedented insight into the behaviour of individual cells. In particular, single-cell RNA sequencing (scRNA-Seq) allows cell-type-specific characterization of gene expression, which helps in understanding the underlying biological processes. Jointly analysing multiple collections of scRNA-Seq data promises further biological insights that cannot be uncovered from individual datasets. However, such integrative analyses are challenging and require sophisticated methodologies.

To enable effective interrogation of multiple scRNA-Seq datasets, we have developed a novel algorithm, scMerge, that removes unwanted variation by identifying stably expressed genes and using pseudo-replicates across datasets. Biological knowledge, such as cell type information, can be easily incorporated into scMerge to further improve performance. We compared scMerge with four recent, widely used batch correction methods on seven publicly available scRNA-Seq data collections covering different tissues, species and protocols. scMerge effectively removed batch and dataset-specific effects across a wide range of biological systems, indicating that it performs well in multiple scenarios and supports downstream biological discovery, including the inference of cell developmental trajectories.
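
The idea of combining stably expressed genes with pseudo-replicates can be illustrated with a small factor-model sketch. The Python code below is not the scMerge implementation (scMerge is an R package) and does not reproduce its API. It is a minimal illustration, assuming a RUV-III-style correction in which stably expressed genes act as negative controls and cells sharing a label act as pseudo-replicates. The names `ruv3_sketch` and `batch_gap`, and all simulated data, are hypothetical and exist only for this example.

```python
import numpy as np


def ruv3_sketch(Y, replicate_ids, control_idx, k=1):
    """Illustrative RUV-III-style correction (an assumption, not scMerge's code).

    Y             : cells x genes matrix of log-expression values
    replicate_ids : one label per cell; cells sharing a label are treated
                    as pseudo-replicates of each other
    control_idx   : column indices of stably expressed (control) genes
    k             : number of unwanted-variation factors to remove
    """
    Y = np.asarray(Y, dtype=float)
    Yc = Y - Y.mean(axis=0)  # centre each gene

    # Replicate design matrix M: one indicator column per pseudo-replicate set.
    _, inverse = np.unique(np.asarray(replicate_ids).ravel(), return_inverse=True)
    M = np.zeros((Y.shape[0], inverse.max() + 1))
    M[np.arange(Y.shape[0]), inverse] = 1.0

    # Residuals after removing each replicate set's mean: what remains is
    # variation among cells that should look alike, i.e. unwanted variation.
    group_means = (M.T @ Yc) / M.sum(axis=0)[:, None]
    resid = Yc - M @ group_means

    # Gene loadings of the unwanted factors: top-k right singular vectors.
    _, _, Vt = np.linalg.svd(resid, full_matrices=False)
    alpha = Vt[:k]  # shape (k, genes)

    # Per-cell factor values, estimated from the control genes only.
    alpha_c = alpha[:, control_idx]
    W = Yc[:, control_idx] @ alpha_c.T @ np.linalg.inv(alpha_c @ alpha_c.T)

    return Y - W @ alpha


def batch_gap(Y, batch, cell_type):
    """Mean absolute difference between batch means, within each cell type."""
    gaps = []
    for t in np.unique(cell_type):
        a = Y[(cell_type == t) & (batch == 0)].mean(axis=0)
        b = Y[(cell_type == t) & (batch == 1)].mean(axis=0)
        gaps.append(np.abs(a - b).mean())
    return float(np.mean(gaps))


# Simulated toy data: 2 batches x 2 cell types x 40 cells, 50 genes.
rng = np.random.default_rng(0)
n_genes, n_ctl = 50, 20
batch = np.repeat([0, 1], 80)
cell_type = np.tile(np.repeat([0, 1], 40), 2)
control_idx = np.arange(n_ctl)  # first 20 genes carry no cell-type signal

type_effect = np.zeros(n_genes)
type_effect[n_ctl:] = rng.normal(0, 2, n_genes - n_ctl)
batch_loading = rng.normal(0, 1, n_genes)  # batch effect touches every gene

Y = (5.0
     + np.outer(cell_type, type_effect)
     + np.outer(2 * batch - 1, batch_loading)
     + rng.normal(0, 0.1, (160, n_genes)))

# Here the known cell type defines the pseudo-replicate sets.
Y_corrected = ruv3_sketch(Y, cell_type, control_idx, k=1)
before = batch_gap(Y, batch, cell_type)
after = batch_gap(Y_corrected, batch, cell_type)
print(f"mean within-type batch gap: before={before:.3f}, after={after:.3f}")
```

In this toy setting the pseudo-replicate sets come from known cell-type labels, mirroring the remark that cell type information can be incorporated when it is available. Without such labels, the replicate sets would have to be inferred from the data, for example by clustering within each batch and matching clusters across batches.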

1.1 Software Availability

The scMerge R package is available at https://sydneybiox.github.io/scMerge.