VcfStack and RangedVcfStack Objects
VcfStack-class.Rd
The VcfStack
class is a vector of related VCF files, for instance
each file representing a separate chromosome. The class helps manage these
files as a group. The RangedVcfStack
class extends VcfStack
by
associating genomic ranges of interest to the collection of VCF files.
Constructor
VcfStack(files=NULL, seqinfo=NULL, colData=NULL, index=TRUE, check=TRUE)
Creates a VcfStack object.files
A VcfFilelist object. If a VcfFile or character vector is given a VcfFileList will be coerced. The character vector should be files paths pointing to VCF files. The character vector must be named, with names correspond to seqnames in each VCF file.
seqinfo
A Seqinfo object describing the levels genome and circularity of each sequence.
colData
An optional DataFrame describing each sample in the VcfStack. When present, row names must correspond to sample names in the VCF file.
index
A logical indicating if the vcf index files should be created.
check
A logical indicating if the check across samples should be performed
RangedVcfStack(vs=NULL, rowRanges=NULL)
Creates a RangedVcfStack object.vs
A
VcfStack
object.rowRanges
An optional GRanges object associating the genomic ranges of interest to the collection of VCF files. The seqnames of
rowRanges
are a subset ofseqnames(vs)
. If missing, a default is created from theseqinfo
object of the providedVcfStack
.
Accessors
In the code below, x
is a VcfStack or RangedVcfStack object.
- dim(x)
Get the number of files and samples in the
VcfStack
object.- colnames(x, do.NULL=TRUE, prefix="col")
Get the sample names in the
VcfStack
.- rownames(x), do.NULL=TRUE, prefix="row")
Get the names of the files in
VcfStack
.- dimnames(x))
Get the names of samples and the names of files in
VcfStack
.- files(x, ...), files(x, ..., check=TRUE) <- value
Get or set the files on
x
.value
can be a named character() of file paths or a VcfFileList. The return value will be a VcfFileList.- seqinfo(x), seqinfo(x, new2old = NULL, pruning.mode = c("error", "coarse", "fine", "tidy")) <- value
Get or set the seqinfo on
x
. See seqinfo<- for details onnew2old
andpruning.mode
.- seqlevelsStyle(x) <- value
Set the seqlevels according to the supplied style. File names and rowRanges will also be updated if applicable. See seqlevelsStyle<- for more details.
- colData(x), colData(x, ...) <- value
Get or set the
colData
onx
.value
is a DataFrame.- rowRanges(x), rowRanges(x, ...) <- value
Get or set the
rowRanges
onx
.x
has to be aRangedVcfStack
object.value
is a GRanges.
Methods
In the code below, x
is a VcfStack or RangedVcfStack
object. i
is a GRanges object,
character() vector of seqnames,
numeric() vector, logical() vector, or can be missing. For a
RangedVcfStack object, assay and readVcfStack will use the associated
rowRanges
object for i
.
- vcfFields(x)
Returns a
CharacterList
of all available VCF fields, with names offixed
,info
,geno
andsamples
indicating the four categories. Each element is a character() vector of available VCF field names within each category.- assay(x, i, ..., BPPARAM=bpparam())
Get matrix of genotype calls from the VCF files. See genotypeToSnpMatrix. Argument
i
specifies which files to read.BPPARAM
is the argument to the bpmapply.- readVcfStack(x, i, j=colnames(x), param=ScanVcfParam())
Get content of VCF files in the VcfStack.
i
indicates which files to read.j
can be missing or a character() vector of sample names (see samples) present in the VCF files.param
is a ScanVcfParam object. Ifparam
is usedi
andj
are ignored.- show(object)
Display abbreviated information about
VcfStack
orRangedVcfStack
object.
Subsetting
In the code below, x
is a VcfStack or RangedVcfStack
object.
- x[i, j]
Get elements from ranges
i
and samplesj
as a VcfStack or RangedVcfStack object. Note: for aRangedVcfStack
, therowRanges
object will also be subset.i
can be missing, a character() vector of seqnames, numeric() vector of indexes, logical() orGRanges
object. Wheni
is aGRanges
object,seqnames(i)
is then used to subset the files in the VcfStack.j
can be missing, a character() vector of sample names, a numeric(), logical() vector.
Helpers
- getVCFPath(vs, chrtok)
Deprecated. Use
files(vs)[chrtok]
instead.- paths1kg(chrtoks)
Translate seqnames
chrtoks
to 1000 genomes genotype VCF urls.
Author
Lori Shepherd mailto:Lori.Shepherd@RoswellPark.org and Martin Morgan mailto:Martin.Morgan@RoswellPark.org
Examples
## ---------------------------------------------------------------------
## CONSTRUCTION
## ---------------------------------------------------------------------
## point to VCF files and add names corresponding to the sequence
## present in the file
extdata <- system.file(package="GenomicFiles", "extdata")
files <- dir(extdata, pattern="^CEUtrio.*bgz$", full=TRUE)
names(files) <- sub(".*_([0-9XY]+).*", "\\1", basename(files))
## input data.frame describing the length of each sequence, coerce to
## 'Seqinfo' object
seqinfo <- as(readRDS(file.path(extdata, "seqinfo.rds")), "Seqinfo")
stack <- VcfStack(files, seqinfo)
stack
#> VcfStack object with 7 files and 3 samples
#> Seqinfo object with 25 sequences from hg19 genome
#> use 'readVcfStack()' to extract VariantAnnotation VCF.
## Use seqinfo from VCF files instead of explict value
stack2 <- VcfStack(files)
rstack <- RangedVcfStack(stack)
gr <- GRanges(c("7:1-159138000", "X:1-155270560"))
rstack2 <- RangedVcfStack(stack, gr)
rstack2
#> VcfStack object with 7 files and 3 samples
#> GRanges object with 2 ranges and 0 metadata columns
#> Seqinfo object with 25 sequences from hg19 genome
#> use 'readVcfStack()' to extract VariantAnnotation VCF.
## ---------------------------------------------------------------------
## ACCESSORS
## ---------------------------------------------------------------------
dim(stack)
#> [1] 7 3
colnames(stack)
#> [1] "NA12878" "NA12891" "NA12892"
rownames(stack)
#> [1] "11" "20" "21" "22" "7" "X" "Y"
dimnames(stack)
#> [[1]]
#> [1] "11" "20" "21" "22" "7" "X" "Y"
#>
#> [[2]]
#> [1] "NA12878" "NA12891" "NA12892"
#>
head(files(stack))
#> VcfFileList of length 6
#> names(6): 11 20 21 22 7 X
seqinfo(stack)
#> Seqinfo object with 25 sequences from hg19 genome:
#> seqnames seqlengths isCircular genome
#> 1 249250621 <NA> hg19
#> 2 243199373 <NA> hg19
#> 3 198022430 <NA> hg19
#> 4 191154276 <NA> hg19
#> 5 180915260 <NA> hg19
#> ... ... ... ...
#> 21 48129895 <NA> hg19
#> 22 51304566 <NA> hg19
#> X 155270560 <NA> hg19
#> Y 59373566 <NA> hg19
#> MT 16569 <NA> hg19
colData(stack)
#> DataFrame with 3 rows and 0 columns
## ---------------------------------------------------------------------
## METHODS
## ---------------------------------------------------------------------
readVcfStack(stack, i=GRanges("20:862167-62858306"))
#> class: CollapsedVCF
#> dim: 317 3
#> rowRanges(vcf):
#> GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
#> info(vcf):
#> DataFrame with 26 columns: AC, AF, AN, BaseQRankSum, CCC, ClippingRankSum,...
#> info(header(vcf)):
#> Number Type Description
#> AC A Integer Allele count in genotypes, for each ALT al...
#> AF A Float Allele Frequency, for each ALT allele, in ...
#> AN 1 Integer Total number of alleles in called genotypes
#> BaseQRankSum 1 Float Z-score from Wilcoxon rank sum test of Alt...
#> CCC 1 Integer Number of called chromosomes
#> ClippingRankSum 1 Float Z-score From Wilcoxon rank sum test of Alt...
#> DB 0 Flag dbSNP Membership
#> DP 1 Integer Approximate read depth; some reads may hav...
#> DS 0 Flag Were any of the samples downsampled?
#> END 1 Integer Stop position of the interval
#> FS 1 Float Phred-scaled p-value using Fisher's exact ...
#> GQ_MEAN 1 Float Mean of all GQ values
#> GQ_STDDEV 1 Float Standard deviation of all GQ values
#> HWP 1 Float P value from test of Hardy Weinberg Equili...
#> HaplotypeScore 1 Float Consistency of the site with at most two s...
#> InbreedingCoeff 1 Float Inbreeding coefficient as estimated from t...
#> MLEAC A Integer Maximum likelihood expectation (MLE) for t...
#> MLEAF A Float Maximum likelihood expectation (MLE) for t...
#> MQ 1 Float RMS Mapping Quality
#> MQ0 1 Integer Total Mapping Quality Zero Reads
#> MQRankSum 1 Float Z-score From Wilcoxon rank sum test of Alt...
#> NCC 1 Integer Number of no-called samples
#> QD 1 Float Variant Confidence/Quality by Depth
#> ReadPosRankSum 1 Float Z-score from Wilcoxon rank sum test of Alt...
#> SOR 1 Float Symmetric Odds Ratio of 2x2 contingency ta...
#> set 1 String Source VCF for the merged record in Combin...
#> geno(vcf):
#> List of length 9: GT, AD, DP, GQ, MIN_DP, PGT, PID, PL, SB
#> geno(header(vcf)):
#> Number Type Description
#> GT 1 String Genotype
#> AD . Integer Allelic depths for the ref and alt alleles in the o...
#> DP 1 Integer Approximate read depth (reads with MQ=255 or with b...
#> GQ 1 Integer Genotype Quality
#> MIN_DP 1 Integer Minimum DP observed within the GVCF block
#> PGT 1 String Physical phasing haplotype information, describing ...
#> PID 1 String Physical phasing ID information, where each unique ...
#> PL G Integer Normalized, Phred-scaled likelihoods for genotypes ...
#> SB 4 Integer Per-sample component statistics which comprise the ...
i <- GRanges(c("20:862167-62858306", "7:1-159138000"))
readVcfStack(stack, i=i, j="NA12891")
#> class: CollapsedVCF
#> dim: 318 1
#> rowRanges(vcf):
#> GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
#> info(vcf):
#> DataFrame with 26 columns: AC, AF, AN, BaseQRankSum, CCC, ClippingRankSum,...
#> info(header(vcf)):
#> Number Type Description
#> AC A Integer Allele count in genotypes, for each ALT al...
#> AF A Float Allele Frequency, for each ALT allele, in ...
#> AN 1 Integer Total number of alleles in called genotypes
#> BaseQRankSum 1 Float Z-score from Wilcoxon rank sum test of Alt...
#> CCC 1 Integer Number of called chromosomes
#> ClippingRankSum 1 Float Z-score From Wilcoxon rank sum test of Alt...
#> DB 0 Flag dbSNP Membership
#> DP 1 Integer Approximate read depth; some reads may hav...
#> DS 0 Flag Were any of the samples downsampled?
#> END 1 Integer Stop position of the interval
#> FS 1 Float Phred-scaled p-value using Fisher's exact ...
#> GQ_MEAN 1 Float Mean of all GQ values
#> GQ_STDDEV 1 Float Standard deviation of all GQ values
#> HWP 1 Float P value from test of Hardy Weinberg Equili...
#> HaplotypeScore 1 Float Consistency of the site with at most two s...
#> InbreedingCoeff 1 Float Inbreeding coefficient as estimated from t...
#> MLEAC A Integer Maximum likelihood expectation (MLE) for t...
#> MLEAF A Float Maximum likelihood expectation (MLE) for t...
#> MQ 1 Float RMS Mapping Quality
#> MQ0 1 Integer Total Mapping Quality Zero Reads
#> MQRankSum 1 Float Z-score From Wilcoxon rank sum test of Alt...
#> NCC 1 Integer Number of no-called samples
#> QD 1 Float Variant Confidence/Quality by Depth
#> ReadPosRankSum 1 Float Z-score from Wilcoxon rank sum test of Alt...
#> SOR 1 Float Symmetric Odds Ratio of 2x2 contingency ta...
#> set 1 String Source VCF for the merged record in Combin...
#> geno(vcf):
#> List of length 9: GT, AD, DP, GQ, MIN_DP, PGT, PID, PL, SB
#> geno(header(vcf)):
#> Number Type Description
#> GT 1 String Genotype
#> AD . Integer Allelic depths for the ref and alt alleles in the o...
#> DP 1 Integer Approximate read depth (reads with MQ=255 or with b...
#> GQ 1 Integer Genotype Quality
#> MIN_DP 1 Integer Minimum DP observed within the GVCF block
#> PGT 1 String Physical phasing haplotype information, describing ...
#> PID 1 String Physical phasing ID information, where each unique ...
#> PL G Integer Normalized, Phred-scaled likelihoods for genotypes ...
#> SB 4 Integer Per-sample component statistics which comprise the ...
head(assay(stack, gr))
#> NA12878 NA12891 NA12892
#> rs760352870 1 0 0
#> X:193438_G/C 2 2 NA
#> X:386831_A/G 2 NA NA
#> X:672818_T/C NA 2 NA
#> X:1784428_C/G 0 0 2
#> X:1861132_C/T 2 NA 0
head(assay(rstack2))
#> NA12878 NA12891 NA12892
#> rs760352870 1 0 0
#> X:193438_G/C 2 2 NA
#> X:386831_A/G 2 NA NA
#> X:672818_T/C NA 2 NA
#> X:1784428_C/G 0 0 2
#> X:1861132_C/T 2 NA 0
seqlevels(stack2)
#> [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
#> [16] "16" "17" "18" "19" "20" "21" "22" "X" "Y" "MT"
rownames(stack2)
#> [1] "11" "20" "21" "22" "7" "X" "Y"
seqlevelsStyle(stack2)
#> [1] "NCBI" "Ensembl"
seqlevelsStyle(stack2) <- "UCSC"
seqlevelsStyle(stack2)
#> [1] "UCSC"
seqlevels(stack2)
#> [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
#> [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
#> [19] "chr19" "chr20" "chr21" "chr22" "chrX" "chrY" "chrM"
rownames(stack2)
#> [1] "chr11" "chr20" "chr21" "chr22" "chr7" "chrX" "chrY"
vcfFields(stack2)
#> CharacterList of length 4
#> [["fixed"]] REF ALT QUAL FILTER
#> [["info"]] AC AF AN BaseQRankSum CCC ... NCC QD ReadPosRankSum SOR set
#> [["geno"]] GT AD DP GQ MIN_DP PGT PID PL SB
#> [["samples"]] NA12878 NA12891 NA12892
## ---------------------------------------------------------------------
## SUBSETTING
## ---------------------------------------------------------------------
## select rows 4, 5, 6 and samples 1, 2
stack[4:6, 1:2]
#> VcfStack object with 3 files and 2 samples
#> Seqinfo object with 25 sequences from hg19 genome
#> use 'readVcfStack()' to extract VariantAnnotation VCF.
## select rownames "7", "11" and sample "NA12891"
stack[c("7", "11"), "NA12891"]
#> VcfStack object with 2 files and 1 samples
#> Seqinfo object with 25 sequences from hg19 genome
#> use 'readVcfStack()' to extract VariantAnnotation VCF.
stack[c("7", "11", "X"), 2:3]
#> VcfStack object with 3 files and 2 samples
#> Seqinfo object with 25 sequences from hg19 genome
#> use 'readVcfStack()' to extract VariantAnnotation VCF.
## subset with GRanges
stack[GRanges("20:862167-62858306")]
#> VcfStack object with 1 files and 3 samples
#> Seqinfo object with 25 sequences from hg19 genome
#> use 'readVcfStack()' to extract VariantAnnotation VCF.
rstack2[]
#> VcfStack object with 7 files and 3 samples
#> GRanges object with 2 ranges and 0 metadata columns
#> Seqinfo object with 25 sequences from hg19 genome
#> use 'readVcfStack()' to extract VariantAnnotation VCF.
rstack2[,1]
#> VcfStack object with 7 files and 1 samples
#> GRanges object with 2 ranges and 0 metadata columns
#> Seqinfo object with 25 sequences from hg19 genome
#> use 'readVcfStack()' to extract VariantAnnotation VCF.
## ---------------------------------------------------------------------
## HELPERS
## ---------------------------------------------------------------------
paths1kg(1:3)
#> chr1
#> "http://1000genomes.s3.amazonaws.com/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
#> chr2
#> "http://1000genomes.s3.amazonaws.com/release/20130502/ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
#> chr3
#> "http://1000genomes.s3.amazonaws.com/release/20130502/ALL.chr3.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"