SVT_SparseArray objects

The SVT_SparseArray class is a new container for efficient in-memory representation of multidimensional sparse arrays. It uses the SVT layout to represent the nonzero multidimensional data internally.

An SVT_SparseMatrix object is an SVT_SparseArray object of 2 dimensions.

Note that SVT_SparseArray and SVT_SparseMatrix objects replace the older and less efficient COO_SparseArray and COO_SparseMatrix objects.

Usage

## Constructor function:
SVT_SparseArray(x, dim=NULL, dimnames=NULL, type=NA)

Arguments

x

If dim is NULL (the default) then x must be an ordinary matrix or array, or a dgCMatrix/lgCMatrix object, or any matrix-like or array-like object that supports coercion to SVT_SparseArray.

If dim is provided then x can either be missing, a vector (atomic or list), or an array-like object. If missing, then an allzero SVT_SparseArray object will be constructed. Otherwise x will be used to fill the returned object (length(x) must be <= prod(dim)). Note that if x is an array-like object then its dimensions are ignored i.e. it's treated as a vector.

dim

NULL or the dimensions (supplied as an integer vector) of the SVT_SparseArray or SVT_SparseMatrix object to construct. If NULL (the default) then the returned object will have the dimensions of matrix-like or array-like object x.

dimnames

The dimnames of the object to construct. Must be NULL or a list of length the number of dimensions. Each list element must be either NULL or a character vector along the corresponding dimension. If both dim and dimnames are NULL (the default) then the returned object will have the dimnames of matrix-like or array-like object x.

type

A single string specifying the requested type of the object.

Normally, the SVT_SparseArray object returned by the constructor function has the same type() as x but the user can use the type argument to request a different type. Note that doing:

    svt <- SVT_SparseArray(x, type=type)

is equivalent to doing:

    svt <- SVT_SparseArray(x)
    type(svt) <- type

but the former is more convenient and will generally be more efficient.

Supported types are all R atomic types plus "list".

Details

SVT_SparseArray is a concrete subclass of the SparseArray virtual class. This makes SVT_SparseArray objects SparseArray derivatives.

The nonzero data in a SVT_SparseArray object is stored in a Sparse Vector Tree. We'll refer to this internal data representation as the SVT layout. See the "SVT layout" section below for more information.

The SVT layout is similar to the CSC layout (compressed, sparse, column-oriented format) used by CsparseMatrix derivatives from the Matrix package, like dgCMatrix or lgCMatrix objects, but with the following improvements:

The SVT layout supports sparse arrays of arbitrary dimensions.
With the SVT layout, the sparse data can be of any type. Whereas CsparseMatrix derivatives only support sparse data of type "double" or "logical" at the moment.
The SVT layout imposes no limit on the number of nonzero elements that can be stored. With dgCMatrix/lgCMatrix objects, this number must be < 2^31.
Overall, the SVT layout allows more efficient operations on SVT_SparseArray objects.

Value

An SVT_SparseArray or SVT_SparseMatrix object.

SVT layout

An SVT (Sparse Vector Tree) is a tree of depth N - 1 where N is the number of dimensions of the sparse array.

The leaves in the tree can only be of two kinds: NULL or leaf vector. Leaves that are leaf vectors can only be found at the deepest level in the tree (i.e. at depth N - 1). All leaves found at a lower depth must be NULLs.

A leaf vector represents a sparse vector along the first dimension (a.k.a. innermost or fastest moving dimension) of the sparse array. It contains a collection of offset/value pairs sorted by strictly ascending offset. More precisely, a leaf vector is represented by an ordinary list of 2 parallel dense vectors:

nzvals: a vector (atomic or list) of nonzero values (zeros are not allowed);
nzoffs: an integer vector of offsets (i.e. 0-based positions).

The 1st vector determines the type of the leaf vector i.e. "double", "integer", "logical", etc... All the leaf vectors in the SVT must have the same type as the sparse array.

It's useful to realize that a leaf vector simply represents a 1D SVT.

In SparseArray 1.5.4 a new type of leaf vector was introduced called lacunar leaf. A lacunar leaf is a non-empty leaf vector where the nzvals component is set to NULL. In this case the nonzero values are implicit: they're all considered to be equal to one.

Examples:

An SVT_SparseArray object with 1 dimension has its nonzero data stored in an SVT of depth 0. Such SVT is represented by a single leaf vector.
An SVT_SparseArray object with 2 dimensions has its nonzero data stored in an SVT of depth 1. Such SVT is represented by a list of length the extend of the 2nd dimension (number of columns). Each list element is an SVT of depth 0 (as described above), or a NULL if the corresponding column is empty (i.e. has no nonzero data).

For example, the nonzero data of an 8-column sparse matrix will be stored in an SVT that looks like this:
```
    .------------------list-of-length-8-----------------.
   /       /       /      |       |      \       \       \
  |       |       |       |       |       |       |       |
 leaf    leaf    NULL    leaf    leaf    leaf    leaf    NULL
vector  vector          vector  vector  vector  vector
```
The NULL leaves represent the empty columns (i.e. the columns with no nonzero elements).
An SVT_SparseArray object with 3 dimensions has its nonzero data stored in an SVT of depth 2. Such SVT is represented by a list of length the extend of the 3rd dimension. Each list element must be an SVT of depth 1 (as described above) that stores the nonzero data of the corresponding 2D slice, or a NULL if the 2D slice is empty (i.e. has no nonzero data).
And so on...

Examples

## ---------------------------------------------------------------------
## BASIC CONSTRUCTION
## ---------------------------------------------------------------------
SVT_SparseArray(dim=5:3)  # allzero object
#> <5 x 4 x 3 SparseArray> of type "logical" [nzcount=0 (0%)]:
#> ,,1
#>       [,1]  [,2]  [,3]  [,4]
#> [1,] FALSE FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE FALSE
#> [3,] FALSE FALSE FALSE FALSE
#> [4,] FALSE FALSE FALSE FALSE
#> [5,] FALSE FALSE FALSE FALSE
#> 
#> ,,2
#>       [,1]  [,2]  [,3]  [,4]
#> [1,] FALSE FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE FALSE
#> [3,] FALSE FALSE FALSE FALSE
#> [4,] FALSE FALSE FALSE FALSE
#> [5,] FALSE FALSE FALSE FALSE
#> 
#> ,,3
#>       [,1]  [,2]  [,3]  [,4]
#> [1,] FALSE FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE FALSE
#> [3,] FALSE FALSE FALSE FALSE
#> [4,] FALSE FALSE FALSE FALSE
#> [5,] FALSE FALSE FALSE FALSE
#> 

SVT_SparseArray(dim=c(35000, 2e6), type="raw")  # allzero object
#> <35000 x 2000000 SparseMatrix> of type "raw" [nzcount=0 (0%)]:
#>                [,1]       [,2]       [,3] ... [,1999999] [,2000000]
#>     [1,]         00         00         00   .         00         00
#>     [2,]         00         00         00   .         00         00
#>     [3,]         00         00         00   .         00         00
#>     [4,]         00         00         00   .         00         00
#>     [5,]         00         00         00   .         00         00
#>      ...          .          .          .   .          .          .
#> [34996,]         00         00         00   .         00         00
#> [34997,]         00         00         00   .         00         00
#> [34998,]         00         00         00   .         00         00
#> [34999,]         00         00         00   .         00         00
#> [35000,]         00         00         00   .         00         00

## Use a dgCMatrix object to fill the SVT_SparseArray object to construct:
x <- rsparsematrix(10, 16, density=0.1)  # random dgCMatrix object
SVT_SparseArray(x, dim=c(8, 5, 4))
#> <8 x 5 x 4 SparseArray> of type "double" [nzcount=16 (10%)]:
#> ,,1
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.00 0.00 0.00 0.99 0.00
#> [2,] 0.00 0.00 0.00 0.00 0.00
#>  ...    .    .    .    .    .
#> [7,]  0.0  0.0  0.0  0.0  0.0
#> [8,]  0.2  0.0  0.0  0.0  0.0
#> 
#> ...
#> 
#> ,,4
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]  0.0  0.0  0.0  0.0  0.0
#> [2,]  0.0  0.0  0.0  0.0  0.2
#>  ...    .    .    .    .    .
#> [7,]    0    0    0    0    0
#> [8,]    0    0    0    0    0
#> 

svt1 <- SVT_SparseArray(dim=c(12, 5, 2))  # allzero object
svt1[cbind(11, 2:5, 2)] <- 22:25
svt1
#> <12 x 5 x 2 SparseArray> of type "integer" [nzcount=4 (3.3%)]:
#> ,,1
#>       [,1] [,2] [,3] [,4] [,5]
#>  [1,]    0    0    0    0    0
#>  [2,]    0    0    0    0    0
#>   ...    .    .    .    .    .
#> [11,]    0    0    0    0    0
#> [12,]    0    0    0    0    0
#> 
#> ,,2
#>       [,1] [,2] [,3] [,4] [,5]
#>  [1,]    0    0    0    0    0
#>  [2,]    0    0    0    0    0
#>   ...    .    .    .    .    .
#> [11,]    0   22   23   24   25
#> [12,]    0    0    0    0    0
#> 

svt2 <- SVT_SparseArray(dim=c(6, 4), type="integer",
                        dimnames=list(letters[1:6], LETTERS[1:4]))
svt2[c(1:2, 8, 10, 15:17, 24)] <- (1:8)*10L
svt2
#> <6 x 4 SparseMatrix> of type "integer" [nzcount=8 (33%)]:
#>    A  B  C  D
#> a 10  0  0  0
#> b 20 30  0  0
#> c  0  0 50  0
#> d  0 40 60  0
#> e  0  0 70  0
#> f  0  0  0 80

## ---------------------------------------------------------------------
## CSC (Compressed Sparse Column) LAYOUT VS SVT LAYOUT
## ---------------------------------------------------------------------

## dgCMatrix objects from the Matrix package use the CSC layout:
dgcm2 <- as(svt2, "dgCMatrix")
dgcm2@x  # nonzero values
#> [1] 10 20 30 40 50 60 70 80
dgcm2@i  # row indices of the nonzero values
#> [1] 0 1 1 3 2 3 4 5
dgcm2@p  # breakpoints (0 followed by one breakpoint per column)
#> [1] 0 2 4 7 8

str(svt2)
#> Formal class 'SVT_SparseMatrix' [package "SparseArray"] with 5 slots
#>   ..@ type        : chr "integer"
#>   ..@ SVT         :List of 4
#>   .. ..$ :List of 2
#>   .. .. ..$ : int [1:2] 10 20
#>   .. .. ..$ : int [1:2] 0 1
#>   .. ..$ :List of 2
#>   .. .. ..$ : int [1:2] 30 40
#>   .. .. ..$ : int [1:2] 1 3
#>   .. ..$ :List of 2
#>   .. .. ..$ : int [1:3] 50 60 70
#>   .. .. ..$ : int [1:3] 2 3 4
#>   .. ..$ :List of 2
#>   .. .. ..$ : int 80
#>   .. .. ..$ : int 5
#>   ..@ .svt_version: int 1
#>   ..@ dim         : int [1:2] 6 4
#>   ..@ dimnames    :List of 2
#>   .. ..$ : chr [1:6] "a" "b" "c" "d" ...
#>   .. ..$ : chr [1:4] "A" "B" "C" "D"

m3 <- matrix(rpois(54e6, lambda=0.4), ncol=1200)

## Note that 'SparseArray(m3)' can also be used for this:
svt3 <- SVT_SparseArray(m3)
svt3
#> <45000 x 1200 SparseMatrix> of type "integer" [nzcount=17805192 (33%)]:
#>             [,1]    [,2]    [,3]    [,4] ... [,1197] [,1198] [,1199] [,1200]
#>     [1,]       0       0       0       1   .       0       0       0       1
#>     [2,]       0       0       0       1   .       0       0       0       0
#>     [3,]       0       0       0       0   .       0       0       0       0
#>     [4,]       0       0       1       1   .       0       0       0       1
#>     [5,]       0       0       0       1   .       0       0       0       0
#>      ...       .       .       .       .   .       .       .       .       .
#> [44996,]       0       0       0       0   .       0       0       1       2
#> [44997,]       1       2       0       0   .       0       0       1       0
#> [44998,]       0       0       0       0   .       0       0       0       1
#> [44999,]       0       0       0       1   .       1       1       1       0
#> [45000,]       0       1       1       1   .       0       0       1       1

dgcm3 <- as(m3, "dgCMatrix")

## Compare type and memory footprint:
type(svt3)
#> [1] "integer"
object.size(svt3)
#> 142649336 bytes
type(dgcm3)
#> [1] "double"
object.size(dgcm3)
#> 213668608 bytes

## Transpose:
system.time(svt <- t(t(svt3)))
#>    user  system elapsed 
#>   0.584   0.080   0.663 
system.time(dgcm <- t(t(dgcm3)))
#>    user  system elapsed 
#>   0.222   0.035   0.257 
identical(svt, svt3)
#> [1] TRUE
identical(dgcm, dgcm3)
#> [1] TRUE

## rbind():
m4 <- matrix(rpois(45e6, lambda=0.4), ncol=1200)
svt4 <- SVT_SparseArray(m4)
dgcm4 <- as(m4, "dgCMatrix")

system.time(rbind(svt3, svt4))
#>    user  system elapsed 
#>   0.210   0.078   0.288 
system.time(rbind(dgcm3, dgcm4))
#>    user  system elapsed 
#>   0.088   0.036   0.124