Abstract

Running a large number of methods on several data sets, as in a benchmark comparison, can be computationally expensive, possibly requiring hours to days to complete if each method is run in serial. Parallelization in SummarizedBenchmark, both across methods and across data sets, can be accomplished using the BiocParallel package. Both forms of parallelization are demonstrated with an example case study. SummarizedBenchmark package version: 2.3.4

Introduction

The simple examples considered in most of these vignettes were constructed to be computationally manageable with only one core. However, when working with larger data sets, running each method in serial on a single machine is often undesirable. Similarly, when replicating benchmark experiments across multiple data sets, running each experiment in serial may be inefficient. In this section, we describe how to use the BiocParallel package to parallelize benchmarking across methods and data sets. More details on how to specify the parallelization back-end can be found in the Introduction to BiocParallel vignette of the BiocParallel package.
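For orientation, the default back-end can be inspected with bpparam() and changed with register(). The following is a minimal sketch; the choice of MulticoreParam and the worker count are illustrative.

library("BiocParallel")

# print the current default back-end
bpparam()

# register a fork-based back-end with 2 workers as the new default
# (on Windows, a socket-based SnowParam() would be used instead)
register(MulticoreParam(workers = 2))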

Example Case Study

library("SummarizedBenchmark")
library("magrittr")

To demonstrate parallelization, we use the same example of comparing methods for differential expression from the SummarizedBenchmark: Full Case Study vignette. The BenchDesign is initialized using the RNA-seq counts and ground truth included with the package as an example data set. The data are described in more detail in the Case Study vignette.

library("limma")
library("edgeR")
library("DESeq2")
library("tximport")

data(soneson2016)

# round the estimated counts loaded with the data set to integers
mycounts <- round(txi$counts)
# two conditions with three replicates each
mycoldat <- data.frame(condition = factor(rep(c(1, 2), each = 3)))
rownames(mycoldat) <- colnames(mycounts)
# bundle counts, sample metadata, and ground truth into a single list
mydat <- list(coldat = mycoldat, cntdat = mycounts,
              status = truthdat$status, lfc = truthdat$logFC)

bd <- BenchDesign(data = mydat)

Three methods for differential expression testing, implemented in the DESeq2, edgeR, and limma packages, are added to the benchmark. Only the p-values are stored for each method, as before.
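The chunk adding the methods is not shown in this rendering; a minimal sketch following the Full Case Study vignette (the wrapper function names are illustrative, and the labels match the columns in the output shown later) would be:

# each wrapper runs one DE pipeline and returns only the p-values
deseq2_pvals <- function(countData, colData, design, contrast) {
    dds <- DESeqDataSetFromMatrix(countData = countData,
                                  colData = colData, design = design)
    dds <- DESeq(dds)
    results(dds, contrast = contrast)$pvalue
}

edgeR_pvals <- function(countData, group, design) {
    y <- DGEList(countData, group = group)
    y <- calcNormFactors(y)
    des <- model.matrix(design)
    y <- estimateDisp(y, des)
    fit <- glmFit(y, des)
    glmLRT(fit, coef = 2)$table$PValue
}

voom_pvals <- function(countData, group, design) {
    y <- DGEList(countData, group = group)
    y <- calcNormFactors(y)
    des <- model.matrix(design)
    y <- voom(y, des)
    eBayes(lmFit(y, des))$p.value[, 2]
}

# parameters reference names in the data list stored in the BenchDesign
bd <- bd %>%
    addMethod(label = "deseq2", func = deseq2_pvals,
              params = rlang::quos(countData = cntdat,
                                   colData = coldat,
                                   design = ~condition,
                                   contrast = c("condition", "2", "1"))) %>%
    addMethod(label = "edgeR", func = edgeR_pvals,
              params = rlang::quos(countData = cntdat,
                                   group = coldat$condition,
                                   design = ~coldat$condition)) %>%
    addMethod(label = "voom", func = voom_pvals,
              params = rlang::quos(countData = cntdat,
                                   group = coldat$condition,
                                   design = ~coldat$condition))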

Across Methods

Since constructing a BenchDesign object requires no computation, the bottleneck only appears at the buildBench() step of the process. Parallelization of this step is enabled using the BiocParallel package. By default, parallel evaluation is disabled, but can easily be enabled by setting parallel = TRUE and optionally specifying the BPPARAM = parameter. If BPPARAM = is not specified, the default back-end will be used. The default back-end can be checked with bpparam().

Parallelization of buildBench() is carried out across the set of methods specified with addMethod(). Thus, there is no benefit to specifying more cores than the number of methods.
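The chunk that produced the following output is not shown in this rendering; a minimal sketch consistent with the printed back-end (a fork-based MulticoreParam with 2 workers) would be:

# fork-based back-end with 2 workers, passed explicitly to buildBench()
bpp <- MulticoreParam(workers = 2)
bpp

sbp <- buildBench(bd, truthCols = "status", parallel = TRUE, BPPARAM = bpp)
sbp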

## class: MulticoreParam
##   bpisup: FALSE; bpnworkers: 2; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
##   bpexportglobals: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: FORK
## class: SummarizedBenchmark 
## dim: 15677 3 
## metadata(1): sessions
## assays(1): default
## rownames: NULL
## rowData names(1): default
## colnames(3): deseq2 edgeR voom
## colData names(9): func.pkg func.pkg.vers ... param.group
##   session.idx

We also run buildBench() without parallelization for comparison.

sb <- buildBench(bd, truthCols = "status")

The results, as expected, are the same as when buildBench() was called without parallelization.

all(assay(sbp) == assay(sb), na.rm = TRUE)
## [1] TRUE

Across Datasets

Typically, benchmark studies include more than a single dataset. In these cases, users can create a single BenchDesign object and execute that design on each dataset. Again, executing the benchmark design on every dataset using a single core might take too long. However, parallelization across datasets is possible using the BiocParallel package.

To demonstrate this, we split the count data and ground truth of the Soneson and Robinson (2016) dataset, as if they were three different datasets.

ndat <- length(mydat$status)
spIndexes <- split(seq_len(ndat), rep(1:3, length.out = ndat))

datasetList <- lapply(spIndexes, function(x) {
    list(coldat = mydat$coldat,
         cntdat = mydat$cntdat[x, ],
         status = mydat$status[x],
         lfc = mydat$lfc[x])
})

names(datasetList) <- c("dataset1", "dataset2", "dataset3")

Then, using a call to the bplapply() function, we can execute the BenchDesign object on each dataset. Note that via the BPPARAM = parameter, the execution of the BenchDesign is distributed across many computing instances. In the example below, we use 3 cores.
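The corresponding chunk is also not shown in this rendering; a minimal sketch, assuming a MulticoreParam back-end with 3 workers (the result name sbL is illustrative), would be:

# run buildBench() on each dataset in parallel; data = x overrides
# the data stored in the BenchDesign for that run
sbL <- bplapply(datasetList, function(x) {
    buildBench(bd, data = x, truthCols = "status")
}, BPPARAM = MulticoreParam(workers = 3))
sbL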

## $dataset1
## class: SummarizedBenchmark 
## dim: 5226 3 
## metadata(1): sessions
## assays(1): default
## rownames: NULL
## rowData names(1): default
## colnames(3): deseq2 edgeR voom
## colData names(9): func.pkg func.pkg.vers ... param.group
##   session.idx
## 
## $dataset2
## class: SummarizedBenchmark 
## dim: 5226 3 
## metadata(1): sessions
## assays(1): default
## rownames: NULL
## rowData names(1): default
## colnames(3): deseq2 edgeR voom
## colData names(9): func.pkg func.pkg.vers ... param.group
##   session.idx
## 
## $dataset3
## class: SummarizedBenchmark 
## dim: 5225 3 
## metadata(1): sessions
## assays(1): default
## rownames: NULL
## rowData names(1): default
## colnames(3): deseq2 edgeR voom
## colData names(9): func.pkg func.pkg.vers ... param.group
##   session.idx

References

Soneson, Charlotte, and Mark D Robinson. 2016. “iCOBRA: Open, Reproducible, Standardized and Live Method Benchmarking.” Nature Methods 13 (4): 283–85. https://doi.org/10.1038/nmeth.3805.