Parallel High Performance Statistical Bootstrapping in Python

Description

BLB ("Bag of Little Bootstraps") is a method for assessing the quality of a statistical estimator based upon subsets of sample distributions. BLB is a variant of the general bootstrap and solves the same class of problems. Unfortunately, the general bootstrap is computationally demanding on large data sets and does not parallelize easily. BLB is an attractive alternative because its structural and computational properties allow for much better parallelization. However, two obstacles stand in the way of realizing this parallelism in practice. First, expressing the parallelism inherent in the algorithm requires quite different code depending on the platform (for example, multi-core with a Cilk-aware compiler vs. GPU with CUDA or OpenCL vs. shared-nothing cloud computing with Spark or Hadoop). Second, even given the skeleton code for a particular platform, the specific estimator function being computed is different for each application, making it difficult to encapsulate the BLB pattern in a library. We apply SEJITS (Selective Embedded Just-in-Time Specialization) to solve both problems: scientists write applications in Python that use estimator functions also written in (a subset of) Python, and just-in-time code generation techniques "lower" these functions to efficiency-level languages and inline them into an execution template optimized for each type of parallel platform. In this paper we focus on the multicore environment. The result is that Python applications that use BLB are source- and performance-portable, with performance comparable to hand-written code in low-level languages. We expect that code variants produced for the multicore environment can reasonably support data sets on the order of tens of gigabytes in size, and that variants for the cloud environment can support data sets on the order of terabytes in size.
Preliminary results from a simple linear regression application show a 13.6x speedup on 16 cores relative to 1 core; this parallel performance was obtained simply by coding the linear estimator function in Python. Our runtime compiler (specializer) for BLB augments a growing family of such compilers that will ultimately allow Python applications to make use of generic computational patterns across different platforms.
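To make the pattern concrete, the following is a minimal serial sketch of BLB in pure Python. It is not the specializer's API and involves no code generation. The names `blb`, `weighted_mean`, `num_subsets`, `num_resamples`, and `gamma` are illustrative assumptions. For brevity the estimator is a weighted mean rather than linear regression, and the quality measure is the standard deviation of the estimates.

```python
import random
import statistics
from collections import Counter


def weighted_mean(values, weights):
    """Example estimator: mean of `values`, where point i counts weights[i] times."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def blb(data, estimator, num_subsets=5, num_resamples=30, gamma=0.7, seed=0):
    """Estimate the standard error of `estimator` on `data` using BLB.

    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = len(data)
    b = max(2, int(round(n ** gamma)))  # size of each small subset, b < n
    per_subset_quality = []
    # Each iteration of this outer loop is independent of the others,
    # which is where the parallelism described above comes from.
    for _ in range(num_subsets):
        subset = rng.sample(data, b)  # drawn without replacement
        estimates = []
        for _ in range(num_resamples):
            # A resample of size n over only b distinct points is
            # represented compactly as multinomial counts.
            counts = Counter(rng.choices(range(b), k=n))
            weights = [counts.get(i, 0) for i in range(b)]
            estimates.append(estimator(subset, weights))
        # Quality assessment for this subset: spread of the estimates.
        per_subset_quality.append(statistics.stdev(estimates))
    # Average the per-subset assessments to get the final answer.
    return statistics.fmean(per_subset_quality)


# Usage: standard error of the mean of 2000 standard-normal draws.
sample_rng = random.Random(42)
sample = [sample_rng.gauss(0.0, 1.0) for _ in range(2000)]
print(blb(sample, weighted_mean))
```

In this sketch the user supplies only the estimator function, which mirrors the division of labor described above: the surrounding loop structure is the reusable template, and the estimator is the application-specific part.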