Friday, October 11, 2013

Training Random Forests in Python using the GPU

Random Forests have emerged as a very popular learning algorithm for tackling complex prediction problems. Part of their popularity stems from how remarkably well they work as "black-box" predictors to model nearly arbitrary variable interactions (as opposed to models which are more sensitive to noise and variable scaling). If you want to construct random forests in Python, then the two best options at the moment are scikit-learn (which uses some carefully optimized native code for tree learning) and the slightly speedier closed-source wiseRF.
Both of these implementations use coarse-grained parallelization to split work across multiple cores, but neither of them can make use of an available graphics card for computation. In fact, aside from an implementation custom-tailored for vision tasks and a few research papers of questionable worth, I haven't been able to find a general-purpose GPU implementation of Random Forests at all. So, for the past few months I've worked with Yisheng Liao to see if we could dramatically lower the training time of a Random Forest by offloading as much computation to the GPU as possible.
We made a GPU Random Forest library for Python called CudaTree, which, on recent-generation NVIDIA GPUs, can be 2x to 6x faster than scikit-learn.
CudaTree is available on PyPI and can be installed by running pip install cudatree. It's written for Python 2.7 and depends on NumPy, PyCUDA (to compile and launch CUDA kernels), and Parakeet (to accelerate CPU-side computations).
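
CudaTree's classifier is meant to stand in for scikit-learn's, so a training script looks roughly like the sketch below. The toy data and exact keyword arguments here are illustrative rather than taken from CudaTree's documentation:

    import numpy as np
    from cudatree import RandomForestClassifier

    # Toy data standing in for a real dataset; CudaTree works on dense NumPy arrays.
    x_train = np.random.randn(1000, 20).astype(np.float32)
    y_train = np.random.randint(0, 10, size=1000).astype(np.int32)

    forest = RandomForestClassifier(n_estimators=50)  # 50 trees, as in the benchmarks below
    forest.fit(x_train, y_train)
    predictions = forest.predict(x_train)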

Benchmarks

We trained scikit-learn Random Forests (with fifty trees) on four medium-sized datasets (the largest takes up ~500MB of memory) using a 6-core Xeon E5-2630, and then compared this baseline against CudaTree's training times on machines with a variety of NVIDIA graphics processors:
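
The scikit-learn baseline was run with fifty trees across all available cores; the timing loop looks roughly like this sketch (stand-in data shown here, not the benchmark datasets themselves):

    import time
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Stand-in data; the actual benchmarks use the datasets described below.
    x_train = np.random.randn(10000, 100).astype(np.float32)
    y_train = np.random.randint(0, 10, size=10000)

    forest = RandomForestClassifier(n_estimators=50, n_jobs=-1)  # 50 trees, all cores
    start = time.time()
    forest.fit(x_train, y_train)
    print("scikit-learn training took %.1f seconds" % (time.time() - start))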

Dataset          scikit-learn  CudaTree (C1060)  CudaTree (C2075)  CudaTree (K20x)  CudaTree (Titan)  Best speedup
CIFAR-10 (raw)   114s          52s               40s               24s              20s               5.7x
CIFAR-100        800s          707s              308s              162s             136s              5.8x
ImageNet subset  115s          80s               60s               35s              28s               4.1x
covertype        138s          -                 86s               -                50s               2.7x

edit: Thanks to everyone who nudged me to validate the accuracy of the models CudaTree generates; there was a bug which caused the GPU code to stop early on the covertype data. We had unit tests for training error but none for held-out data. Noob mistake. The latest version of CudaTree fixes the bug, and the performance impact is negligible on all the other datasets. I'll add new covertype timings once the machines we used are free.

Information about the datasets used above:

Name             Features  Samples  Classes  Description
CIFAR-10         3,072     10,000   10       Raw pixel values used to classify images of objects.
CIFAR-100        3,072     50,000   100      Same as CIFAR-10, but with more samples and more labels.
ImageNet subset  4,096     10,000   10       Random subset of 10 labels from the 1,000-category ImageNet dataset, processed by the convolutional filters of a trained convolutional neural network (amazingly, attains the same accuracy!).
covertype        54        581,012  7        Identify tree cover type from a set of 54 domain-specific features.

Limitations

CudaTree currently implements only a RandomForestClassifier, though regression trees are forthcoming. Furthermore, CudaTree's performance degrades when the number of target labels exceeds a few hundred, and it may stop working altogether once the number of labels creeps into the thousands.

Also, some datasets are simply too large for use with CudaTree. Not only does your data have to fit in the GPU's comparatively smaller memory, it also has to share that memory with CudaTree's auxiliary data structures, which are proportional in size to your data. The exact formula for the largest dataset you can handle is a little hairy, but as a rule of thumb try not to exceed about half the size of your GPU's memory.
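
If you want a rough pre-flight check, you can compare your array's footprint against the card's memory using PyCUDA, which CudaTree already depends on. The helper below is hypothetical and only encodes the rule of thumb above:

    import numpy as np
    import pycuda.autoinit          # creates a CUDA context
    import pycuda.driver as cuda

    def roughly_fits_on_gpu(x, fraction=0.5):
        """Hypothetical check: is the array smaller than about half of GPU memory?"""
        free_bytes, total_bytes = cuda.mem_get_info()
        return x.nbytes < fraction * total_bytes

    x_train = np.random.randn(10000, 3072).astype(np.float32)  # ~120MB
    print(roughly_fits_on_gpu(x_train))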

If your data & task fit within these restrictions, then please give CudaTree a spin and let us know how it works for you.

Due credit: Yisheng did most of the coding, debugging, and profiling. So, you know, the actual hard work. I got to sit back and play advisor/"ideas man", which was a lot of fun. Also, the Tesla K20 used for this research was kindly donated by the NVIDIA Corporation. Thanks, NVIDIA!