Both of these implementations use coarse-grained parallelization to split work across multiple cores, but neither can make use of an available graphics card for computation. In fact, aside from an implementation custom-tailored for vision tasks and a few research papers of questionable worth, I haven't been able to find a general-purpose GPU implementation of Random Forests at all. So, for the past few months I've worked with Yisheng Liao to see if we could dramatically lower the training time of a Random Forest by off-loading as much computation to the GPU as possible.
We made a GPU Random Forest library for Python called CudaTree which, on recent-generation NVIDIA GPUs, can be 2x-6x faster than scikit-learn.
CudaTree is available on PyPI and can be installed by running `pip install cudatree`. It's written for Python 2.7 and depends on NumPy, PyCUDA (to compile and launch CUDA kernels), and Parakeet (to accelerate CPU-side computations).
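Here's a minimal usage sketch. I'm assuming the scikit-learn-style constructor and fit/predict interface mentioned further down; the exact keyword arguments may differ in the version you install:

```python
import numpy as np
from cudatree import RandomForestClassifier

# Toy data: 10,000 samples, 20 features, 10 classes.
x_train = np.random.randn(10000, 20).astype(np.float32)
y_train = np.random.randint(0, 10, size=10000).astype(np.int32)

# The constructor takes sklearn-like options (e.g. n_estimators);
# tree construction runs on the GPU through PyCUDA kernels.
forest = RandomForestClassifier(n_estimators=50)
forest.fit(x_train, y_train)
predictions = forest.predict(x_train)
```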
Benchmarks
We trained scikit-learn Random Forests (with fifty trees) on four medium-sized datasets (the largest takes up ~500MB of memory) on a 6-core Xeon E5-2630, and compared that baseline against CudaTree's training times on machines with a variety of NVIDIA graphics processors:
- Tesla C1060 (GT200, 240 cores)
- Tesla C2075 (Fermi, 448 cores)
- Tesla K20x (Kepler, 2,688 cores)
- Titan (Kepler, 2,688 cores)
Dataset | scikit-learn | CudaTree (C1060) | CudaTree (C2075) | CudaTree (K20x) | CudaTree (Titan) |
---|---|---|---|---|---|
CIFAR-10 (raw) | 114s | 52s | 40s | 24s | 20s (5.7x faster) |
CIFAR-100 | 800s | 707s | 308s | 162s | 136s (5.8x faster) |
ImageNet subset | 115s | 80s | 60s | 35s | 28s (4.1x faster) |
covertype | 138s | - | 86s | - | 50s (2.7x faster) |
edit: Thanks to everyone who nudged me to validate the accuracy of CudaTree's generated models; there was actually a bug which caused the GPU code to stop early on the covertype data. We had unit tests for training error but none for held-out data. Noob mistake. The latest version of CudaTree fixes this mistake, and the performance impact is negligible on all the other datasets. I'll add new covertype timings once the machines we used are free.
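For the curious, the timing comparison was essentially of the following shape. This is a sketch rather than our exact benchmark harness, and the .npy paths are placeholders for however you load one of the datasets listed below:

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier as SklearnRF
from cudatree import RandomForestClassifier as CudaTreeRF

def time_fit(model, x, y):
    # Wall-clock seconds for a single training run.
    start = time.time()
    model.fit(x, y)
    return time.time() - start

# Placeholder paths: load one of the benchmark datasets as NumPy arrays.
x_train = np.load("x_train.npy").astype(np.float32)
y_train = np.load("y_train.npy").astype(np.int32)

cpu = time_fit(SklearnRF(n_estimators=50, n_jobs=6), x_train, y_train)
gpu = time_fit(CudaTreeRF(n_estimators=50), x_train, y_train)
print "sklearn: %.1fs, CudaTree: %.1fs (%.1fx faster)" % (cpu, gpu, cpu / gpu)
```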
Information about the datasets used above:
Name | Features | Samples | Classes | Description |
---|---|---|---|---|
CIFAR-10 | 3,072 | 10,000 | 10 | Raw pixel values to classify images of objects. |
CIFAR-100 | 3,072 | 50,000 | 100 | Same as CIFAR-10, but with more samples and more labels. |
ImageNet subset | 4,096 | 10,000 | 10 | Random subset of 10 labels from the 1,000-category ImageNet dataset, processed by the convolutional filters of a trained convolutional neural network (amazingly, it attains the same accuracy!) |
covertype | 54 | 581,012 | 7 | Identify tree cover from a given set of 54 domain-specific features. |
Limitations
CudaTree currently only implements a RandomForestClassifier, though regression trees are forthcoming. Furthermore, CudaTree's performance degrades when the number of target labels exceeds a few hundred, and it might stop working altogether when the number of labels creeps into the thousands. Also, some datasets are too large for use with CudaTree: not only does your data have to fit in the GPU's comparatively smaller memory, it has to contend with CudaTree's auxiliary data structures, which are proportional in size to your data. The exact bound on the largest dataset you can use is a little hairy, but in general try not to exceed about half the size of your GPU's memory.
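If you want a rough check before committing to a long training run, PyCUDA can tell you how much GPU memory is free. The half-of-memory rule of thumb above is approximate, so treat this as a sanity check rather than a guarantee:

```python
import numpy as np
import pycuda.autoinit              # creates a CUDA context on the default device
import pycuda.driver as cuda

def probably_fits_on_gpu(x):
    # Heuristic: leave roughly half of free GPU memory for CudaTree's
    # auxiliary data structures, per the rule of thumb above.
    free_bytes, total_bytes = cuda.mem_get_info()
    return x.nbytes < free_bytes // 2

x_train = np.random.randn(10000, 3072).astype(np.float32)   # placeholder data
print "probably fits:", probably_fits_on_gpu(x_train)
```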
If your data & task fit within these restrictions, then please give CudaTree a spin and let us know how it works for you.
Due credit: Yisheng did most of the coding, debugging, and profiling. So, you know, the actual hard work. I got to sit back and play advisor/"ideas man", which was a lot of fun. Also, the Tesla K20 used for this research was kindly donated by the NVIDIA Corporation, thanks NVIDIA!
Interesting post. Could you please provide some test set predictive accuracy of the trained models?
While I was checking accuracy I found an early-stopping bug which gave us lower accuracy on covertype (http://nbviewer.ipython.org/6968728). The latest version of CudaTree fixes this problem (and presents a constructor with options like sklearn).
Now the accuracy is (on average) very close to sklearn's. The only runtime/performance impact is on covertype, and I have to wait a few days to regenerate those timings until the GPUs we used are liberated from their convolutional overlords.
Example scores from 4 different training runs on half of covtype (with the other half as test), using 21 trees:
sklearn score 0.73087647071
cudatree score 0.734704274611
sklearn score 0.736150027882
cudatree score 0.740900360061
sklearn score 0.737134517015
cudatree score 0.737000268497
sklearn score 0.739368550047
cudatree score 0.734983098456
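(For anyone who wants to reproduce this comparison, the setup was roughly the sketch below. Note that fetch_covtype ships with newer scikit-learn releases than the 0.14 benchmarked in the post, and I'm computing CudaTree's accuracy from raw predictions rather than assuming a score() method.)

```python
import numpy as np
from sklearn.datasets import fetch_covtype
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier as SklearnRF
from cudatree import RandomForestClassifier as CudaTreeRF

covtype = fetch_covtype()
x = covtype.data.astype(np.float32)
y = covtype.target.astype(np.int32)

# Train on half of covtype, test on the other half, 21 trees each.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

sk_forest = SklearnRF(n_estimators=21, n_jobs=6).fit(x_train, y_train)
print "sklearn score", sk_forest.score(x_test, y_test)

gpu_forest = CudaTreeRF(n_estimators=21)
gpu_forest.fit(x_train, y_train)
print "cudatree score", np.mean(gpu_forest.predict(x_test) == y_test)
```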
This might be apples and oranges, but with multiboost (multiboost.org) trained on 15K training instances, Hamming trees with 70 nodes, 21 iterations takes 191s (single CPU), accuracy 76%. With cross-validated stopping, after 18,300 iterations we get to 85.16%; it takes 46 hours.
Is there a reason you are not providing WiseRF figures? Nitpick: it is scikit-learn (for some time now ;)
Thanks for pointing out the name change; I updated the text. Also, I don't have wiseRF since trying to get an academic license from their website gave me cryptic SQL errors and I'm easily discouraged. What's your impression of the speed gap between sklearn 0.14's trees and wiseRF?
When I tested it, wiseRF was about 20% faster.
We've been testing against wiseRF for our paper in the upcoming Big Learning workshop and it's surprised us with how much faster it can be compared to sklearn. Sometimes they're on par and other times wiseRF is 10x faster. A lot of it seems to be data dependent. Which datasets did you try?
I too found that runtime is often data dependent. Can you share the list of datasets you tested on? I'd like to investigate that. Thanks.
Hey Peter,
We'll have a paper at the Big Learning workshop at NIPS with performance on the KDD '99 intrusion data, http://archive.ics.uci.edu/ml/datasets/covertype, http://archive.ics.uci.edu/ml/datasets/Poker+Hand, http://www.cs.toronto.edu/~kriz/cifar.html, the first 10 categories of http://www.image-net.org/challenges/LSVRC/2013/#data, and some smaller datasets.
BTW Alex, have you tried to implement a pure Python & Parakeet version of RandomForestClassifier? It should run close to the speed of sklearn's Cython code with simpler code, don't you think?
No, but I think this is a great idea. Maybe I'll add it to python-benchmarks?
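To give a flavor of what that would involve: Parakeet's @jit decorator compiles plain Python loops over NumPy arrays, so the split-scoring inner loops could stay in Python. A toy sketch of one such kernel (illustrative only, not code from CudaTree or sklearn):

```python
import numpy as np
from parakeet import jit

@jit
def gini_impurity(label_counts, n_classes, n_samples):
    # 1 - sum of squared class proportions: the quantity a decision
    # tree minimizes when scoring a candidate split.
    total = 0.0
    for i in range(n_classes):
        p = label_counts[i] / n_samples
        total += p * p
    return 1.0 - total

counts = np.array([120.0, 30.0, 50.0])
print gini_impurity(counts, 3, 200.0)
```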
Silly question, but do you think it would have a chance of running on a Raspberry Pi's GPU? My understanding is that while it's nowhere near a full PC's GPU, it is still decently sized because it has to handle HD decoding.
This code definitely wouldn't, since we're using CUDA and the Raspberry Pi (as far as I can tell) doesn't have an NVIDIA GPU. Do you know if the GPU they're using has any capability for running general-purpose programs?
Many thanks for your work. I am going to give the library a try!
Can you suggest how to obtain wiseRF 1.5? I have wiseRF 1.1 through Anaconda, but I could not find any documentation for that version.
Currently I am doing classification of reviews using naive Bayes. I have a set of reviews as a training set; can you explain how to create training data for a random forest?