Wednesday, March 5, 2014

Big speedup for training Random Forests in scikit-learn 0.15

Until recently, wiseRF was clearly the fastest Random Forest implementation available for Python (and thus the best library for dealing with larger in-memory datasets). Though scikit-learn has had tree ensembles for the past several years, their performance was typically at least an order of magnitude worse than wiseRF's (a boon to wiseRF's marketing team). The sklearn developers seemed to shake off their tree-building sluggishness with a Cython rewrite in the 0.14 release.

Unfortunately, as Yisheng and I discovered while working on CudaTree, even the faster Cython tree builder can still be significantly slower than wiseRF. Why does a performance gap remain when both libraries now use native implementations? wiseRF is probably doing something smarter with its choice of algorithms and/or data layout, but the iron curtain of closed-source software keeps us from finding out exactly what's going on.

It turns out that one important choice for building trees efficiently is the algorithm used to sort candidate splitting thresholds. The upcoming 0.15 release of scikit-learn will include some cache-friendly changes to how their algorithm sorts data. These modifications seem to have finally closed the gap with wiseRF.
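To see where that sorting time goes, here's a minimal NumPy sketch of the per-feature split search a tree builder repeats at every node. This is only an illustration under simplifying assumptions (Gini impurity, integer class labels), not scikit-learn's actual Cython splitter:

    import numpy as np

    def best_split_for_feature(x, y, n_classes):
        """Find the threshold on one feature minimizing weighted Gini impurity.

        x: 1D float array of feature values for the samples at this node
        y: 1D integer array of class labels (0 .. n_classes-1)
        """
        order = np.argsort(x)            # the sort every candidate feature pays for
        x_sorted, y_sorted = x[order], y[order]

        n = len(y)
        left = np.zeros(n_classes)       # class counts left of the threshold
        right = np.bincount(y_sorted, minlength=n_classes).astype(float)

        best_score, best_threshold = np.inf, None
        for i in range(n - 1):
            # Move sample i from the right partition to the left.
            c = y_sorted[i]
            left[c] += 1
            right[c] -= 1
            if x_sorted[i] == x_sorted[i + 1]:
                continue                 # can't split between equal values
            n_left, n_right = i + 1, n - i - 1
            gini_left = 1.0 - np.sum((left / n_left) ** 2)
            gini_right = 1.0 - np.sum((right / n_right) ** 2)
            score = (n_left * gini_left + n_right * gini_right) / n
            if score < best_score:
                best_score = score
                best_threshold = 0.5 * (x_sorted[i] + x_sorted[i + 1])
        return best_threshold, best_score

Once the values are sorted, the sweep itself is a cheap linear scan, so the sort (and how it touches memory) ends up dominating the cost of growing each tree.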

Below are the benchmark times from the CudaTree paper, with the current development branch of scikit-learn included under the label scikit-learn 0.15. The takeaway is that the new release builds Random Forests 2x-6x faster than the old one, and that the remaining performance differences between scikit-learn, wiseRF, and CudaTree are not significant.

Training times for 100 trees grown on a 6-core Xeon E5-2630 machine with an NVIDIA Titan graphics card:

Dataset            wiseRF 1.5.11   scikit-learn 0.14   scikit-learn 0.15   CudaTree 0.6
ImageNet subset    23s             50s                 13s                 25s
CIFAR-100 (raw)    160s            502s                181s                197s
covertype          107s            463s                73s                 67s
poker              117s            415s                99s                 59s
PAMAP2             1,066s          7,630s              1,683s              934s
intrusion          667s            1,528s              241s                199s

Information about the datasets used above:

Name               Features   Samples   Classes   Description
ImageNet subset    4,096      10k       10        Random subset of 10 labels from the 1,000-category ImageNet dataset, processed by the convolutional filters of a trained convolutional neural network (amazingly, attains the same accuracy!)
CIFAR-100          3,072      50k       100       Same as CIFAR-10, but with more samples and more labels.
covertype          57         581k      7         Identify tree cover from domain-specific features.
poker              11         1M        10        Poker hands
PAMAP2             52         2.87M     13        Physical activity monitoring
intrusion          41         5M        24        Network intrusion
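If you want to check the speedup on your own hardware, a rough version of the covertype benchmark takes only a few lines. This sketch assumes scikit-learn's built-in fetch_covtype loader; absolute times will of course depend on your machine and scikit-learn version, so don't expect them to match the table above exactly:

    import time
    from sklearn.datasets import fetch_covtype
    from sklearn.ensemble import RandomForestClassifier

    # Load the 581k-sample covertype dataset used in the benchmarks above.
    data = fetch_covtype()

    # 100 trees, all available cores (the benchmark machine had 6).
    forest = RandomForestClassifier(n_estimators=100, n_jobs=-1)

    start = time.time()
    forest.fit(data.data, data.target)
    print("Trained 100 trees in %.1fs" % (time.time() - start))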