Wednesday, May 28, 2014

Spark should be better than MapReduce (if only it worked)

Spark is the user-friendly face of Big Data: a distributed programming framework which lets you write collection-oriented algorithms in Scala that (theoretically) execute seamlessly across many machines. Spark has an elegant API (parallel collections with methods like map/reduce/groupByKey) that feels like programming locally. Unlike MapReduce, Spark can cache partial results in the memory of its distributed workers, allowing for significantly faster/lower-latency computations. If Spark worked as promised, it would be a huge productivity boost over writing MapReduce pipelines.
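To make that style concrete, here's a minimal sketch of the parallel-collection API. The post's examples are Scala, but Spark's Python bindings mirror the same methods; the input path and app name below are made up for illustration:

```python
# A minimal sketch of Spark's collection-oriented style (PySpark syntax).
# The input path and application name are hypothetical.
from pyspark import SparkContext

sc = SparkContext("local[4]", "WordCountSketch")

lines = sc.textFile("hdfs:///some/input.txt")    # distributed collection of lines
counts = (lines
          .flatMap(lambda line: line.split())    # one record per word
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))      # aggregate counts per word

# Unlike MapReduce, an intermediate result can be kept in cluster memory
# and reused by later computations instead of being recomputed.
counts.cache()
print(counts.takeOrdered(10, key=lambda kv: -kv[1]))
```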

Unfortunately, as I've learned over the past month, Spark will happily generate programs that mysteriously grind to a halt, and tracking down the source of these problems can be a numbingly opaque process. There are at least two distinct problems: Spark's lazy evaluation makes it hard to know which parts of your program are the bottleneck, and, even if you can identify a particularly slow expression, it's not always obvious why it's slow or how to make it faster.

Lazy Evaluation

Spark's API lets you express your intentions clearly via many fine-grained calls to transformations such as map, filter, flatMap, distinct, &c. If you ran a long chain of transformations one at a time, you'd incur a large communication overhead and clog each worker's local cache with useless partial results. Spark reconciles its functional style with performance via delayed execution: transformations get bundled together and only run on demand. The unruly downside to Spark's execution model is that big swaths of your program will run as a monolithic glob of computation. And if that computation runs slowly...well, good luck figuring out which of its constituent parts is the culprit. Spark could ameliorate some of this confusion with a non-lazy debug mode or some built-in tooling assistance (e.g., a distributed profiler). For now, however, you're stuck (1) trying to reason your way through the lazy dependency graph ("Oh! This one is a shuffle dependency!") and (2) forcing computations and checking how long they take.
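Here's a rough sketch of what that second workaround ("force and time") looks like in practice; the data and parsing step are toy stand-ins for whatever pipeline you're bisecting:

```python
# A sketch of "force computations and time them" debugging: nothing runs
# until an action (like count) fires, so we force each suspect stage and
# time it separately. The data and parse step below are placeholders.
import time
from pyspark import SparkContext

sc = SparkContext("local[4]", "ForceAndTime")
records = sc.parallelize(["a 1", "b 2", "a 3"] * 100000)

def timed_count(rdd, label):
    start = time.time()
    n = rdd.count()                      # count() is an action: forces evaluation
    print("%s: %d rows in %.1fs" % (label, n, time.time() - start))

parsed = records.map(lambda line: tuple(line.split()))  # transformation: lazy
parsed.cache()                           # keep it resident for the next stage
timed_count(parsed, "parse")             # force just the parsing stage

grouped = parsed.groupByKey()            # shuffle dependency: often the culprit
timed_count(grouped, "groupByKey")       # now only the shuffle is new work
```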

Too Many Parameters, Too Many Ways to Fail

Though Spark makes it look like you're programming on your laptop, it comes with many performance gotchas you must vigilantly guard against. Make sure your objects aren't using Java serialization (Kryo is faster), pull that object out of the closure so it doesn't get serialized into every task, broadcast this array to keep it from being copied to each worker repeatedly. Though annoying for beginners, these rules of thumb at least have some consistency which can be learned.
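For the curious, here's a hedged sketch of those three rules in PySpark syntax; the lookup table is a stand-in for any large captured object, and the commented-out lines assume some hypothetical RDD named rdd:

```python
# Sketches of the three rules of thumb above (PySpark syntax).
from pyspark import SparkConf, SparkContext

# 1. Prefer Kryo over default Java serialization for JVM-side objects
#    (spark.serializer is a real Spark configuration key).
conf = (SparkConf()
        .setAppName("GotchaSketch")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)

big_table = {"a": 1, "b": 2}   # imagine this is hundreds of MB

# 2. Don't capture a large object in a closure: every task re-serializes it.
# bad = rdd.map(lambda x: big_table.get(x))

# 3. Broadcast it instead, so each worker receives one read-only copy.
bcast = sc.broadcast(big_table)
# good = rdd.map(lambda x: bcast.value.get(x))
```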

More frustrating, however, is the inevitability with which Spark's many (initially invisible) default parameters will be wrong for your application. Buffers and heap sizes will turn out to be too small. Your reductions will use too few (or too many) workers. Too much memory gets used for data caching, or maybe it's too little. There's no rigorous or systematic way to set these parameters: you wait until things fail and supplicate at the feet of Spark's heuristic knob pantheon. 
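To give a flavor of the pantheon, here's an assumed-typical sample of those knobs. The keys below are real Spark configuration properties from this era, but the values are exactly the sort of guesses you only arrive at after watching things fail:

```python
# A sample of the (initially invisible) default parameters, set via SparkConf.
# The property names are real; the values are illustrative guesses.
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.executor.memory", "6g")             # heap too small? raise it
        .set("spark.default.parallelism", "200")        # too few reduce tasks?
        .set("spark.storage.memoryFraction", "0.4")     # cache vs. working memory
        .set("spark.akka.frameSize", "64")              # big task results? raise it
        .set("spark.shuffle.consolidateFiles", "true")) # too many shuffle files?
```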

My Experience Thus Far

Nothing I do can save my cluster from spending hours thrashing its way through a modest input size before dying under a flood of obscure exceptions. I have recently tried seemingly every remedy, rule of thumb, and configuration tweak to keep Spark from its stubborn predilection toward dying.

After a month of "stab in the dark" debugging, I've learned a lot about the internals of Spark but still don't have a working application. In some weird twist of Stockholm syndrome, I've come to like expressing my algorithms in Spark's abstractions: if only they would actually run!

Has anyone else had a similar experience with Spark? Alternatively, has anyone had a positive experience with a non-trivial Spark codebase (bigger than tutorial word-count examples)? If so, what were your tricks for avoiding the death-by-a-thousand-shuffles I've been experiencing? And how about Spark's close cousins (Scalding, Scoobi, and Scrunch): are they substantially better (or worse)?



Wednesday, March 5, 2014

Big speedup for training Random Forests in scikit-learn 0.15

Until recently, wiseRF was clearly the fastest Random Forest implementation for Python (and thus the best library for dealing with larger in-memory datasets). Though scikit-learn has had tree ensembles for the past several years, their performance was typically at least an order of magnitude worse than wiseRF's (a boon to wiseRF's marketing team). The sklearn developers seemed to shake off their tree-building sluggishness with a Cython rewrite in the 0.14 release.

Unfortunately, as Yisheng and I discovered while working on CudaTree, even the faster Cython tree builder can still be significantly slower than wiseRF. Why is there still a performance gap when both libraries now use native implementations? wiseRF is probably doing something smarter with its choice of algorithms and/or data layout, but the iron curtain of closed-source software keeps us from finding out exactly what's going on.

It turns out that one important choice for building trees efficiently is the algorithm used to sort candidate splitting thresholds. The upcoming 0.15 release of scikit-learn will include some cache-friendly changes to how its tree builder sorts data. These modifications seem to have finally closed the gap with wiseRF.
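For anyone who wants to sanity-check the numbers on their own machine, a minimal timing sketch looks roughly like this, with a synthetic dataset standing in for the real benchmarks below:

```python
# A minimal sketch of the kind of timing comparison reported below,
# using synthetic data in place of the real benchmark datasets.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=50000, n_features=100,
                           n_informative=20, n_classes=10)

forest = RandomForestClassifier(n_estimators=100, n_jobs=6)  # 6-core machine
start = time.time()
forest.fit(X, y)
print("trained 100 trees in %.1fs" % (time.time() - start))
```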

Below are the benchmark times from the CudaTree paper, with the current branch of scikit-learn included under the label scikit-learn 0.15. The takeaway is that the new release will build Random Forests 2x-6x faster than the old one and that the performance differences between scikit-learn, wiseRF, and CudaTree are not significant.

Training times for 100 trees grown on a 6-core Xeon E5-2630 machine with an NVIDIA Titan graphics card:

Dataset         | wiseRF 1.5.11 | scikit-learn 0.14 | scikit-learn 0.15 | CudaTree 0.6
ImageNet subset | 23s           | 50s               | 13s               | 25s
CIFAR-100 (raw) | 160s          | 502s              | 181s              | 197s
covertype       | 107s          | 463s              | 73s               | 67s
poker           | 117s          | 415s              | 99s               | 59s
PAMAP2          | 1,066s        | 7,630s            | 1,683s            | 934s
intrusion       | 667s          | 1,528s            | 241s              | 199s

Information about the datasets used above:

Name            | Features | Samples | Classes | Description
ImageNet subset | 4,096    | 10k     | 10      | Random subset of 10 labels from the 1,000-category ImageNet dataset, processed by the convolutional filters of a trained convolutional neural network (amazingly, attains the same accuracy!)
CIFAR-100       | 3,072    | 50k     | 100     | Same as CIFAR-10, but with more samples and more labels.
covertype       | 57       | 581k    | 7       | Identify tree cover from domain-specific features.
poker           | 11       | 1M      | 10      | Poker hands.
PAMAP2          | 52       | 2.87M   | 13      | Physical activity monitoring.
intrusion       | 41       | 5M      | 24      | Network intrusion.