Wednesday, December 11, 2013

NIPS and the Zuckerberg Visit

This past weekend, several hundred researchers, students, and hobbyists streamed into Lake Tahoe to attend a conference called Neural Information Processing Systems. NIPS is one of the two machine learning conferences of note (the other is ICML). Acceptance rates are low; prestige is high. Anyone interested in machine learning, statistics, applied math, or data can come to the conference, but do expect to be bombarded by 10,000 terms you don't know, even if you have a PhD.

When we arrived and cracked open the workshop schedule, we found something very peculiar: 

“3:00pm – 3:30pm: Q&A with Mark Zuckerberg”

What is the CEO of Facebook doing speaking at an academic conference on machine learning (and, nominally, neuroscience)? There's obviously a porous boundary between the corporate and academic worlds, but has it ever been this porous?

Ordinarily, when employees of Google, Microsoft, or Facebook show up at NIPS, they either (a) keep to the recruiting room or (b) are there to discuss & present research. The latter group, though they might be wearing their employer's logo on a t-shirt, engage with the conference as academics. Their status is derived from their research accomplishments. And this status does not shut down discourse: they will still field questions and suggestions in person from any passing student. These kinds of interactions are encouraged at conferences like NIPS. As a professor once told me when I was a graduate student: “We need you guys, you’re the lifeblood of new ideas.”

In contrast, consider the presence of Mark Zuckerberg. I'm sure someone saw a legitimate need to encircle his Q&A session with armed guards, but nothing screams hierarchy like police at the door. The tone changed rapidly: accomplished professors became little more than lowly researchers shuffling into the Deep Learning workshop to see a Very Important Person speak. Zuckerberg couldn't help but disrupt the conference; the spectacle drew so many people that an adjacent workshop was paused to make room for the overflow. Equally distasteful is what went on behind the scenes. The conference was full of whispered rumors of one-on-one meetings and secret acquisitions. This is the first academic conference I have attended with this much talk about getting rich or being bought out, something that is actually happening to a number of researchers who appeal to Facebook’s ambitions.

As for the content of the Q&A itself? My distrust of excessive power will show itself here (note: Soviet childhood), but I can think of Mark Zuckerberg only as a tunnel visionary. He wants Facebook to connect all the people in the world & have a personalized theory of mind for each user. As he sees it, this is all for the good. Some of the questions asked by the incisive audience were polite versions of “What are the dangers of having this much data about so many people?” and “What does Facebook as a company do to help society?” These Zuckerberg dodged so expertly that by the time he was done “answering” (with a hefty & convincing confidence), I had forgotten exactly what the question was.

Facebook could have easily sent some high-ranking folks to give an interesting & technical talk instead of Zuckerberg coming himself.  His presence was jarring because it subverted the spirit of the conference, and injected into it the distinct aroma of big money.  Was it anything more than a glamorous & sanctioned recruiting visit?  I would have expected the NIPS organizers to decline to endorse such industrial overreach.

The barriers between Silicon Valley and academia are blurry and getting blurrier. Maybe this is to be expected in Zuckerberg's "knowledge economy", where the largest data sets and greatest computational resources are destined to be locked behind corporate doors. However, if academia has any hope of maintaining an atmosphere of open inquiry (rather than just proprietary R&D), academics have to protect their culture. Otherwise, the resulting decline in high-quality, reproducible research will be a loss for everyone involved, and for society at large.
 
In the future, Mark Zuckerberg should be welcome to attend NIPS just like anyone else, assuming he has paid the appropriate registration fee (or obtained a scholarship).  But it is the job of academics (here, the organizers of NIPS) to uphold the necessary boundary between academia and Silicon Valley.  They have failed to do so, and I sincerely hope that this flirtation with Silicon Valley won’t turn into a marriage.
  
This post was written jointly with Alex Rubinsteyn and a bottle of Scotch.

Friday, October 11, 2013

Training Random Forests in Python using the GPU

Random Forests have emerged as a very popular learning algorithm for tackling complex prediction problems. Part of their popularity stems from how remarkably well they work as "black-box" predictors to model nearly arbitrary variable interactions (as opposed to models which are more sensitive to noise and variable scaling). If you want to construct random forests in Python, then the two best options at the moment are scikit-learn (which uses some carefully optimized native code for tree learning) and the slightly speedier closed-source wiseRF.
Both of these implementations use coarse-grained parallelization to split work across multiple cores, but neither of them can make use of an available graphics card for computation. In fact, aside from an implementation custom-tailored for vision tasks and a few research papers of questionable worth, I haven't been able to find a general-purpose GPU implementation of Random Forests at all. So, for the past few months I've been working with Yisheng Liao to see if we could dramatically lower the training time of a Random Forest by off-loading as much computation to the GPU as possible.
We made a GPU Random Forest library for Python called CudaTree, which, for recent generation NVIDIA GPUs, can be 2x - 6x faster than scikit-learn.
CudaTree is available on PyPI and can be installed by running pip install cudatree. It's written for Python 2.7 and depends on NumPy, PyCUDA (to compile and launch CUDA kernels), and Parakeet (to accelerate CPU-side computations).
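To give a sense of how it's used, here is a minimal sketch that trains a small forest on synthetic data. It assumes CudaTree's classifier mirrors scikit-learn's fit/predict interface and accepts an n_estimators argument; the data is random and purely illustrative.

    # Minimal sketch of training a CudaTree forest (assumes a scikit-learn-style API).
    import numpy as np
    from cudatree import RandomForestClassifier

    # Synthetic data purely for illustration: 10,000 samples, 100 features, 10 classes.
    x = np.random.randn(10000, 100).astype(np.float32)
    y = np.random.randint(0, 10, size=10000).astype(np.int32)

    forest = RandomForestClassifier(n_estimators=50)
    forest.fit(x, y)
    predictions = forest.predict(x)
    print("training accuracy: %0.3f" % np.mean(predictions == y))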

Benchmarks

We trained scikit-learn Random Forests (with fifty trees) on four medium-sized datasets (the largest takes up ~500MB of memory) on a 6-core Xeon E5-2630, and then compared this baseline against CudaTree's training times on machines with a variety of NVIDIA graphics processors (a rough sketch of the timing harness appears after the table):

Dataset           scikit-learn   CudaTree (C1060)   CudaTree (C2075)   CudaTree (K20x)   CudaTree (Titan)   Best speedup
CIFAR-10 (raw)    114s           52s                40s                24s               20s                5.7x faster
CIFAR-100         800s           707s               308s               162s              136s               5.8x faster
ImageNet subset   115s           80s                60s                35s               28s                4.1x faster
covertype         138s           -                  86s                -                 50s                2.7x faster
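If you'd like to run a similar comparison on your own hardware, a rough version of the timing loop might look like the following. The CudaTree constructor is assumed to take the same n_estimators argument as scikit-learn's, and the data below is a synthetic stand-in; swap in CIFAR-10, covertype, etc. for a real benchmark.

    # Rough timing harness for comparing scikit-learn and CudaTree on one dataset.
    import time
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier as SklearnRF
    from cudatree import RandomForestClassifier as CudaTreeRF

    def time_fit(model, x, y):
        start = time.time()
        model.fit(x, y)
        return time.time() - start

    # Stand-in data; replace with a real dataset for meaningful numbers.
    x = np.random.randn(10000, 500).astype(np.float32)
    y = np.random.randint(0, 10, size=10000).astype(np.int32)

    t_cpu = time_fit(SklearnRF(n_estimators=50, n_jobs=-1), x, y)  # use all CPU cores
    t_gpu = time_fit(CudaTreeRF(n_estimators=50), x, y)
    print("scikit-learn: %0.1fs  CudaTree: %0.1fs  speedup: %0.1fx" % (t_cpu, t_gpu, t_cpu / t_gpu))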

edit: Thanks to everyone who nudged me to validate the accuracy of CudaTree's generated models; there was actually a bug which resulted in the GPU code stopping early on the covertype data. We had unit tests for training error but none for held-out data. Noob mistake. The latest version of CudaTree fixes this mistake, and the performance impact is negligible on all the other datasets. I'll add new covertype timings once the machines we used are free.

Information about the datasets used above:

Name              Features   Samples   Classes   Description
CIFAR-10          3,072      10,000    10        Raw pixel values, used to classify images of objects.
CIFAR-100         3,072      50,000    100       Same as CIFAR-10, but with more samples and more labels.
ImageNet subset   4,096      10,000    10        Random subset of 10 labels from the 1000-category ImageNet dataset, processed through the convolutional filters of a trained convolutional neural network (which, amazingly, attains the same accuracy!).
covertype         54         581,012   7         Identify tree cover type from 54 domain-specific features.

Limitations

CudaTree currently implements only a RandomForestClassifier, though regression trees are forthcoming. Furthermore, CudaTree's performance degrades when the number of target labels exceeds a few hundred, and it might stop working altogether when the number of labels creeps into the thousands.

Also, there are some datasets which are too large for use with CudaTree. Not only does your data have to fit in the GPU's comparatively smaller memory, it also has to share that memory with CudaTree's auxiliary data structures, which are proportional in size to your data. The exact formula for the largest dataset you can use is a little hairy, but in general try not to exceed about half the size of your GPU's memory.
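As a back-of-the-envelope check on that rule of thumb, you can compare your feature matrix's footprint against the GPU's total memory using PyCUDA (which CudaTree already depends on). The 32-bit float assumption below is just that, an assumption about how the features would be stored, not a guarantee from CudaTree.

    # Rough check of the "half of GPU memory" rule of thumb.
    import numpy as np
    import pycuda.autoinit   # creates a CUDA context on the default device
    import pycuda.driver as cuda

    def probably_fits(x):
        x = np.asarray(x, dtype=np.float32)            # assume 32-bit features
        free_bytes, total_bytes = cuda.mem_get_info()  # (free, total) in bytes
        return x.nbytes < 0.5 * total_bytes

    x = np.random.randn(10000, 3072)
    print("probably fits on the GPU? %s" % probably_fits(x))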

If your data & task fit within these restrictions, then please give CudaTree a spin and let us know how it works for you.

Due credit: Yisheng did most of the coding, debugging, and profiling. So, you know, the actual hard work. I got to sit back and play advisor/"ideas man", which was a lot of fun. Also, the Tesla K20 used for this research was kindly donated by the NVIDIA Corporation. Thanks, NVIDIA!