explain my data: July 2012

Sunday, July 22, 2012

Expensive lessons in Python performance tuning

Over the past year I've done a lot of number crunching in Python with the wonderful ipython/numpy/pandas/sklearn stack and using picloud to parallelize my computations. Picloud is an amazing service which lets you run your code in Amazon's behemoth data centers while hiding most of the configuration headache. Their computational model is essentially a parallel map: you submit a list of inputs and specify what function to map over them and picloud automagically launches that function on multiple machines. This is less powerful than full-blown MapReduce (no parallel reductions) but, luckily, a lot of the computations machine learning people want to perform are trivial to parallelize. These include:

parameter searches (parallelize over distinct parameters)
k-fold cross validation (parallelize over the folds)
training random forests (trees don't need to know about each other)
random starts for locally converging hill climbers like k-means, EM, etc..

As long as you have multiple things to do in parallel and the work for each item greatly exceeds the communication time, then picloud can be a miracle. However, they do charge a premium over the standard AWS instance prices and even Amazon's baseline prices can add up quickly if you're using powerful instance types. When you're running on someone else's hardware the inefficiencies of your code translate very directly into added costs. One month, I managed to rack up an extremely high bill from an embarrassingly small amount of actual work because (1) my Python code created millions of bloated objects in memory, forcing me to use the more expensive high-memory instance type, and (2) I had overlooked some very simple optimizations that later cut my runtime by an order of magnitude.

So, I thought I might save someone else time and money by sharing some tips which made my Python code more efficient without sacrificing flexibility or maintainability.

Profile everything! Profile mercilessly and relentlessly. Let no assumption about efficiency pass without being put under the cold and uncaring gaze of a profiler. This applies to both runtime (use ipython's prun command) and memory consumption (which you can inspect with the over-engineered guppy). Don't "optimize" a damn thing until you know for sure it's a top bottleneck.
Be suspicious of loops (but have faith in Python's core operations). If your bottleneck is a loop performing numerical computations you can almost always speed it up by using some equivalent vectorized NumPy operations. On the other hand, once the obviously vectorizable stuff is off-loaded to library code, then I'm often surprised just how efficient Python's operations can be. I can build lists of startling length, index into dictionaries all day long, construct millions of simple objects, and it all adds up to a small fraction of my total runtime.
Simplest is often fastest. I spent a whole day concocting elaborate buffering and pre-fetching schemes to traverse the lines of a file as quickly as possible without reading it all into memory. I neglected, however, to compare with the trivial solution of simply writing "for line in file:", which of course turned out to be faster and more efficient. Similarly, I spent a while tweaking a CSV parsers before finding that using "string.split" with the standard parsing routines int and float was just as fast.
If you're creating millions of objects, make them namedtuples. In one feature extraction pipeline, I take a multi-million line CSV and parse each line into a Python object. My ever-frustrated internal micro-optimizer screams at the obvious inefficiency. And, sadly, he's right. Python objects are extremely heavy-weight compared to the simple struct we would create in C. Luckily, Guido and Friends foresaw the "zillions of simple objects" use-case and thus created namedtuples. They have named fields like objects, but are laid out contiguously and take up a lot less space. Alternatively, if you're using an object for more than just data storage and want to use non-trivial methods, you can lay out a full-blown object contiguously by filling out the slots attribute. How efficient are they? As a totally unscientific demonstration I constructed a million little vectors with field names "x", "y", and "z". When implemented as ordinary Python objects they took up 400MB of memory, whereas their namedtuple cousins only required 114MB. Still not nearly as compact as they could be, but a dramatic improvement nonetheless!
Don't search linearly through a sorted array, use binary search (duh). On the one hand, this is CS 101 kid's stuff. On the other, I didn't know about the bisect library and had code littered with np.nonzero(sorted<=x)[0]. Once I switched to using bisect_left/bisect_right I saw a huge performance improvement. edit: In the comments, Peter pointed out that NumPy implements a faster binary search called searchsorted; you (and I) should probably use that instead!.
Numpy's "fancy indexing" is slow. If you index into a NumPy array with a range then you get a very lightweight view over the original array; this kind of indexing doesn't copy any data and is almost free. If, however, you index with a boolean vector or a sequence of integers ("fancy indexing"), then you actually incur a significant cost. If your code does a lot of fancy indexing then you should either switch to using the take method or rewrite your indexing as an explicit loop in native code (see #8).
Wes McKinney is a genius. If you're implementing anything Wes McKinney has already put in his library pandas, just stop. His code is faster, more robust, and more likely to be correct than anything you're about to write. Want rolling window aggregators? Use pandas. Need to handle missing data? Use pandas. Are you writing some sort of unbelievably ugly hack which attempts to implement joins and group-by's over NumPy arrays but actually manages to spend 3 hours computing a subtly incorrect result? (I have done this). Jesus Christ, just stop and use pandas.
When all else fails, try weave. Yes, weave is unmaintained, ugly, and hard to debug. But, if you find yourself thinking "if only I could just inline 5 lines of C right in the middle of this Python code", then weave lets you actually do that. If, however, that 5 lines turns into 50 then maybe it's time to learn about Cython.

Also, a few last words specific to machine learning: climb the complexity ladder carefully. It's tempting to throw a fancy tool at some new data, and doubly tempting to try every possible combination of preprocessing steps and hyperparameters. Brute force search and a muscular learning algorithm let you sidestep the need for thinking, experimentation, or understanding. Of course, in my experience, this rarely works out and ends up costing quite a bit of money.

So, do the more prudent mature thing and start simple. Collect simple statistics, plot each feature, and try to understand the data a little before you launch 100,000 jobs on picloud. Even then, start with simpler learners like SGD Logistic Regression or small Random Forests. These will let you explore your choices of preprocessing steps and target functions without eating up too many resources. Once you're more comfortable with the problem, go ahead and train a 1000 layer deep belief network (if that's your thing).

Postscript: After almost a year, this article seems out of date with regard to the various ways you can generate native code. Particularly, before you go reaching for weave, Cython, or (gasp!) even C itself, it might be worthwhile to try selective just-in-time compilation. Both Numba and my own Parakeet will generate efficient native implementations for array-oriented programs. Copperhead is a little more limited semantically (it's really a subset of Haskell masquerading as Python), but if you can squeeze your algorithm into Copperhead's model then you'll get a blazing fast GPU kernel as a reward. Alternatively, you can skip the "pretending to be Python" charade and build data-parallel expression trees directly using Theano. It can be a little rough around the edges but, like Copperhead, if you can get it working then the (many orders of magnitude faster) GPU kernel performance is well worth the trouble.

Wednesday, July 11, 2012

Should you apply PCA to your data?

If you've ever dipped your toe into the cold & murky pool of data processing, you've probably heard of principal component analysis (PCA). PCA is a classy way to reduce the dimensionality of your data, while (purportedly) keeping most of the information. It's ubiquitously well-regarded.

But is it actually a good idea in practice? Should you apply PCA to your data before, for example, learning a classifier? This post will take a small step in the direction of answering this question.

I was inspired to investigate PCA by David MacKay's amusing response to an Amazon review lamenting PCA's absence in MacKay's book:

"Principal Component Analysis" is a dimensionally invalid method that gives people a delusion that they are doing something useful with their data. If you change the units that one of the variables is measured in, it will change all the "principal components"! It's for that reason that I made no mention of PCA in my book. I am not a slavish conformist, regurgitating whatever other people think should be taught. I think before I teach. David J C MacKay.

Ha! He's right, of course. Snarky, but right. The results of PCA depend on the scaling of your data. If, for example, your raw data has one dimension that is on the order of $10^2$ and another on the order of $10^6$, you may run into trouble.

Exploring this point, I'm going to report test classification accuracy before & after applying PCA as a dimensionality reduction technique. Since, as MacKay points out, variable normalizations are important, I tried each of the following combinations of normalization & PCA before classification:

None.
PCA on the raw data.
PCA on sphered data (each dimension has mean 0, variance 1).
PCA on 0-to-1 normalized data (each dimension is squished to be between 0 and 1).
ZCA-whitening on the raw data (a rotation & scaling that results in identity covariance).
PCA on ZCA-whitened data.

Some experimental details:

Each RF was trained to full depth with $100$ trees and $\sqrt{d}$ features sampled at each split. I used MATLAB & this RF package.
The random forest is not sensitive to any dimension-wise normalizations, and that's why I don't bother comparing RF on the raw data to RF on standard normalized & 0-1 normalized data.\The performance is identical! (That's one of many reasons why we <3 random forests).
PCA in the above experiments is always applied as a dimensionality reduction technique - the principal components that explain 99% of the variance are kept, and the rest are thrown out (see details here).
ZCA is usually used as normalization (and not as dimensionality reduction). Rotation does affect the RF, and that's why experiment (5) is included.
PCA and ZCA require the data to have zero mean.
The demeaning, PCA/ZCA transformations, and classifier training were all done on the training data only, and then applied to the held-out test data.

And now, the results in table form.

Dataset	Raw Accuracy	PCA	Sphere + PCA	0-to-1 + PCA	ZCA	ZCA + PCA
proteomics	86.3%	65.4%	83.2%	82.8%	84.3%	82.8%
dnasim	83.1%	75.6%	73.8%	81.5%	85.8%	86.3%
isolet	94.2%	88.6%	87.9%	88.6%	74.2%	87.9%
usps	93.7%	91.2%	90.4%	90.5%	88.2%	88.4%
covertype	86.8%	94.5%	93.5%	94.4%	94.6%	94.5%

Observations:

Applying PCA to the raw data can be disastrous. The proteomics dataset has all kinds of wacky scaling issues, and it shows. Nearly 20% loss in accuracy!
For dnasim, choice of normalization before PCA is significant, but not so much for the other datasets. This demonstrates MacKay's point. In other words: don't just sphere like a "slavish conformist"! Try other normalizations.
Sometimes rotating your data can create problems. ZCA keeps all the dimensions & the accuracy still drops for proteomics, isolet, and USPS. Probably because a bunch of the very noisy dimensions are mixed in with all the others, effectively adding noise where there was little before.
Try ZCA and PCA - you might get a fantastic boost in accuracy. The covertype accuracy in this post is better than every covertype accuracy Alex reported in his previous post.
I also ran these experiments with a 3-tree random forest, and the above trends are still clear. In other words, you can efficiently figure out which combo of normalization and PCA/ZCA is right for your dataset.

There is no simple story here. What these experiments have taught me is (i) don't apply PCA or ZCA blindly but (ii) do try PCA and ZCA, they have the potential to improve performance significantly. Validate your algorithmic choices!

Addendum: a table with dimensionality before & after PCA with various normalizations:

Dataset	Original Dimension	PCA	Sphere + PCA	0-to-1 + PCA	ZCA + PCA
proteomics	109	3	57	24	65
dnasim	403	2	1	3	13
isolet	617	381	400	384	606
usps	256	168	190	168	253
covertype	54	36	48	36	49

Friday, July 6, 2012

Quick classifiers for exploring medium-sized data (redux)

In my last post, I wanted to find out which classifiers are quick enough to allow us to re-train them many times on medium-sized data. A recap of terminology:

Small Data is the magical and care-free land where you can work on a single computer and do something/anything with "all pairs of samples" (evaluate kernels, compute distances, etc...). It's a mostly mythical place, occupied only by Intro ML students and whoever keeps working on Gaussian Processes.
If you can't fit your whole dataset on one machine and/or your computation needs to be distributed, then congratulations: you've got Big Data (now go learn Mahout)
Between Small Data and Big Data is the very common scenario when you are working on a single computer but have too many samples for anything but learning algorithms which scale linearly. Goodbye kernel methods, hello simple boundaries and decision trees.

Previously, I evaluated the runtimes and accuracies of all the scalable classifiers in scikit-learn and found that only Gaussian Naive Bayes and SVM trained by stochastic gradient descent were fast enough. Unfortunately, my evaluation had some obvious shortcomings:

I used a single synthetic dataset-- who knows if my results generalize to "real world" data?
More specifically, the dataset was sampled from 10 multivariate Gaussians, giving a significant advantage to models which assume the distribution underlying the data is Gaussian (ahem...Gaussian Naive Bayes).
I recommended using SGD SVM, but only tried training Logistic Regression with liblinear. However, it's known that linear SVMs and Logistic Regression often get similar performance, so I should have tried SGD Logistic Regression as well.
I trained the SGD SVM with the recommended ~1 million updates, but where does that heuristic even come from? Would fewer updates still yield good accuracy? If so, SGD classifiers could be trained even faster (hard to believe, given how uncannily quick they already are).
All of my tree ensembles (random forests and extra trees) had at least 10 base classifiers. They showed good classification accuracy but were too slow for exploratory use. I should have tried fewer trees to see if I get acceptable accuracy with quicker-to-train ensembles.
Similarly, I never applied regularization to decision trees (either alone or in ensembles) by changing complexity controlling parameters such as the minimum number of samples per leaf. What if I forced the base trees to be simpler? Presumably training would be faster, but would accuracy suffer? I assume that single decision trees might benefit from lowered variance, whereas ensemble methods might do worse.

To get a more complete picture of the speed vs. accuracy landscape, I got three real datasets (UCI's covertype & isolet, as well as some proteomics data) and evaluated scikit-learn's classifiers across a wider swath of parameters. I'll save you from squinting at big tables of numbers and summarize the results first:

Small ensembles of decision trees work! Yes, people will often recommend hundreds of trees (grown to full height). However, the performance of Random Forests seems to degrade gracefully as you limit both the number of trees in the ensemble and the complexity of each individual tree. Whereas 128 full-grown trees might get 88.7% accuracy (on the forest cover data set), a wimpy little duo of 2 shorter trees can get 80.7%. The difference in training times, however, is ~700 seconds vs. ~10 seconds. Squeezing out that extra 8% of accuracy might be worth it when you're constructing your final classification system but not necessarily when you're exploring large space of feature selection/transformation decisions.
Gradient boosting ain't so great...and it sometimes seems to fail catastrophically. Perhaps this is a problem with some default parameters in scikit-learn's implementation?
If you're estimating the variance of a distribution, use smoothing/regularization. Quadratic Discriminant Analysis and Gaussian Naive Bayes break down when a feature appears constant under a particular class label (remember, you're going to be dividing by that zero-variance). Be sure to smooth your variance estimates, either in a rigorous/justified way or by just adding some little epsilon of variance to every feature's estimate.
It doesn't matter whether you use Linear SVM or Logistic Regression. SVM seems to sometime achieve faster training times (since it presumably only has to update weights when a prediction is wrong), whereas Logistic Regression seems to achieve slightly better generalization. But really, I would need to look at many more datasets before I could claim there is substantive difference.
Stochastic gradient descent really is blazing fast-- but also delicate. Linear learners can be trained 10X - 100X faster with stochastic gradient descent but choosing the wrong number of training iterations can sometimes ruin your classifier's accuracy. Perhaps using some stopping criterion might work better than just have a fixed number of epochs?

Recommendations

If you only have enough time to use a single type of model, I would recommend a small Random Forest (with 3-7 trees). If even that is too slow, then try a linear SVM trained using stochastic gradient descent (the ~1 million update heuristic seems reasonable). However, since different classifiers seem to excel on different datasets, it would be best to train several diverse models (the two above, as well as regularized Naive Bayes and LDA) and use the best score any of them can achieve. That way, you might avoid tailoring your features too closely to the assumptions of one particular model.

The Numbers (here be dragoons)

Forest Cover Type

Given features such as soil type, elevation, distance to water sources predict which sort of tree is growing (i.e., "Spruce Fir" vs. "Ponderosa Pine").

Dimensionality: 54
Training samples: 290,504
Test samples: 290,508
Number of classes: 7
Accuracy if we always predict the most common class: 48.86%
Source: UCI

Algorithm	Training / Test Time	Training / Test Accuracy
Gaussian Naive Bayes	1.08s / 2.55s	9.66% / 9.69%
Regularized Gaussian Naive Bayes	1.28s / 1.87s	63.08% / 63.11%
LDA	3.97s / 0.24s	67.94% / 68.02%
QDA	2.92s / 8.94s	8.38% / 8.41%
CART
Decision Tree (min samples per leaf = 1)	14.07s / 0.13s	100.00% / 92.58%
Decision Tree (min samples per leaf = log n = 13)	13.38s / 0.13s	97.35% / 91.73%
Decision Tree (min samples per leaf = sqrt n = 538)	10.40s / 0.11s	81.54% / 80.62%
Support Vector Machine
Linear SVM (SGD, ~250k updates = 1 epochs)	1.14s / 0.17s	69.72% / 69.89%
Linear SVM (SGD, ~1m updates = 4 epochs)	4.38s / 0.17s	71.25% / 71.30%
Linear SVM (SGD, ~2m updates = 7 epochs)	7.67s / 0.17s	71.00% / 71.17%
Linear SVM (liblinear)	50.79s / 0.17s	71.21% / 71.34%
Logistic Regression
Logisic Regression (SGD, ~250k updates = 1 epoch)	1.44s / 0.17s	70.06% / 70.26%
Logisic Regression (SGD, ~1m updates = 4 epochs)	5.61s / 0.17s	71.23% / 71.30%
Logisic Regression (SGD, ~2m updates = 7 epochs)	9.81s / 0.17s	71.15% / 71.24%
Logistic Regression (liblinear)	21.56s / 0.17s	71.44% / 71.54%
Extremely Randomized Trees
Extra-Trees (2 trees)	9.30s / 0.26s	75.30% / 74.23%
Extra-Trees (4 trees)	18.39s / 0.54s	78.49% / 77.18%
Extra-Trees (8 trees)	38.67s / 1.04s	80.47% / 78.71%
Extra-Trees (16 trees)	83.77s / 2.26s	80.84% / 78.98%
Extra-Trees (32 trees)	139.69s / 4.00s	79.57% / 77.92%
Extra-Trees (64 trees)	280.71s / 8.01s	79.57% / 77.98%
Extra-Trees (128 trees)	579.19s / 17.25s	79.37% / 77.78%
Gradient Boosting (sample size = 100%)
Gradient Boosting (sample size = 100%, 2 stumps)	21.33s / 0.32s	67.06% / 67.25%
Gradient Boosting (sample size = 100%, 4 stumps)	41.75s / 0.45s	69.28% / 69.41%
Gradient Boosting (sample size = 100%, 8 stumps)	81.61s / 0.68s	70.53% / 70.58%
Gradient Boosting (sample size = 100%, 16 stumps)	163.75s / 1.19s	72.25% / 72.23%
Gradient Boosting (sample size = 100%, 32 stumps)	327.92s / 2.32s	74.28% / 74.15%
Gradient Boosting (sample size = 100%, 64 stumps)	648.88s / 4.14s	76.22% / 76.08%
Gradient Boosting (sample size = 100%, 128 stumps)	1324.65s / 8.23s	77.99% / 77.80%
Gradient Boosting (sample size = 25%)
Gradient Boosting (sample size = 25%, 2 stumps)	12.69s / 0.32s	67.47% / 67.63%
Gradient Boosting (sample size = 25%, 4 stumps)	24.63s / 0.44s	69.32% / 69.43%
Gradient Boosting (sample size = 25%, 8 stumps)	48.25s / 0.69s	70.57% / 70.67%
Gradient Boosting (sample size = 25%, 16 stumps)	97.67s / 1.20s	72.37% / 72.31%
Gradient Boosting (sample size = 25%, 32 stumps)	192.46s / 2.22s	74.53% / 74.45%
Gradient Boosting (sample size = 25%, 64 stumps)	382.37s / 4.15s	76.52% / 76.38%
Gradient Boosting (sample size = 25%, 128 stumps)	770.73s / 8.49s	78.56% / 78.39%
Random Forests (min samples per leaf = 1)
Random Forest (min samples per leaf = 1, 2 trees)	11.19s / 0.25s	85.85% / 80.84%
Random Forest (min samples per leaf = 1, 4 trees)	22.12s / 0.54s	89.96% / 85.40%
Random Forest (min samples per leaf = 1, 8 trees)	46.53s / 1.14s	92.67% / 87.79%
Random Forest (min samples per leaf = 1, 16 trees)	89.64s / 2.16s	92.79% / 88.34%
Random Forest (min samples per leaf = 1, 32 trees)	177.83s / 4.31s	93.01% / 88.53%
Random Forest (min samples per leaf = 1, 64 trees)	352.32s / 8.41s	93.19% / 88.63%
Random Forest (min samples per leaf = 1, 128 trees)	698.02s / 17.02s	93.26% / 88.70%
Random Forests (min samples per leaf = log n)
Random Forest (min samples per leaf = 13, 2 trees)	10.43s / 0.24s	82.73% / 80.70%
Random Forest (min samples per leaf = 13, 4 trees)	19.62s / 0.45s	85.42% / 83.40%
Random Forest (min samples per leaf = 13, 8 trees)	38.41s / 0.85s	86.44% / 84.49%
Random Forest (min samples per leaf = 13, 16 trees)	74.81s / 1.77s	86.61% / 84.65%
Random Forest (min samples per leaf = 13, 32 trees)	153.10s / 3.36s	87.35% / 85.35%
Random Forest (min samples per leaf = 13, 64 trees)	300.55s / 6.69s	87.43% / 85.40%
Random Forest (min samples per leaf = 13, 128 trees)	608.06s / 16.93s	87.45% / 85.48%
Random Forests (min samples per leaf = sqrt n)
Random Forest (min samples per leaf = 538, 2 trees)	7.00s / 0.19s	71.10% / 70.96%
Random Forest (min samples per leaf = 538, 4 trees)	13.48s / 0.37s	72.39% / 72.28%
Random Forest (min samples per leaf = 538, 8 trees)	27.61s / 0.70s	73.62% / 73.54%
Random Forest (min samples per leaf = 538, 16 trees)	53.80s / 1.36s	74.18% / 74.13%
Random Forest (min samples per leaf = 538, 32 trees)	108.25s / 2.76s	74.29% / 74.17%
Random Forest (min samples per leaf = 538, 64 trees)	212.23s / 5.50s	74.23% / 74.09%
Random Forest (min samples per leaf = 538, 128 trees)	423.53s / 10.91s	74.08% / 74.01%

Proteomics

The task is to classify the mass/charge ratio of peptides using some hand-crafted features of their spectra.

Dimensionality: 109
Training samples: 25,780
Test samples: 25,784
Number of classes: 6
Accuracy if we always predict the most common class: 49.25%
Source: Sergey Feldman

Algorithm	Training / Test Time	Training / Test Accuracy
Gaussian Naive Bayes	0.16s / 0.25s	17.18% / 17.40%
Regularized Gaussian Naive Bayes	0.19s / 0.26s	5.51% / 5.38%
LDA	failed
QDA	failed
CART
Decision Tree (min samples per leaf = 1)	3.44s / 0.01s	100.00% / 77.48%
Decision Tree (min samples per leaf = log n = 11)	3.28s / 0.01s	95.65% / 77.39%
Decision Tree (min samples per leaf = sqrt n = 160)	2.69s / 0.01s	83.71% / 79.42%
Support Vector Machine
Linear SVM (SGD, ~250k updates = 10 epochs)	1.05s / 0.06s	36.81% / 36.89%
Linear SVM (SGD, ~1m updates = 39 epochs)	3.98s / 0.06s	61.49% / 61.40%
Linear SVM (SGD, ~2m updates = 78 epochs)	8.04s / 0.06s	47.92% / 46.92%
Linear SVM (liblinear)	109.26s / 0.07s	56.18% / 56.18%
Logistic Regression
Logisic Regression (SGD, ~250k updates = 10 epochs)	1.08s / 0.06s	37.84% / 37.87%
Logisic Regression (SGD, ~1m updates = 39 epochs)	4.12s / 0.06s	37.16% / 37.29%
Logisic Regression (SGD, ~2m updates = 78 epochs)	8.22s / 0.06s	51.12% / 50.48%
Logistic Regression (liblinear)	280.85s / 0.07s	58.66% / 58.18%
Extremely Randomized Trees
Extra-Trees (2 trees)	3.81s / 0.04s	100.00% / 72.67%
Extra-Trees (4 trees)	7.88s / 0.08s	100.00% / 78.80%
Extra-Trees (8 trees)	15.17s / 0.14s	100.00% / 82.09%
Extra-Trees (16 trees)	29.39s / 0.28s	100.00% / 84.15%
Extra-Trees (64 trees)	119.16s / 1.14s	100.00% / 85.53%
Extra-Trees (32 trees)	58.78s / 0.59s	100.00% / 85.09%
Extra-Trees (128 trees)	233.06s / 2.23s	100.00% / 85.89%
Gradient Boosting (sample size = 100%)
Gradient Boosting (sample size = 100%, 2 stumps)	3.99s / 0.03s	58.70% / 58.49%
Gradient Boosting (sample size = 100%, 4 stumps)	7.82s / 0.04s	56.73% / 56.49%
Gradient Boosting (sample size = 100%, 8 stumps)	15.37s / 0.07s	58.73% / 58.62%
Gradient Boosting (sample size = 100%, 16 stumps)	31.09s / 0.11s	43.50% / 43.57%
Gradient Boosting (sample size = 100%, 32 stumps)	61.36s / 0.21s	42.71% / 42.79%
Gradient Boosting (sample size = 100%, 64 stumps)	118.30s / 0.39s	15.08% / 15.31%
Gradient Boosting (sample size = 100%, 128 stumps)	230.85s / 0.78s	20.24% / 20.04%
Gradient Boosting (sample size = 25%)
Gradient Boosting (sample size = 25%, 2 stumps)	2.17s / 0.03s	46.67% / 46.51%
Gradient Boosting (sample size = 25%, 4 stumps)	4.14s / 0.04s	44.78% / 44.85%
Gradient Boosting (sample size = 25%, 8 stumps)	8.10s / 0.07s	43.36% / 43.28%
Gradient Boosting (sample size = 25%, 16 stumps)	16.17s / 0.12s	45.05% / 45.04%
Gradient Boosting (sample size = 25%, 32 stumps)	31.77s / 0.22s	22.70% / 22.88%
Gradient Boosting (sample size = 25%, 64 stumps)	61.40s / 0.42s	32.85% / 33.07%
Gradient Boosting (sample size = 25%, 128 stumps)	100.16s / 0.65s	49.26% / 49.25%
Random Forests (min samples per leaf = 1)
Random Forest (min samples per leaf = 1, 2 trees)	2.57s / 0.03s	90.58% / 73.65%
Random Forest (min samples per leaf = 1, 4 trees)	5.07s / 0.07s	96.53% / 79.84%
Random Forest (min samples per leaf = 1, 8 trees)	10.20s / 0.13s	98.88% / 83.20%
Random Forest (min samples per leaf = 1, 16 trees)	20.47s / 0.27s	99.66% / 84.79%
Random Forest (min samples per leaf = 1, 32 trees)	40.82s / 0.54s	99.95% / 85.78%
Random Forest (min samples per leaf = 1, 64 trees)	82.00s / 1.07s	99.99% / 86.11%
Random Forest (min samples per leaf = 1, 128 trees)	165.42s / 2.17s	100.00% / 86.40%
Random Forests (min samples per leaf = log n)
Random Forest (min samples per leaf = 11, 2 trees)	2.17s / 0.03s	87.35% / 80.06%
Random Forest (min samples per leaf = 11, 4 trees)	4.38s / 0.07s	89.78% / 82.80%
Random Forest (min samples per leaf = 11, 8 trees)	8.75s / 0.13s	91.41% / 84.13%
Random Forest (min samples per leaf = 11, 16 trees)	17.35s / 0.25s	92.00% / 84.94%
Random Forest (min samples per leaf = 11, 32 trees)	35.08s / 0.51s	92.47% / 85.51%
Random Forest (min samples per leaf = 11, 64 trees)	70.10s / 1.02s	92.56% / 85.74%
Random Forest (min samples per leaf = 11, 128 trees)	139.47s / 2.01s	92.60% / 85.84%
Random Forests (min samples per leaf = sqrt n)
Random Forest (min samples per leaf = 160, 2 trees)	1.61s / 0.03s	79.59% / 78.49%
Random Forest (min samples per leaf = 160, 4 trees)	3.22s / 0.05s	81.67% / 80.91%
Random Forest (min samples per leaf = 160, 8 trees)	6.37s / 0.10s	82.60% / 81.99%
Random Forest (min samples per leaf = 160, 16 trees)	12.83s / 0.21s	83.17% / 82.50%
Random Forest (min samples per leaf = 160, 32 trees)	26.01s / 0.42s	83.23% / 82.68%
Random Forest (min samples per leaf = 160, 64 trees)	51.79s / 0.82s	83.41% / 82.90%
Random Forest (min samples per leaf = 160, 128 trees)	102.06s / 1.64s	83.41% / 82.94%

Isolet

Recognize which letter is being spoken from hand-crafted auditory features. I don't really think this dataset is large enough to qualify as "medium-sized", but I still threw it in for comparison.

Dimensionality: 617
Training samples: 6,238
Test samples: 1,559
Number of classes: 26
Baseline test accuracy: 3.85%
Source: UCI

Algorithm	Training / Test Time	Training / Test Accuracy
Gaussian Naive Bayes	0.46s / 0.58s	84.23% / 80.12%
Regularized Gaussian Naive Bayes	0.51s / 0.58s	89.87% / 88.65%
LDA	4.18s / 0.22s	97.52% / 94.16%
QDA	failed
CART
Decision Tree (min samples per leaf = 1)	4.52s / 0.00s	100.00% / 79.92%
Decision Tree (min samples per leaf = log n = 9)	4.39s / 0.00s	95.70% / 79.54%
Decision Tree (min samples per leaf = sqrt n = 78)	4.03s / 0.00s	88.35% / 79.28%
Support Vector Machine
Linear SVM (SGD, ~250k updates / 41 epochs)	9.12s / 0.06s	99.90% / 95.00%
Linear SVM (SGD, ~1 updates / 161 epochs)	35.12s / 0.07s	100.00% / 94.48%
Linear SVM (SGD, ~2m updates / 321 epochs)	70.07s / 0.06s	100.00% / 94.36%
Linear SVM (liblinear)	13.10s / 0.05s	100.00% / 94.55%
Logistic Regression
Logisic Regression (SGD, ~250k updates / 41 epochs)	19.39s / 0.06s	99.98% / 95.00%
Logisic Regression (SGD, ~1m updates / 161 epochs)	74.00s / 0.06s	100.00% / 95.57%
Logisic Regression (SGD, ~2m updates / 321 epochs)	147.53s / 0.06s	100.00% / 95.57%
Logistic Regression (liblinear)	42.67s / 0.05s	99.97% / 95.51%
Extremely Randomized Trees
Extra-Trees (2 trees)	2.52s / 0.01s	100.00% / 66.77%
Extra-Trees (4 trees)	4.75s / 0.02s	100.00% / 79.09%
Extra-Trees (8 trees)	9.34s / 0.03s	100.00% / 87.94%
Extra-Trees (16 trees)	18.39s / 0.06s	100.00% / 92.05%
Extra-Trees (32 trees)	36.38s / 0.11s	100.00% / 93.46%
Extra-Trees (64 trees)	72.55s / 0.22s	100.00% / 94.74%
Extra-Trees (128 trees)	144.06s / 0.44s	100.00% / 95.38%
Gradient Boosting (sample size = 100%)
Gradient Boosting (sample size = 100%, 2 stumps)	20.19s / 0.01s	4.14% / 4.43%
Gradient Boosting (sample size = 100%, 4 stumps)	40.19s / 0.01s	3.91% / 4.87%
Gradient Boosting (sample size = 100%, 8 stumps)	80.14s / 0.02s	4.09% / 4.30%
Gradient Boosting (sample size = 100%, 16 stumps)	160.47s / 0.03s	3.77% / 4.62%
Gradient Boosting (sample size = 100%, 32 stumps)	319.10s / 0.08s	3.77% / 4.43%
Gradient Boosting (sample size = 100%, 64 stumps)	630.17s / 0.13s	3.82% / 4.36%
Gradient Boosting (sample size = 100%, 128 stumps)	1251.35s / 0.25s	3.69% / 3.85%
Gradient Boosting (sample size = 25%)
Gradient Boosting (sample size = 25%, 2 stumps)	10.08s / 0.01s	3.70% / 3.98%
Gradient Boosting (sample size = 25%, 4 stumps)	19.79s / 0.01s	3.72% / 3.85%
Gradient Boosting (sample size = 25%, 8 stumps)	39.54s / 0.02s	3.49% / 3.66%
Gradient Boosting (sample size = 25%, 16 stumps)	63.67s / 0.03s	3.85% / 3.85%
Gradient Boosting (sample size = 25%, 32 stumps)	83.68s / 0.02s	3.85% / 3.85%
Gradient Boosting (sample size = 25%, 64 stumps)	124.81s / 0.04s	3.85% / 3.85%
Gradient Boosting (sample size = 25%, 128 stumps)	203.56s / 0.05s	3.85% / 3.85%
Random Forests (min samples per leaf = 1)
Random Forest (min samples per leaf = 1, 2 trees)	2.60s / 0.01s	88.12% / 66.32%
Random Forest (min samples per leaf = 1, 4 trees)	5.45s / 0.02s	97.74% / 81.46%
Random Forest (min samples per leaf = 1, 8 trees)	10.46s / 0.03s	99.71% / 88.65%
Random Forest (min samples per leaf = 1, 16 trees)	21.04s / 0.05s	99.98% / 92.05%
Random Forest (min samples per leaf = 1, 32 trees)	42.44s / 0.11s	100.00% / 93.52%
Random Forest (min samples per leaf = 1, 64 trees)	83.37s / 0.21s	100.00% / 94.10%
Random Forest (min samples per leaf = 1, 128 trees)	168.74s / 0.43s	100.00% / 93.97%
Random Forests (min samples per leaf = log n)
Random Forest (min samples per leaf = 9, 2 trees)	2.32s / 0.01s	87.48% / 76.78%
Random Forest (min samples per leaf = 9, 4 trees)	4.66s / 0.01s	93.78% / 85.63%
Random Forest (min samples per leaf = 9, 8 trees)	9.32s / 0.03s	96.73% / 89.35%
Random Forest (min samples per leaf = 9, 16 trees)	18.82s / 0.05s	98.12% / 92.62%
Random Forest (min samples per leaf = 9, 32 trees)	37.40s / 0.10s	98.59% / 93.33%
Random Forest (min samples per leaf = 9, 64 trees)	75.41s / 0.21s	98.75% / 94.03%
Random Forest (min samples per leaf = 9, 128 trees)	152.74s / 0.44s	98.86% / 94.16%
Random Forests (min samples per leaf = sqrt n)
Random Forest (min samples per leaf = 78, 2 trees)	2.01s / 0.01s	73.87% / 69.53%
Random Forest (min samples per leaf = 78, 4 trees)	4.02s / 0.01s	82.57% / 78.51%
Random Forest (min samples per leaf = 78, 8 trees)	8.00s / 0.02s	88.14% / 85.05%
Random Forest (min samples per leaf = 78, 16 trees)	15.87s / 0.05s	90.85% / 88.65%
Random Forest (min samples per leaf = 78, 32 trees)	31.94s / 0.11s	91.66% / 90.64%
Random Forest (min samples per leaf = 78, 64 trees)	64.22s / 0.20s	92.29% / 90.38%
Random Forest (min samples per leaf = 78, 128 trees)	128.47s / 0.40s	92.40% / 90.96%

Tuesday, July 3, 2012

Speeding Up Greedy Feature Selection

In the last post Alex wrote:

"While you can choose and manipulate your features by divination, tea leaf reading, or good old-fashioned gut-following, it's better to actually check whether what you're doing improves performance on a validation set. The trouble is that repeatedly training a classifier can become cumbersome once you leave the cozy world of Small Data. Even worse, if you want to use some sort of automatic feature selection algorithm (which requires training hundreds or thousands of classifiers), you might find yourself waiting for a very very long time."

My favorite variant of feature selection is greedy forward feature selection (GFFS). First, some notation. We have $N$ data points $(x_i,y_i)$ where $x_i$ is a $d$-dimensional feature descriptor and $y_i$ is its label, and we want to figure out which of the $d$ dimensions are useful, or maybe rank the dimensions in order of usefulness. GFFS adds one feature dimension at a time to a set of already selected features, and checks how good that feature is by training & testing classifiers on $k$ cross-validation splits. The best feature (in terms of average accuracy) is then added to the set of selected features, and the the next iteration begins.

For example, let's say you've already selected feature number $3$. The algorithm trains and tests the following feature combinations and obtains the corresponding accuracies:

[3 1] - 65.5%
[3 2] - 63.5%
[3 4] - 67.5%
[3 5] - 70.5% <--- best
...
[3 d] - 62.5%

Feature 3 got the best average hold-out accuracy in combination with feature 5. So feature 5 is added to the already-chosen set, and the algorithm goes on to the next stage.

The computation time for a fully naive implementation of GFFS is... stupidly big: $O(d^2 k C)$, where $C$ is the complexity of the classifier being used. In the last post, Alex investigated which classifier is efficient enough to make this feasible. I found that even using the fastest classifier, the time actual runtime complexity was too ridiculous to be practical (unless you have a cluster). So... how can we practically speed up GFFS?

Some Ideas

I implemented and tested two ideas:

Early stopping - stop the outer loop if there has been no improvement for $p$ epochs ($p=5$ in all my experiments).
Random subsampling - during each epoch, only test $\sqrt{d}$ random dimensions instead of all $d$ dimensions.

Data & Experiments

I tested the above two ideas as follows:

Split the data into training and test. Get test accuracy using all $d$ dimensions
Send the training data only into GFFS.
Get test accuracy using whichever dimensions are returned from (2).

The best case scenario is that GFFS will get rid of a bunch of noisy dimensions and test accuracy will actually improve. An acceptable scenario is that GFFS will get rid of a bunch of redundant dimensions and test accuracy will stay the same.

For the classifier, I used quadratic discriminant analysis (QDA), and added $0.1 I$ to the covariance estimates for regularization (chosen without validation).

I tested on three standard datasets: isolet, usps, covertype (with my own training/test split), and two datasets from my research: proteomics and dnasim.

Results

First, the accuracies before and after GFFS. ES means early stopping. RS means random subsampling. A fully naive version of GFFS without early stopping wasn't even close to done running after a whole weekend, so I just stopped it.

Datasets	QDA Test Accuracy	Early Stopping	Early Stopping + Random Subsampling
proteomics	66.8%	79.9%	74.7%
dnasim	64.7%	71.7%	64.8%
isolet	95.9%	94.0%	94.6%
usps	94.0%	92.7%	93.9%
covertype	62.9%	64.0%	62.5%

Using the exhaustive GFFS (with early stopping) can provide some significant gains (proteomics & dnasim), but can also damage performance a bit (isolet and usps). Adding random subsampling decreases gains somewhat, but not entirely, and can also undo some of the damage caused by feature selection (because more features are usually chosen when using RS). Looks pretty good, but let's look at the actual runtimes (reported in minutes):

Datasets	Early Stopping	Early Stopping + Random Subsampling
proteomics	5.1 min	0.5 min
dnasim	0.5 min	0.1 min
isolet	698.6 min	41.2 min
usps	115.4 min	4.6 min
covertype	99.5 min	46.0 min

There we go! The time savings range from 50% to 99%.

In conclusion: greedy forward feature selection is great, but way too expensive. It is still great and much faster if you add (i) early stopping and (ii) random subsampling.

Next time (maybe):

Everyone loves PCA. Is preprocessing your data with PCA actually a good idea in terms of classification performance?
Feature transforms. There are a billion of them. Which ones are a good idea, and when?