Friday, June 22, 2012

Which classifiers are fast enough for exploring medium-sized data?

When I'm constructing a classification pipeline, there are usually many choices to be made about feature selection/transformation/extraction. I often find myself thinking things like:
  • "Is number of burps a useful feature?" 
  • "Should I take the log of lollipops per hectare?" 
  • "I don't know what structure this time series actually has, so fuck it, I'll try wavelets..."
  • "Andrew Ng told me learning a sparse overcomplete feature representation might work..."
etc.

While you can choose and manipulate your features by divination, tea leaf reading, or good old-fashioned gut-following, it's better to actually check whether what you're doing improves performance on a validation set. The trouble is that repeatedly training a classifier can become cumbersome once you leave the cozy world of Small Data. Even worse, if you want to use some sort of automatic feature selection algorithm (which requires training hundreds or thousands of classifiers), you might find yourself waiting for a very very long time.
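
As a concrete (and entirely hypothetical) example of that kind of check, a minimal hold-out comparison in scikit-learn might look like the sketch below; the data, the 50/50 split, and the log-transform candidate are stand-ins, not anything from a real pipeline:

    import numpy as np
    from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older releases
    from sklearn.linear_model import SGDClassifier

    # Stand-in data: in practice these would be your real features/labels.
    X = np.random.randn(1000, 50)
    y = np.random.randint(0, 2, size=1000)

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5)

    def accuracy(train_features, val_features):
        clf = SGDClassifier().fit(train_features, y_train)
        return clf.score(val_features, y_val)

    baseline = accuracy(X_train, X_val)
    # "Should I take the log of these features?" -- check instead of guessing.
    logged = accuracy(np.log1p(np.abs(X_train)), np.log1p(np.abs(X_val)))
    print("baseline: %.3f, log-transformed: %.3f" % (baseline, logged))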

So which classifiers can I realistically use for exploring medium-sized data sets (those small enough to fit in memory, but too big for something silly like constructing a kernel matrix)? I tested a variety of scikit-learn's classifiers to see how they compare.
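
For reference, the classifiers compared below correspond roughly to the following scikit-learn estimators (shown with today's module layout; several lived under different names in 2012, e.g. sklearn.lda.LDA). The constructor arguments are illustrative assumptions, not the exact settings behind the timings in the table:

    # Rough scikit-learn counterparts of the classifiers compared below.
    from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                               QuadraticDiscriminantAnalysis)
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import SGDClassifier, LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
    from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                                  GradientBoostingClassifier)

    classifiers = {
        'LDA': LinearDiscriminantAnalysis(),
        'QDA': QuadraticDiscriminantAnalysis(),
        'Linear SVM (liblinear)': LinearSVC(),
        'Linear SVM (SGD)': SGDClassifier(loss='hinge', penalty='l2'),
        'L1-penalized SVM (SGD)': SGDClassifier(loss='hinge', penalty='l1'),
        'Logistic Regression': LogisticRegression(),
        'Gaussian Naive Bayes': GaussianNB(),
        'Decision Tree (CART)': DecisionTreeClassifier(),
        'Random Forest (10 trees)': RandomForestClassifier(n_estimators=10),
        'Extremely Randomized Tree': ExtraTreeClassifier(),
        'Extra Trees (10 trees)': ExtraTreesClassifier(n_estimators=10),
        # for multiclass problems, gradient boosting fits n_estimators trees per class
        'Gradient Boosting (4 trees per class)': GradientBoostingClassifier(n_estimators=4),
    }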

Data 

I generated 50,000 samples from 10 distinct 500-dimensional Gaussians (with random means and random diagonal covariance matrices), and randomly assigned each sample to either a test or training set. This yielded a data set with 10 classes, 500 dimensions, 24,853 training samples and 25,147 test samples.
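
A minimal sketch of how such a data set can be generated (the seed, the spread of the means, and the range of the standard deviations are assumptions, not the original script):

    import numpy as np

    n_classes, n_features, n_samples = 10, 500, 50000
    rng = np.random.RandomState(0)

    # one random mean vector and one random diagonal covariance per class
    means = rng.randn(n_classes, n_features) * 2.0
    stds = rng.rand(n_classes, n_features) + 0.5

    labels = rng.randint(n_classes, size=n_samples)
    X = means[labels] + stds[labels] * rng.randn(n_samples, n_features)

    # random assignment to train/test (roughly half each)
    in_train = rng.rand(n_samples) < 0.5
    X_train, y_train = X[in_train], labels[in_train]
    X_test, y_test = X[~in_train], labels[~in_train]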

In addition to checking the runtime and accuracy of each classifier, I also wanted to know how well they cope with noisy/irrelevant features. To this end, I made a copy of the data with 258 of the 500 features replaced by uniform noise.
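
Continuing the sketch above, the corruption step and a simple timing/accuracy harness might look like this (the uniform [0, 1) noise range is an assumption):

    import time

    # Corrupt a copy of the data: replace 258 of the 500 features with uniform noise.
    corrupt_idx = rng.permutation(n_features)[:258]
    X_train_noisy = X_train.copy()
    X_test_noisy = X_test.copy()
    X_train_noisy[:, corrupt_idx] = rng.rand(len(X_train), 258)
    X_test_noisy[:, corrupt_idx] = rng.rand(len(X_test), 258)

    # Fit a classifier and report wall-clock times plus train/test accuracy.
    def evaluate(clf, X_tr, y_tr, X_te, y_te):
        t0 = time.time()
        clf.fit(X_tr, y_tr)
        train_time = time.time() - t0
        t0 = time.time()
        test_acc = clf.score(X_te, y_te)
        test_time = time.time() - t0
        return train_time, test_time, clf.score(X_tr, y_tr), test_acc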

Results


Algorithm | Train/Test Time | Train/Test Accuracy | Train/Test Accuracy (corrupted features)
Linear Discriminant Analysis | 9.16s / 0.23s | 51.64% / 46.19% | 37.39% / 30.96%
Quadratic Discriminant Analysis | 14.78s / 34.27s | 100% / 100% | 100% / 100%
Linear SVM (trained with liblinear) | 449.9s / 0.49s | 38.81% / 34.95% | 25.26% / 21.27%
Linear SVM (trained with SGD) | 3.8s / 0.13s | 50.13% / 43.68% | 24.4% / 21.38%
L1-penalized SVM (trained with SGD) | 27.49s / 0.13s | 50.29% / 43.39% | 33.81% / 26.5%
Logistic Regression | 19.74s / 0.49s | 51.69% / 45.56% | 37.24% / 30.37%
Gaussian Naive Bayes | 0.73s / 5.14s | 100% / 100% | 100% / 100%
Decision Tree (CART) | 20.4s / 0.1s | 100% / 61% | 100% / 58.9%
Random Forest (10 trees) | 49.45s / 1.01s | 99.96% / 84.32% | 99.93% / 74.2%
Random Forest (20 trees) | 98.62s / 2.02s | 100% / 95.2% | 100% / 88.8%
Random Forest (40 trees) | 197.56s / 4.03s | 100% / 99.2% | 100% / 96.75%
Random Forest (80 trees) | 393.5s / 8.08s | 100% / 99.96% | 100% / 99.26%
Random Forest (160 trees) | 789.2s / 16.01s | 100% / 100% | 100% / 99.82%
Extremely Randomized Tree | 6.58s / 0.1s | 100% / 35.46% | 100% / 29%
Extra Trees (10 trees) | 57.94s / 1.04s | 100% / 72.6% | 100% / 55.4%
Extra Trees (20 trees) | 110.84s / 2.12s | 100% / 90.1% | 100% / 72.56%
Extra Trees (40 trees) | 221.43s / 4.23s | 100% / 98.16% | 100% / 88.99%
Extra Trees (80 trees) | 441.7s / 8.46s | 100% / 99.87% | 100% / 88.99%
Extra Trees (160 trees) | 878.2s / 16.84s | 100% / 100% | 100% / 99.4%
Gradient Boosting (4 trees per class) | 59.93s / 0.23s | 82.9% / 80.59% | 75.22% / 72.91%
Gradient Boosting (8 trees per class) | 118.66s / 0.36s | 93.85% / 92.59% | 89.04% / 87.19%
Gradient Boosting (16 trees per class) | 234.78s / 0.61s | 98.75% / 97.94% | 96.28% / 94.72%
Gradient Boosting (32 trees per class) | 464.89s / 1.11s | 99.9% / 99.72% | 99.2% / 98.52%

Observations

  • If you know the process by which your data was generated, then consider your work done. In this case, the data was sampled from 10 Gaussian distributions and (unsurprisingly) QDA and Gaussian Naive Bayes both do very well. If I had used Gaussians with non-zero off-diagonal covariances (interactions between features), then I would expect Gaussian Naive Bayes to do worse, while QDA would still capture the best possible class boundaries. 
  • SVM and Logistic Regression are very sensitive to the actual margin between classes. I retrained the linear learners with better separated class means and found their test accuracy went up to ~99% (whereas tree-based algorithms only improved slightly). 
  • Stochastic gradient descent for L2-regularized learners is blazing fast. Kudos to Peter Prettenhofer for implementing something which makes every other linear learner look like a slug.
  • Though still fast compared to the glacial alternative of using a convex solver, SGD for L1-penalized learning is noticeably slower. 
  • As advertised, a single Extremely Randomized Tree is terrible by itself. An Extra Tree ensemble, however, seems competitive with Random Forests for clean data, but shows worse degradation when trained on corrupted features. This makes sense, since a Random Forest's tree-growing algorithm will get a chance to skip useless features (and an Extremely Randomized Tree uses all features with equal probability). 
  • Only Naive Bayes and SGD-SVM are cheap enough for use with greedy feature selection (a minimal sketch of what that selection loop looks like follows this list). 
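
To make "greedy feature selection" concrete, here is a minimal sketch of greedy forward selection wrapped around GaussianNB; it illustrates the general idea and is not code from these experiments. Each round trains one classifier per remaining candidate feature, which is why only very cheap learners are practical:

    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score

    def greedy_forward_selection(X, y, max_features=20):
        """Repeatedly add the feature that most improves cross-validated accuracy."""
        selected = []
        best_score = 0.0
        while len(selected) < max_features:
            best_feature = None
            for j in range(X.shape[1]):
                if j in selected:
                    continue
                cols = selected + [j]
                score = cross_val_score(GaussianNB(), X[:, cols], y, cv=3).mean()
                if score > best_score:
                    best_score, best_feature = score, j
            if best_feature is None:  # no remaining feature improves the score
                break
            selected.append(best_feature)
        return selected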

Next Time

  • Since each base learner of a Random Forest took ~5 seconds to train, can a random forest with only 2 or 3 trees give good enough accuracy to be useful?
  • If we use smaller bootstrap samples (training each tree on 10% or even 1% of the data), how badly will accuracy suffer?
  • Do any of the "expensive" algorithms become cheaper on a differently shaped data set (more rows, fewer features)?
  • The SGD learners above performed ~1 million weight updates (the heuristic suggested in the sklearn documentation; see the sketch after this list). Presumably I could get a 10x speedup by using only 10^5 updates. How badly would this hurt generalization/test error?
  • Simulated Gaussian data is totally unrepresentative! How do these algorithms behave on real data? I'm going to try them out on some intra-day currency time series and some other data we found lying around on Kaggle.
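
For the weight-update heuristic mentioned above, a minimal sketch (using the current max_iter parameter name; older scikit-learn releases call it n_iter):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    n_train = 24853  # training set size from above
    # scikit-learn's rule of thumb: SGD tends to converge after seeing ~10^6
    # samples, so a first guess for the number of passes is ceil(10^6 / n).
    full_passes = int(np.ceil(1e6 / n_train))    # ~41 passes over the data
    cheap_passes = int(np.ceil(1e5 / n_train))   # ~5 passes: the hoped-for 10x cheaper run

    full_sgd = SGDClassifier(max_iter=full_passes)
    cheap_sgd = SGDClassifier(max_iter=cheap_passes)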