Friday, July 6, 2012

Quick classifiers for exploring medium-sized data (redux)

In my last post, I wanted to find out which classifiers are quick enough to allow us to re-train them many times on medium-sized data. A recap of terminology:
  • Small Data is the magical and care-free land where you can work on a single computer and do something/anything with "all pairs of samples" (evaluate kernels, compute distances, etc...). It's a mostly mythical place, occupied only by Intro ML students and whoever keeps working on Gaussian Processes.
  • If you can't fit your whole dataset on one machine and/or your computation needs to be distributed, then congratulations: you've got Big Data (now go learn Mahout)
  • Between Small Data and Big Data is the very common scenario where you're working on a single computer but have too many samples for anything except learning algorithms that scale linearly. Goodbye kernel methods, hello simple boundaries and decision trees.
Previously, I evaluated the runtimes and accuracies of all the scalable classifiers in scikit-learn and found that only Gaussian Naive Bayes and SVM trained by stochastic gradient descent were fast enough. Unfortunately, my evaluation had some obvious shortcomings:
  • I used a single synthetic dataset-- who knows if my results generalize to "real world" data?
  • More specifically, the dataset was sampled from 10 multivariate Gaussians, giving a significant advantage to models which assume the distribution underlying the data is Gaussian (ahem...Gaussian Naive Bayes).
  • I recommended using SGD SVM, but only tried training Logistic Regression with liblinear. However, it's known that linear SVMs and Logistic Regression often get similar performance, so I should have tried SGD Logistic Regression as well.
  • I trained the SGD SVM with the recommended ~1 million updates, but where does that heuristic even come from? Would fewer updates still yield good accuracy? If so, SGD classifiers could be trained even faster (hard to believe, given how uncannily quick they already are). The sketch after this list shows how an update budget translates into a number of epochs.
  • All of my tree ensembles (random forests and extra trees) had at least 10 base classifiers. They showed good classification accuracy but were too slow for exploratory use. I should have tried fewer trees to see if I get acceptable accuracy with quicker-to-train ensembles.
  • Similarly, I never applied regularization to decision trees (either alone or in ensembles) by changing complexity controlling parameters such as the minimum number of samples per leaf. What if I forced the base trees to be simpler? Presumably training would be faster, but would accuracy suffer? I assume that single decision trees might benefit from lowered variance, whereas ensemble methods might do worse.
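For reference, here's a minimal sketch (not the code I actually benchmarked with) of how the "~1 million updates" budget translates into scikit-learn's SGDClassifier, which counts training length in epochs. Parameter names follow current scikit-learn (older releases call max_iter n_iter), and swapping the hinge loss for a logistic loss gives the SGD Logistic Regression variant mentioned above.

```python
# Minimal sketch, assuming SGD performs one weight update per training sample
# per epoch, so a budget of ~1M updates becomes ceil(1e6 / n_samples) epochs.
import numpy as np
from sklearn.linear_model import SGDClassifier

def sgd_linear_svm_for_budget(X, y, update_budget=1e6):
    n_samples = X.shape[0]
    n_epochs = max(1, int(np.ceil(update_budget / n_samples)))
    clf = SGDClassifier(loss='hinge',       # hinge loss = linear SVM; a logistic
                                            # loss gives logistic regression
                        penalty='l2',
                        max_iter=n_epochs,  # called n_iter in older scikit-learn
                        tol=None)           # disable early stopping so the full
                                            # update budget is actually used
    return clf.fit(X, y)
```

On the forest cover training set (~290k samples) this works out to the 4 epochs reported in the tables below.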
To get a more complete picture of the speed vs. accuracy landscape, I got three real datasets (UCI's covertype & isolet, as well as some proteomics data) and evaluated scikit-learn's classifiers across a wider swath of parameters. I'll save you from squinting at big tables of numbers and summarize the results first:
  • Small ensembles of decision trees work! Yes, people will often recommend hundreds of trees (grown to full height). However, the performance of Random Forests seems to degrade gracefully as you limit both the number of trees in the ensemble and the complexity of each individual tree. Whereas 128 full-grown trees might get 88.7% accuracy (on the forest cover data set), a wimpy little duo of 2 shorter trees can get 80.7%. The difference in training times, however, is ~700 seconds vs. ~10 seconds. Squeezing out that extra 8% of accuracy might be worth it when you're constructing your final classification system, but not necessarily when you're exploring a large space of feature selection/transformation decisions (see the first sketch after this list).
  • Gradient boosting ain't so great...and it sometimes seems to fail catastrophically (on isolet its accuracy hovers around the most-common-class baseline). Perhaps this is a problem with some default parameters in scikit-learn's implementation?
  • If you're estimating the variance of a distribution, use smoothing/regularization. Quadratic Discriminant Analysis and Gaussian Naive Bayes break down when a feature appears constant under a particular class label (remember, you're going to be dividing by that zero variance). Be sure to smooth your variance estimates, either in a rigorous/justified way or by just adding some little epsilon of variance to every feature's estimate (see the second sketch after this list).
  • It doesn't matter whether you use Linear SVM or Logistic Regression. SVM seems to sometimes achieve faster training times (since it presumably only has to update weights when a prediction is wrong), whereas Logistic Regression seems to achieve slightly better generalization. But really, I would need to look at many more datasets before I could claim there is a substantive difference.
  • Stochastic gradient descent really is blazing fast-- but also delicate. Linear learners can be trained 10X - 100X faster with stochastic gradient descent, but choosing the wrong number of training iterations can sometimes ruin your classifier's accuracy. Perhaps using some stopping criterion would work better than just fixing the number of epochs in advance?
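To make the "small, regularized forest" idea concrete, here's a rough sketch; the particular settings (2 trees, min samples per leaf = ceil(log n), all cores) just mirror one row of the tables below and aren't meant as universal defaults.

```python
# Sketch of a quick-and-dirty forest: a few trees, each kept fairly shallow.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def quick_forest(X_train, y_train, n_trees=2):
    n = X_train.shape[0]
    forest = RandomForestClassifier(
        n_estimators=n_trees,                      # tiny ensemble for speed
        min_samples_leaf=int(np.ceil(np.log(n))),  # natural log of n, rounded up,
                                                   # as in the "log n" rows below
        n_jobs=-1)                                 # grow the trees in parallel
    return forest.fit(X_train, y_train)
```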
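And here's the "just add a little epsilon" fix, sketched on raw per-class variance estimates rather than on any particular classifier. Scaling epsilon by the largest overall feature variance keeps it proportionate to the data; recent scikit-learn releases expose essentially the same idea as GaussianNB's var_smoothing parameter.

```python
# Sketch: add a small, data-scaled epsilon to every per-class variance estimate
# so that features which look constant within a class don't produce zeros.
import numpy as np

def smoothed_class_variances(X, y, eps_fraction=1e-9):
    classes = np.unique(y)
    # tie epsilon to the overall scale of the data, not an arbitrary constant
    eps = eps_fraction * X.var(axis=0).max()
    return {c: X[y == c].var(axis=0) + eps for c in classes}
```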

Recommendations

If you only have enough time to use a single type of model, I would recommend a small Random Forest (with 3-7 trees). If even that is too slow, then try a linear SVM trained using stochastic gradient descent (the ~1 million update heuristic seems reasonable). However, since different classifiers seem to excel on different datasets, it would be best to train several diverse models (the two above, as well as regularized Naive Bayes and LDA) and use the best score any of them can achieve. That way, you might avoid tailoring your features too closely to the assumptions of one particular model.
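Here's roughly what "train several diverse models and keep the best score" looks like in code; the specific models and settings are illustrative (plain GaussianNB stands in for a regularized variant, and import paths follow current scikit-learn).

```python
# Sketch: fit a handful of cheap, diverse classifiers and report the best one.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB

def best_quick_model(X_train, y_train, X_test, y_test):
    candidates = {
        'small random forest': RandomForestClassifier(n_estimators=5,
                                                      min_samples_leaf=10),
        'linear SVM (SGD)': SGDClassifier(loss='hinge', max_iter=4, tol=None),
        'Gaussian Naive Bayes': GaussianNB(),
        'LDA': LinearDiscriminantAnalysis(),
    }
    scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
              for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores
```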

The Numbers (here be dragoons)

Forest Cover Type

Given features such as soil type, elevation, and distance to water sources, predict which sort of tree is growing (e.g., "Spruce Fir" vs. "Ponderosa Pine").
  • Dimensionality: 54
  • Training samples: 290,504
  • Test samples: 290,508
  • Number of classes: 7
  • Accuracy if we always predict the most common class: 48.86%
  • Source: UCI
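(If you want to poke at this dataset yourself, current scikit-learn ships a fetch_covtype loader; the snippet below is only a sketch of the majority-class baseline and won't reproduce my exact 50/50 train/test split, so the numbers will differ slightly.)

```python
# Sketch: load the UCI forest cover type data and compute the accuracy of
# always predicting the most common class.
import numpy as np
from sklearn.datasets import fetch_covtype

covtype = fetch_covtype()
_, counts = np.unique(covtype.target, return_counts=True)
print("majority-class accuracy: %.2f%%"
      % (100.0 * counts.max() / len(covtype.target)))
```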
Algorithm | Training / Test Time | Training / Test Accuracy
Gaussian Naive Bayes | 1.08s / 2.55s | 9.66% / 9.69%
Regularized Gaussian Naive Bayes | 1.28s / 1.87s | 63.08% / 63.11%
LDA | 3.97s / 0.24s | 67.94% / 68.02%
QDA | 2.92s / 8.94s | 8.38% / 8.41%
CART
Decision Tree (min samples per leaf = 1) | 14.07s / 0.13s | 100.00% / 92.58%
Decision Tree (min samples per leaf = log n = 13) | 13.38s / 0.13s | 97.35% / 91.73%
Decision Tree (min samples per leaf = sqrt n = 538) | 10.40s / 0.11s | 81.54% / 80.62%
Support Vector Machine
Linear SVM (SGD, ~250k updates = 1 epoch) | 1.14s / 0.17s | 69.72% / 69.89%
Linear SVM (SGD, ~1m updates = 4 epochs) | 4.38s / 0.17s | 71.25% / 71.30%
Linear SVM (SGD, ~2m updates = 7 epochs) | 7.67s / 0.17s | 71.00% / 71.17%
Linear SVM (liblinear) | 50.79s / 0.17s | 71.21% / 71.34%
Logistic Regression
Logistic Regression (SGD, ~250k updates = 1 epoch) | 1.44s / 0.17s | 70.06% / 70.26%
Logistic Regression (SGD, ~1m updates = 4 epochs) | 5.61s / 0.17s | 71.23% / 71.30%
Logistic Regression (SGD, ~2m updates = 7 epochs) | 9.81s / 0.17s | 71.15% / 71.24%
Logistic Regression (liblinear) | 21.56s / 0.17s | 71.44% / 71.54%
Extremely Randomized Trees
Extra-Trees (2 trees) | 9.30s / 0.26s | 75.30% / 74.23%
Extra-Trees (4 trees) | 18.39s / 0.54s | 78.49% / 77.18%
Extra-Trees (8 trees) | 38.67s / 1.04s | 80.47% / 78.71%
Extra-Trees (16 trees) | 83.77s / 2.26s | 80.84% / 78.98%
Extra-Trees (32 trees) | 139.69s / 4.00s | 79.57% / 77.92%
Extra-Trees (64 trees) | 280.71s / 8.01s | 79.57% / 77.98%
Extra-Trees (128 trees) | 579.19s / 17.25s | 79.37% / 77.78%
Gradient Boosting (sample size = 100%)
Gradient Boosting (sample size = 100%, 2 stumps) | 21.33s / 0.32s | 67.06% / 67.25%
Gradient Boosting (sample size = 100%, 4 stumps) | 41.75s / 0.45s | 69.28% / 69.41%
Gradient Boosting (sample size = 100%, 8 stumps) | 81.61s / 0.68s | 70.53% / 70.58%
Gradient Boosting (sample size = 100%, 16 stumps) | 163.75s / 1.19s | 72.25% / 72.23%
Gradient Boosting (sample size = 100%, 32 stumps) | 327.92s / 2.32s | 74.28% / 74.15%
Gradient Boosting (sample size = 100%, 64 stumps) | 648.88s / 4.14s | 76.22% / 76.08%
Gradient Boosting (sample size = 100%, 128 stumps) | 1324.65s / 8.23s | 77.99% / 77.80%
Gradient Boosting (sample size = 25%)
Gradient Boosting (sample size = 25%, 2 stumps) | 12.69s / 0.32s | 67.47% / 67.63%
Gradient Boosting (sample size = 25%, 4 stumps) | 24.63s / 0.44s | 69.32% / 69.43%
Gradient Boosting (sample size = 25%, 8 stumps) | 48.25s / 0.69s | 70.57% / 70.67%
Gradient Boosting (sample size = 25%, 16 stumps) | 97.67s / 1.20s | 72.37% / 72.31%
Gradient Boosting (sample size = 25%, 32 stumps) | 192.46s / 2.22s | 74.53% / 74.45%
Gradient Boosting (sample size = 25%, 64 stumps) | 382.37s / 4.15s | 76.52% / 76.38%
Gradient Boosting (sample size = 25%, 128 stumps) | 770.73s / 8.49s | 78.56% / 78.39%
Random Forests (min samples per leaf = 1)
Random Forest (min samples per leaf = 1, 2 trees) | 11.19s / 0.25s | 85.85% / 80.84%
Random Forest (min samples per leaf = 1, 4 trees) | 22.12s / 0.54s | 89.96% / 85.40%
Random Forest (min samples per leaf = 1, 8 trees) | 46.53s / 1.14s | 92.67% / 87.79%
Random Forest (min samples per leaf = 1, 16 trees) | 89.64s / 2.16s | 92.79% / 88.34%
Random Forest (min samples per leaf = 1, 32 trees) | 177.83s / 4.31s | 93.01% / 88.53%
Random Forest (min samples per leaf = 1, 64 trees) | 352.32s / 8.41s | 93.19% / 88.63%
Random Forest (min samples per leaf = 1, 128 trees) | 698.02s / 17.02s | 93.26% / 88.70%
Random Forests (min samples per leaf = log n)
Random Forest (min samples per leaf = 13, 2 trees) | 10.43s / 0.24s | 82.73% / 80.70%
Random Forest (min samples per leaf = 13, 4 trees) | 19.62s / 0.45s | 85.42% / 83.40%
Random Forest (min samples per leaf = 13, 8 trees) | 38.41s / 0.85s | 86.44% / 84.49%
Random Forest (min samples per leaf = 13, 16 trees) | 74.81s / 1.77s | 86.61% / 84.65%
Random Forest (min samples per leaf = 13, 32 trees) | 153.10s / 3.36s | 87.35% / 85.35%
Random Forest (min samples per leaf = 13, 64 trees) | 300.55s / 6.69s | 87.43% / 85.40%
Random Forest (min samples per leaf = 13, 128 trees) | 608.06s / 16.93s | 87.45% / 85.48%
Random Forests (min samples per leaf = sqrt n)
Random Forest (min samples per leaf = 538, 2 trees) | 7.00s / 0.19s | 71.10% / 70.96%
Random Forest (min samples per leaf = 538, 4 trees) | 13.48s / 0.37s | 72.39% / 72.28%
Random Forest (min samples per leaf = 538, 8 trees) | 27.61s / 0.70s | 73.62% / 73.54%
Random Forest (min samples per leaf = 538, 16 trees) | 53.80s / 1.36s | 74.18% / 74.13%
Random Forest (min samples per leaf = 538, 32 trees) | 108.25s / 2.76s | 74.29% / 74.17%
Random Forest (min samples per leaf = 538, 64 trees) | 212.23s / 5.50s | 74.23% / 74.09%
Random Forest (min samples per leaf = 538, 128 trees) | 423.53s / 10.91s | 74.08% / 74.01%

Proteomics

The task is to classify the mass/charge ratio of peptides using some hand-crafted features of their spectra.
  • Dimensionality: 109
  • Training samples: 25,780
  • Test samples: 25,784
  • Number of classes: 6
  • Accuracy if we always predict the most common class: 49.25%
  • Source: Sergey Feldman
Algorithm | Training / Test Time | Training / Test Accuracy
Gaussian Naive Bayes | 0.16s / 0.25s | 17.18% / 17.40%
Regularized Gaussian Naive Bayes | 0.19s / 0.26s | 5.51% / 5.38%
LDA | failed
QDA | failed
CART
Decision Tree (min samples per leaf = 1) | 3.44s / 0.01s | 100.00% / 77.48%
Decision Tree (min samples per leaf = log n = 11) | 3.28s / 0.01s | 95.65% / 77.39%
Decision Tree (min samples per leaf = sqrt n = 160) | 2.69s / 0.01s | 83.71% / 79.42%
Support Vector Machine
Linear SVM (SGD, ~250k updates = 10 epochs) | 1.05s / 0.06s | 36.81% / 36.89%
Linear SVM (SGD, ~1m updates = 39 epochs) | 3.98s / 0.06s | 61.49% / 61.40%
Linear SVM (SGD, ~2m updates = 78 epochs) | 8.04s / 0.06s | 47.92% / 46.92%
Linear SVM (liblinear) | 109.26s / 0.07s | 56.18% / 56.18%
Logistic Regression
Logistic Regression (SGD, ~250k updates = 10 epochs) | 1.08s / 0.06s | 37.84% / 37.87%
Logistic Regression (SGD, ~1m updates = 39 epochs) | 4.12s / 0.06s | 37.16% / 37.29%
Logistic Regression (SGD, ~2m updates = 78 epochs) | 8.22s / 0.06s | 51.12% / 50.48%
Logistic Regression (liblinear) | 280.85s / 0.07s | 58.66% / 58.18%
Extremely Randomized Trees
Extra-Trees (2 trees) | 3.81s / 0.04s | 100.00% / 72.67%
Extra-Trees (4 trees) | 7.88s / 0.08s | 100.00% / 78.80%
Extra-Trees (8 trees) | 15.17s / 0.14s | 100.00% / 82.09%
Extra-Trees (16 trees) | 29.39s / 0.28s | 100.00% / 84.15%
Extra-Trees (32 trees) | 58.78s / 0.59s | 100.00% / 85.09%
Extra-Trees (64 trees) | 119.16s / 1.14s | 100.00% / 85.53%
Extra-Trees (128 trees) | 233.06s / 2.23s | 100.00% / 85.89%
Gradient Boosting (sample size = 100%)
Gradient Boosting (sample size = 100%, 2 stumps) | 3.99s / 0.03s | 58.70% / 58.49%
Gradient Boosting (sample size = 100%, 4 stumps) | 7.82s / 0.04s | 56.73% / 56.49%
Gradient Boosting (sample size = 100%, 8 stumps) | 15.37s / 0.07s | 58.73% / 58.62%
Gradient Boosting (sample size = 100%, 16 stumps) | 31.09s / 0.11s | 43.50% / 43.57%
Gradient Boosting (sample size = 100%, 32 stumps) | 61.36s / 0.21s | 42.71% / 42.79%
Gradient Boosting (sample size = 100%, 64 stumps) | 118.30s / 0.39s | 15.08% / 15.31%
Gradient Boosting (sample size = 100%, 128 stumps) | 230.85s / 0.78s | 20.24% / 20.04%
Gradient Boosting (sample size = 25%)
Gradient Boosting (sample size = 25%, 2 stumps) | 2.17s / 0.03s | 46.67% / 46.51%
Gradient Boosting (sample size = 25%, 4 stumps) | 4.14s / 0.04s | 44.78% / 44.85%
Gradient Boosting (sample size = 25%, 8 stumps) | 8.10s / 0.07s | 43.36% / 43.28%
Gradient Boosting (sample size = 25%, 16 stumps) | 16.17s / 0.12s | 45.05% / 45.04%
Gradient Boosting (sample size = 25%, 32 stumps) | 31.77s / 0.22s | 22.70% / 22.88%
Gradient Boosting (sample size = 25%, 64 stumps) | 61.40s / 0.42s | 32.85% / 33.07%
Gradient Boosting (sample size = 25%, 128 stumps) | 100.16s / 0.65s | 49.26% / 49.25%
Random Forests (min samples per leaf = 1)
Random Forest (min samples per leaf = 1, 2 trees) | 2.57s / 0.03s | 90.58% / 73.65%
Random Forest (min samples per leaf = 1, 4 trees) | 5.07s / 0.07s | 96.53% / 79.84%
Random Forest (min samples per leaf = 1, 8 trees) | 10.20s / 0.13s | 98.88% / 83.20%
Random Forest (min samples per leaf = 1, 16 trees) | 20.47s / 0.27s | 99.66% / 84.79%
Random Forest (min samples per leaf = 1, 32 trees) | 40.82s / 0.54s | 99.95% / 85.78%
Random Forest (min samples per leaf = 1, 64 trees) | 82.00s / 1.07s | 99.99% / 86.11%
Random Forest (min samples per leaf = 1, 128 trees) | 165.42s / 2.17s | 100.00% / 86.40%
Random Forests (min samples per leaf = log n)
Random Forest (min samples per leaf = 11, 2 trees) | 2.17s / 0.03s | 87.35% / 80.06%
Random Forest (min samples per leaf = 11, 4 trees) | 4.38s / 0.07s | 89.78% / 82.80%
Random Forest (min samples per leaf = 11, 8 trees) | 8.75s / 0.13s | 91.41% / 84.13%
Random Forest (min samples per leaf = 11, 16 trees) | 17.35s / 0.25s | 92.00% / 84.94%
Random Forest (min samples per leaf = 11, 32 trees) | 35.08s / 0.51s | 92.47% / 85.51%
Random Forest (min samples per leaf = 11, 64 trees) | 70.10s / 1.02s | 92.56% / 85.74%
Random Forest (min samples per leaf = 11, 128 trees) | 139.47s / 2.01s | 92.60% / 85.84%
Random Forests (min samples per leaf = sqrt n)
Random Forest (min samples per leaf = 160, 2 trees) | 1.61s / 0.03s | 79.59% / 78.49%
Random Forest (min samples per leaf = 160, 4 trees) | 3.22s / 0.05s | 81.67% / 80.91%
Random Forest (min samples per leaf = 160, 8 trees) | 6.37s / 0.10s | 82.60% / 81.99%
Random Forest (min samples per leaf = 160, 16 trees) | 12.83s / 0.21s | 83.17% / 82.50%
Random Forest (min samples per leaf = 160, 32 trees) | 26.01s / 0.42s | 83.23% / 82.68%
Random Forest (min samples per leaf = 160, 64 trees) | 51.79s / 0.82s | 83.41% / 82.90%
Random Forest (min samples per leaf = 160, 128 trees) | 102.06s / 1.64s | 83.41% / 82.94%

Isolet

Recognize which letter is being spoken from hand-crafted auditory features. I don't really think this dataset is large enough to qualify as "medium-sized", but I still threw it in for comparison.
  • Dimensionality: 617
  • Training samples: 6,238
  • Test samples: 1,559
  • Number of classes: 26
  • Accuracy if we always predict the most common class: 3.85%
  • Source: UCI
Algorithm | Training / Test Time | Training / Test Accuracy
Gaussian Naive Bayes | 0.46s / 0.58s | 84.23% / 80.12%
Regularized Gaussian Naive Bayes | 0.51s / 0.58s | 89.87% / 88.65%
LDA | 4.18s / 0.22s | 97.52% / 94.16%
QDA | failed
CART
Decision Tree (min samples per leaf = 1) | 4.52s / 0.00s | 100.00% / 79.92%
Decision Tree (min samples per leaf = log n = 9) | 4.39s / 0.00s | 95.70% / 79.54%
Decision Tree (min samples per leaf = sqrt n = 78) | 4.03s / 0.00s | 88.35% / 79.28%
Support Vector Machine
Linear SVM (SGD, ~250k updates = 41 epochs) | 9.12s / 0.06s | 99.90% / 95.00%
Linear SVM (SGD, ~1m updates = 161 epochs) | 35.12s / 0.07s | 100.00% / 94.48%
Linear SVM (SGD, ~2m updates = 321 epochs) | 70.07s / 0.06s | 100.00% / 94.36%
Linear SVM (liblinear) | 13.10s / 0.05s | 100.00% / 94.55%
Logistic Regression
Logistic Regression (SGD, ~250k updates = 41 epochs) | 19.39s / 0.06s | 99.98% / 95.00%
Logistic Regression (SGD, ~1m updates = 161 epochs) | 74.00s / 0.06s | 100.00% / 95.57%
Logistic Regression (SGD, ~2m updates = 321 epochs) | 147.53s / 0.06s | 100.00% / 95.57%
Logistic Regression (liblinear) | 42.67s / 0.05s | 99.97% / 95.51%
Extremely Randomized Trees
Extra-Trees (2 trees) | 2.52s / 0.01s | 100.00% / 66.77%
Extra-Trees (4 trees) | 4.75s / 0.02s | 100.00% / 79.09%
Extra-Trees (8 trees) | 9.34s / 0.03s | 100.00% / 87.94%
Extra-Trees (16 trees) | 18.39s / 0.06s | 100.00% / 92.05%
Extra-Trees (32 trees) | 36.38s / 0.11s | 100.00% / 93.46%
Extra-Trees (64 trees) | 72.55s / 0.22s | 100.00% / 94.74%
Extra-Trees (128 trees) | 144.06s / 0.44s | 100.00% / 95.38%
Gradient Boosting (sample size = 100%)
Gradient Boosting (sample size = 100%, 2 stumps) | 20.19s / 0.01s | 4.14% / 4.43%
Gradient Boosting (sample size = 100%, 4 stumps) | 40.19s / 0.01s | 3.91% / 4.87%
Gradient Boosting (sample size = 100%, 8 stumps) | 80.14s / 0.02s | 4.09% / 4.30%
Gradient Boosting (sample size = 100%, 16 stumps) | 160.47s / 0.03s | 3.77% / 4.62%
Gradient Boosting (sample size = 100%, 32 stumps) | 319.10s / 0.08s | 3.77% / 4.43%
Gradient Boosting (sample size = 100%, 64 stumps) | 630.17s / 0.13s | 3.82% / 4.36%
Gradient Boosting (sample size = 100%, 128 stumps) | 1251.35s / 0.25s | 3.69% / 3.85%
Gradient Boosting (sample size = 25%)
Gradient Boosting (sample size = 25%, 2 stumps) | 10.08s / 0.01s | 3.70% / 3.98%
Gradient Boosting (sample size = 25%, 4 stumps) | 19.79s / 0.01s | 3.72% / 3.85%
Gradient Boosting (sample size = 25%, 8 stumps) | 39.54s / 0.02s | 3.49% / 3.66%
Gradient Boosting (sample size = 25%, 16 stumps) | 63.67s / 0.03s | 3.85% / 3.85%
Gradient Boosting (sample size = 25%, 32 stumps) | 83.68s / 0.02s | 3.85% / 3.85%
Gradient Boosting (sample size = 25%, 64 stumps) | 124.81s / 0.04s | 3.85% / 3.85%
Gradient Boosting (sample size = 25%, 128 stumps) | 203.56s / 0.05s | 3.85% / 3.85%
Random Forests (min samples per leaf = 1)
Random Forest (min samples per leaf = 1, 2 trees) | 2.60s / 0.01s | 88.12% / 66.32%
Random Forest (min samples per leaf = 1, 4 trees) | 5.45s / 0.02s | 97.74% / 81.46%
Random Forest (min samples per leaf = 1, 8 trees) | 10.46s / 0.03s | 99.71% / 88.65%
Random Forest (min samples per leaf = 1, 16 trees) | 21.04s / 0.05s | 99.98% / 92.05%
Random Forest (min samples per leaf = 1, 32 trees) | 42.44s / 0.11s | 100.00% / 93.52%
Random Forest (min samples per leaf = 1, 64 trees) | 83.37s / 0.21s | 100.00% / 94.10%
Random Forest (min samples per leaf = 1, 128 trees) | 168.74s / 0.43s | 100.00% / 93.97%
Random Forests (min samples per leaf = log n)
Random Forest (min samples per leaf = 9, 2 trees) | 2.32s / 0.01s | 87.48% / 76.78%
Random Forest (min samples per leaf = 9, 4 trees) | 4.66s / 0.01s | 93.78% / 85.63%
Random Forest (min samples per leaf = 9, 8 trees) | 9.32s / 0.03s | 96.73% / 89.35%
Random Forest (min samples per leaf = 9, 16 trees) | 18.82s / 0.05s | 98.12% / 92.62%
Random Forest (min samples per leaf = 9, 32 trees) | 37.40s / 0.10s | 98.59% / 93.33%
Random Forest (min samples per leaf = 9, 64 trees) | 75.41s / 0.21s | 98.75% / 94.03%
Random Forest (min samples per leaf = 9, 128 trees) | 152.74s / 0.44s | 98.86% / 94.16%
Random Forests (min samples per leaf = sqrt n)
Random Forest (min samples per leaf = 78, 2 trees) | 2.01s / 0.01s | 73.87% / 69.53%
Random Forest (min samples per leaf = 78, 4 trees) | 4.02s / 0.01s | 82.57% / 78.51%
Random Forest (min samples per leaf = 78, 8 trees) | 8.00s / 0.02s | 88.14% / 85.05%
Random Forest (min samples per leaf = 78, 16 trees) | 15.87s / 0.05s | 90.85% / 88.65%
Random Forest (min samples per leaf = 78, 32 trees) | 31.94s / 0.11s | 91.66% / 90.64%
Random Forest (min samples per leaf = 78, 64 trees) | 64.22s / 0.20s | 92.29% / 90.38%
Random Forest (min samples per leaf = 78, 128 trees) | 128.47s / 0.40s | 92.40% / 90.96%

1 comment:

  1. What's meant by "Training error"?
    (Test error is measured by cross validation I assume, or a hold out set?).

    (Why not use ~5-10 fold CV and just report the accuracy or F1 from that?)
