When I'm constructing a classification pipeline, there are usually a great number of choices to be made about feature selection/transformation/extraction. I often find myself thinking things like:
- "Is number of burps a useful feature?"
- "Should I take the log of lollipops per hectare?"
- "I don't know what structure this time series actually has, so fuck it, I'll try wavelets..."
- "Andrew Ng told me learning a sparse overcomplete feature representation might work..."
While you can choose and manipulate your features by divination, tea leaf reading, or good old-fashioned gut-following, it's better to actually check whether what you're doing improves performance on a validation set. The trouble is that repeatedly training a classifier can become cumbersome once you leave the cozy world of Small Data. Even worse, if you want to use some sort of automatic feature selection algorithm (which requires training hundreds or thousands of classifiers), you might find yourself waiting for a very very long time.
So which classifiers can I realistically use for exploring medium-sized data sets (those small enough to fit in memory, but too big for something silly like constructing a kernel matrix)? I tested a variety of scikit-learn's classifiers to see how they compare.
Data
I generated 50,000 samples from 10 distinct 500-dimensional gaussians (with random means and random diagonal covariance matrices), and randomly assigned each sample to either a test or training set. This yielded a data set with 10 classes, 500 dimensions, 24,853 training samples and 25,147 test samples.

In addition to checking the runtime and accuracy of each classifier, I also wanted to know how well they cope with noisy/irrelevant features. To this end, I made a copy of the data with 258 of the 500 features replaced by uniform noise.
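The generation process looks roughly like the sketch below (the random seed, the spread of the class means and variances, and the 50/50 split probability are illustrative assumptions, not the exact values behind the counts above):

```python
import numpy as np

n_classes, n_features, n_samples = 10, 500, 50000
rng = np.random.RandomState(0)  # illustrative seed

# random means and random diagonal covariances for each class
means = rng.randn(n_classes, n_features) * 2.0
stds = rng.uniform(0.5, 2.0, size=(n_classes, n_features))

# sample each point from its class's gaussian
labels = rng.randint(n_classes, size=n_samples)
X = means[labels] + stds[labels] * rng.randn(n_samples, n_features)

# randomly assign each sample to the training or test set
in_train = rng.rand(n_samples) < 0.5
X_train, y_train = X[in_train], labels[in_train]
X_test, y_test = X[~in_train], labels[~in_train]

# corrupted copy: replace 258 of the 500 features with uniform noise
X_noisy = X.copy()
noise_cols = rng.permutation(n_features)[:258]
X_noisy[:, noise_cols] = rng.uniform(X.min(), X.max(), size=(n_samples, 258))
```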
Results
Algorithm | Training / Test Time | Training / Test Accuracy | Training / Test Accuracy (Corrupted Features) |
---|---|---|---|
Linear Discriminant Analysis | 9.16s / 0.23s | 51.64% / 46.19% | 37.39% / 30.96% |
Quadratic Discriminant Analysis | 14.78s / 34.27s | 100% / 100% | 100% / 100% |
Linear SVM (trained with liblinear) | 449.9s / 0.49s | 38.81% / 34.95% | 25.26% / 21.27% |
Linear SVM (trained with SGD) | 3.8s / 0.13s | 50.13% / 43.68% | 24.4% / 21.38% |
L1-penalized SVM (trained with SGD) | 27.49s / 0.13s | 50.29% / 43.39% | 33.81% / 26.5% |
Logistic Regression | 19.74s / 0.49s | 51.69% / 45.56% | 37.24% / 30.37% |
Gaussian Naive Bayes | 0.73s / 5.14s | 100% / 100% | 100% / 100% |
Decision Tree (CART) | 20.4s / 0.1s | 100% / 61% | 100% / 58.9% |
Random Forest (10 trees) | 49.45s / 1.01s | 99.96% / 84.32% | 99.93% / 74.2% |
Random Forest (20 trees) | 98.62s / 2.02s | 100% / 95.2% | 100% / 88.8% |
Random Forest (40 trees) | 197.56s / 4.03s | 100% / 99.2% | 100% / 96.75% |
Random Forest (80 trees) | 393.5s / 8.08s | 100% / 99.96% | 100% / 99.26% |
Random Forest (160 trees) | 789.2s / 16.01s | 100% / 100% | 100% / 99.82% |
Extremely Randomized Tree | 6.58s / 0.1s | 100% / 35.46% | 100% / 29% |
Extra Trees (10 trees) | 57.94s / 1.04s | 100% / 72.6% | 100% / 55.4% |
Extra Trees (20 trees) | 110.84s / 2.12s | 100% / 90.1% | 100% / 72.56% |
Extra Trees (40 trees) | 221.43s / 4.23s | 100% / 98.16% | 100% / 88.99% |
Extra Trees (80 trees) | 441.7s / 8.46s | 100% / 99.87% | 100% / 88.99% |
Extra Trees (160 trees) | 878.2s / 16.84s | 100% / 100% | 100% / 99.4% |
Gradient Boosting (4 trees per class) | 59.93s / 0.23s | 82.9% / 80.59% | 75.22% / 72.91% |
Gradient Boosting (8 trees per class) | 118.66s / 0.36s | 93.85% / 92.59% | 89.04% / 87.19% |
Gradient Boosting (16 trees per class) | 234.78s / 0.61s | 98.75% / 97.94% | 96.28% / 94.72% |
Gradient Boosting (32 trees per class) | 464.89s / 1.11s | 99.9% / 99.72% | 99.2% / 98.52% |
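An earlier version of the actual benchmarking script is linked in a comment below; as a rough illustration, the timing loop behind a table like this looks something like the following sketch (the classifier list is abbreviated and mostly uses default parameters):

```python
import time
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    "Gaussian Naive Bayes": GaussianNB(),
    "Linear SVM (trained with SGD)": SGDClassifier(loss="hinge", penalty="l2"),
    "Random Forest (10 trees)": RandomForestClassifier(n_estimators=10),
}

for name, clf in classifiers.items():
    # time fitting on the training set
    start = time.time()
    clf.fit(X_train, y_train)
    train_time = time.time() - start

    # time prediction/scoring on the test set
    start = time.time()
    test_acc = clf.score(X_test, y_test)
    test_time = time.time() - start

    train_acc = clf.score(X_train, y_train)
    print("%s: %.2fs / %.2fs, %.2f%% / %.2f%%"
          % (name, train_time, test_time, 100 * train_acc, 100 * test_acc))
```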
Observations
- If you know the process by which your data was generated then consider your work done. In this case, the data was sampled from 10 Gaussian distributions and (unsurprisingly) QDA and Gaussian Naive Bayes both do very well. If I had used Gaussians with non-zero off-diagonal covariances (interactions between features), then I would expect Gaussian Naive Bayes to do worse but QDA would still capture the best possible class boundaries.
- SVM and Logistic Regression are very sensitive to the actual margin between classes. I retrained the linear learners with better separated class means and found their test accuracy went up to ~99% (whereas tree-based algorithms only improved slightly).
- Stochastic gradient descent for L2-regularized learners is blazing fast. Kudos to Peter Prettenhofer for implementing something which makes every other linear learner look like a slug.
- Though still fast compared to the glacial alternative of using a convex solver, SGD for L1-penalized learning is noticeably slower.
- As advertised, a single Extremely Randomized Tree is terrible by itself. An Extra Tree ensemble, however, seems competitive with Random Forests for clean data, but shows worse degradation when trained on corrupted features. This makes sense, since a Random Forest's tree-growing algorithm will get a chance to skip useless features (and an Extremely Randomized Tree uses all features with equal probability).
- Only Naive Bayes and SGD-SVM are cheap enough for use with greedy feature selection (a sketch of what such a selection loop might look like follows this list).
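To make that last point concrete, here is a minimal sketch of the kind of greedy forward-selection loop I have in mind, using Gaussian Naive Bayes as the cheap inner learner (the validation split and the cap on the number of selected features are illustrative assumptions, not part of the experiment above):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

def greedy_forward_selection(X_train, y_train, X_val, y_val, max_features=20):
    """Greedily add whichever remaining feature most improves validation accuracy."""
    selected, best_score = [], 0.0
    remaining = list(range(X_train.shape[1]))
    for _ in range(max_features):
        # score every candidate feature added to the current set
        scores = []
        for f in remaining:
            cols = selected + [f]
            model = GaussianNB().fit(X_train[:, cols], y_train)
            scores.append((accuracy_score(y_val, model.predict(X_val[:, cols])), f))
        score, best_f = max(scores)
        if score <= best_score:
            break  # no remaining feature improves validation accuracy
        best_score = score
        selected.append(best_f)
        remaining.remove(best_f)
    return selected, best_score
```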
Next Time
- Since the base learners of a Random Forest took ~5 seconds each to train, can a random forest of only 2 or 3 trees give good enough accuracy to be useful?
- If we use smaller bootstrap samples (training each tree on 10% or even 1% of the data), how badly will accuracy suffer?
- Do any of the "expensive" algorithms become cheaper on a differently shaped data set (more rows, fewer features)?
- The SGD learners above performed roughly 10^6 weight updates (the suggested heuristic from the sklearn page; see the snippet after this list). Presumably I could get a 10X speedup by using only 10^5 updates. How badly will this impact generalization/test error?
- Simulated Gaussian data is totally unrepresentative! How do these algorithms behave on real data? I'm going to try them out on some intra-day currency time series and some other data we found lying around on kaggle.
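For the SGD question above: the heuristic on the scikit-learn page picks the number of passes over the data so that the total number of weight updates comes out to roughly 10^6. A minimal sketch, assuming the training arrays from the Data section (note that in recent scikit-learn versions the `n_iter` parameter has been renamed `max_iter`):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

n_samples = X_train.shape[0]
n_iter = int(np.ceil(10 ** 6 / float(n_samples)))  # passes needed for ~1e6 weight updates

# The question above: cut this to ~1e5 updates (n_iter // 10) and see how much test accuracy drops.
clf = SGDClassifier(loss="hinge", penalty="l2", n_iter=n_iter)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```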
Comments

Nice post, thanks!
In your results, what exactly do you mean by training/test accuracy? Isn't accuracy on the training set essentially meaningless as you are exposing yourself to overfitting?
Also here is another classic source of sample data to analyze:
http://archive.ics.uci.edu/ml/
Hi George,
You're right in that training accuracy doesn't tell you anything useful about whether your model has achieved generalization (as opposed to just overfitting/memorizing the data). It does, however, usually provide an upper bound for how well you might do on unseen data. Looking at both training and test accuracy lets you figure out on which side of the bias/variance tradeoff a particular model falls.
If I have low test accuracy but high training accuracy then my hypothesis class is probably too rich/powerful for my data and I have overfit my training data. If, on the other hand, both are low then I'm probably suffering from some mismatch between the data and a model's bias.
Also, thanks for the UCI link. For the next post, I'm going to include one or two of their larger datasets (along with one from kaggle and some finance data I have been working with).
It's really hard to rigorously compare learning models but I'm hoping to get a rough view of which algorithms get "good enough" performance without taking too much time.
Also, here's a pretty good lecture on the bias/variance tradeoff: http://youtu.be/zrEyxfl2-a8?t=8m
Cool, I look forward to seeing your next post.
Nice blog post. Quick question: which implementation of LogisticRegression are you using? The one based on liblinear (aka sklearn.linear_model.LogisticRegression) or a sklearn.linear_model.SGDClassifier model with `loss=log`?
FYI I also submitted the link of your blog post on HN here:
http://news.ycombinator.com/item?id=4153549
Hey Olivier,
Glad you liked it. I used liblinear for Logistic Regression, though now I'm curious to see how much faster the SGD version would be (I'll try comparing them for the next post). Do you have any insight as to why liblinear is reasonably speedy for logistic regression but stalls when training an SVM?
Also, I just saw that LogisticRegression's multiclass classification is implemented via One-vs-All training, which is a little hackish (since I assume the models aren't really calibrated relative to one another). Do you know if there's an implementation of a proper MaxEnt multinomial regression with python bindings?
For the difference between the SVM hinge loss and the LogisticRegression log loss in the liblinear model wrapped in scikit-learn, one would have to check the tolerance and algorithms (dual vs primal) used with the default settings in both cases (I am too lazy to check now :). Also maybe the hinge loss is more sensitive to non-normalized data (although your dataset doesn't seem to be particularly ill-conditioned besides the non-linear separability).
I think I heard on the mailing list that one scikit-learn developer (either @pprett or @mblondel) has a real multinomial log loss for the SGDClassifier lying around in a branch somewhere. As usual it is "just" a matter of cleaning it up, documenting it and testing it... Also there is a GSoC project on Multi Layer Perceptron with SGD, so the cross entropy loss will show up in scikit-learn that way as well. I don't think it matters that much in practice for linear models though, unless you are really interested in the predicted probabilities.
Would you mind posting your code? I would like to try playing with your implementation. Thanks!
Hey Maverick,
I think the exact code I used is lost in an unsaved ipython notebook but there's an earlier version of it here: https://github.com/iskandr/data-experiments/blob/master/timing.py