When I'm constructing a classification pipeline, there are usually a great number of choices to be made about feature selection/transformation/extraction. I often find myself thinking things like:
- "Is number of burps a useful feature?"
- "Should I take the log of lollipops per hectare?"
- "I don't know what structure this time series actually has, so fuck it, I'll try wavelets..."
- "Andrew Ng told me learning a sparse overcomplete feature representation might work..."
While you can choose and manipulate your features by divination, tea leaf reading, or good old-fashioned gut-following, it's better to actually check whether what you're doing improves performance on a validation set. The trouble is that repeatedly training a classifier can become cumbersome once you leave the cozy world of Small Data. Even worse, if you want to use some sort of automatic feature selection algorithm (which requires training hundreds or thousands of classifiers), you might find yourself waiting for a very very long time.
So which classifiers can I realistically use for exploring medium-sized data sets (those small enough to fit in memory, but too big for something silly like constructing a kernel matrix)? I tested a variety of scikit-learn's classifiers to see how they compare.
Data
I generated 50,000 samples from 10 distinct 500-dimensional Gaussians (with random means and random diagonal covariance matrices), and randomly assigned each sample to either a test or training set. This yielded a data set with 10 classes, 500 dimensions, 24,853 training samples, and 25,147 test samples.

In addition to checking the runtime and accuracy of each classifier, I also wanted to know how well they cope with noisy/irrelevant features. To this end, I made a copy of the data with 258 of the 500 features replaced by uniform noise.
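For readers who want to reproduce the setup, here's a minimal sketch of generating data like this with numpy; the specific seed, means, and variances are placeholders rather than the values I actually used.

```python
import numpy as np

rng = np.random.RandomState(0)
n_classes, n_features, n_samples = 10, 500, 50000

# One random mean vector and one random diagonal covariance per class
means = 2.0 * rng.randn(n_classes, n_features)
scales = 0.5 + rng.rand(n_classes, n_features)

# Draw each sample from the Gaussian of a randomly chosen class
y = rng.randint(n_classes, size=n_samples)
X = means[y] + scales[y] * rng.randn(n_samples, n_features)

# Randomly assign each sample to the training or test set
in_train = rng.rand(n_samples) < 0.5
X_train, y_train = X[in_train], y[in_train]
X_test, y_test = X[~in_train], y[~in_train]

# Corrupted copy: replace 258 of the 500 features with uniform noise
noisy = rng.choice(n_features, size=258, replace=False)
X_train_noisy, X_test_noisy = X_train.copy(), X_test.copy()
X_train_noisy[:, noisy] = rng.rand(len(X_train_noisy), len(noisy))
X_test_noisy[:, noisy] = rng.rand(len(X_test_noisy), len(noisy))
```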
Results
Algorithm | Training/Test Time | Training/Test Accuracy | Corrupt Feature Training/Test Accuracy |
---|---|---|---|
Linear Discriminant Analysis | 9.16s / 0.23s | 51.64% / 46.19% | 37.39% / 30.96% |
Quadratic Discriminant Analysis | 14.78s / 34.27s | 100% / 100% | 100% / 100% |
Linear SVM (trained with liblinear) | 449.9s / 0.49s | 38.81% / 34.95% | 25.26% / 21.27% |
Linear SVM (trained with SGD) | 3.8s / 0.13s | 50.13% / 43.68% | 24.4% / 21.38% |
L1-penalized SVM (trained with SGD) | 27.49s / 0.13s | 50.29% / 43.39% | 33.81% / 26.5% |
Logistic Regression | 19.74s / 0.49s | 51.69% / 45.56% | 37.24% / 30.37% |
Gaussian Naive Bayes | 0.73s / 5.14s | 100% / 100% | 100% / 100% |
Decision Tree (CART) | 20.4s / 0.1s | 100% / 61% | 100% / 58.9% |
Random Forest (10 trees) | 49.45s / 1.01s | 99.96% / 84.32% | 99.93% / 74.2% |
Random Forest (20 trees) | 98.62s / 2.02s | 100% / 95.2% | 100% / 88.8% |
Random Forest (40 trees) | 197.56s / 4.03s | 100% / 99.2% | 100% / 96.75% |
Random Forest (80 trees) | 393.5s / 8.08s | 100% / 99.96% | 100% / 99.26% |
Random Forest (160 trees) | 789.2s / 16.01s | 100% / 100% | 100% / 99.82% |
Extremely Randomized Tree | 6.58s / 0.1s | 100% / 35.46% | 100% / 29% |
Extra Trees (10 trees) | 57.94s / 1.04s | 100% / 72.6% | 100% / 55.4% |
Extra Trees (20 trees) | 110.84s / 2.12s | 100% / 90.1% | 100% / 72.56% |
Extra Trees (40 trees) | 221.43s / 4.23s | 100% / 98.16% | 100% / 88.99% |
Extra Trees (80 trees) | 441.7s / 8.46s | 100% / 99.87% | 100% / 88.99% |
Extra Trees (160 trees) | 878.2s / 16.84s | 100% / 100% | 100% / 99.4% |
Gradient Boosting (4 trees per class) | 59.93s / 0.23s | 82.9% / 80.59% | 75.22% / 72.91% |
Gradient Boosting (8 trees per class) | 118.66s / 0.36s | 93.85% / 92.59% | 89.04% / 87.19% |
Gradient Boosting (16 trees per class) | 234.78s / 0.61s | 98.75% / 97.94% | 96.28% / 94.72% |
Gradient Boosting (32 trees per class) | 464.89s / 1.11s | 99.9% / 99.72% | 99.2% / 98.52% |
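For anyone wanting to run a similar comparison, a minimal benchmarking loop over scikit-learn estimators might look like the sketch below; the estimator list and hyperparameters are illustrative, not the exact configuration behind the table.

```python
import time
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

classifiers = {
    "Gaussian Naive Bayes": GaussianNB(),
    "Linear SVM (SGD)": SGDClassifier(loss="hinge", penalty="l2"),
    "Random Forest (20 trees)": RandomForestClassifier(n_estimators=20),
}

for name, clf in classifiers.items():
    start = time.time()
    clf.fit(X_train, y_train)          # X_train/y_train from the data sketch above
    train_time = time.time() - start

    start = time.time()
    test_acc = clf.score(X_test, y_test)
    test_time = time.time() - start

    train_acc = clf.score(X_train, y_train)
    print(f"{name}: fit {train_time:.2f}s, predict {test_time:.2f}s, "
          f"train acc {train_acc:.2%}, test acc {test_acc:.2%}")
```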
Observations
- If you know the process by which your data was generated, then consider your work done. In this case, the data was sampled from 10 Gaussian distributions and (unsurprisingly) QDA and Gaussian Naive Bayes both do very well. If I had used Gaussians with non-zero off-diagonal covariances (interactions between features), then I would expect Gaussian Naive Bayes to do worse, while QDA would still capture the best possible class boundaries.
- SVM and Logistic Regression are very sensitive to the actual margin between classes. I retrained the linear learners with better separated class means and found their test accuracy went up to ~99% (whereas tree-based algorithms only improved slightly).
- Stochastic gradient descent for L2-regularized learners is blazing fast. Kudos to Peter Prettenhofer for implementing something which makes every other linear learner look like a slug.
- Though still fast compared to the glacial alternative of using a convex solver, SGD for L1-penalized learning is noticeably slower.
- As advertised, a single Extremely Randomized Tree is terrible. An Extra Trees ensemble, however, seems competitive with Random Forests on clean data, but degrades more when trained on corrupted features. This makes sense, since a Random Forest's tree-growing algorithm gets a chance to skip useless features, whereas an Extremely Randomized Tree considers all features with equal probability.
- Only Naive Bayes and the SGD-trained SVM are cheap enough for use with greedy feature selection (a minimal sketch follows this list).
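To make that last point concrete, here's a minimal sketch of greedy forward feature selection using Gaussian Naive Bayes as the cheap base learner; the stopping rule and validation split are my own placeholder choices. The inner loop trains one classifier per remaining feature, which is why a fast learner matters.

```python
from sklearn.naive_bayes import GaussianNB

def greedy_forward_selection(X_train, y_train, X_val, y_val, max_features=20):
    """Greedily add whichever remaining feature most improves validation accuracy."""
    selected = []
    remaining = list(range(X_train.shape[1]))
    best_acc = 0.0
    while remaining and len(selected) < max_features:
        # Score every candidate feature added to the current set
        scored = []
        for f in remaining:
            cols = selected + [f]
            clf = GaussianNB().fit(X_train[:, cols], y_train)
            scored.append((clf.score(X_val[:, cols], y_val), f))
        acc, best_f = max(scored)
        if acc <= best_acc:        # stop once no candidate improves accuracy
            break
        best_acc = acc
        selected.append(best_f)
        remaining.remove(best_f)
    return selected, best_acc
```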
Next Time
- Since the base learners of a Random Forest took ~5 seconds to train, can random forests with only 2 or 3 trees give good enough accuracy to be useful?
- If we use smaller bootstrap samples (training each tree on 10% or even 1% of the data), how badly will accuracy suffer?
- Do any of the "expensive" algorithms become cheaper on a differently shaped data set (more rows, fewer features)?
- The SGD learners above performed ~1 million weight updates (the suggested heuristic from the sklearn page). Presumably I could get a 10X performance gain by using only 10^5 updates (a sketch of how to cap the update budget follows this list). How badly will this impact generalization/test error?
- Simulated Gaussian data is totally unrepresentative! How do these algorithms behave on real data? I'm going to try them out on some intra-day currency time series and some other data we found lying around on Kaggle.
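A minimal sketch of that SGD update-budget experiment, assuming the usual scikit-learn mapping from total updates to epochs (recent releases call the epoch count max_iter; older ones used n_iter):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

n_train = X_train.shape[0]   # ~25,000 samples in this experiment

# Total weight updates ~= epochs * n_train, so the sklearn heuristic of
# ~10^6 updates means ceil(10^6 / n_train) passes over the data.
full_epochs = int(np.ceil(1e6 / n_train))    # ~40 epochs here
small_epochs = int(np.ceil(1e5 / n_train))   # ~4 epochs, roughly 10x cheaper

cheap_svm = SGDClassifier(loss="hinge", penalty="l2",
                          max_iter=small_epochs, tol=None)
cheap_svm.fit(X_train, y_train)
print("test accuracy:", cheap_svm.score(X_test, y_test))
```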