Wednesday, July 11, 2012

Should you apply PCA to your data?

If you've ever dipped your toe into the cold & murky pool of data processing, you've probably heard of principal component analysis (PCA).  PCA is a classy way to reduce the dimensionality of your data, while (purportedly) keeping most of the information.  It's ubiquitously well-regarded.
But is it actually a good idea in practice?  Should you apply PCA to your data before, for example, learning a classifier?  This post will take a small step in the direction of answering this question.

I was inspired to investigate PCA by David MacKay's amusing response to an Amazon review lamenting PCA's absence in MacKay's book:
"Principal Component Analysis" is a dimensionally invalid method that gives people a delusion that they are doing something useful with their data. If you change the units that one of the variables is measured in, it will change all the "principal components"! It's for that reason that I made no mention of PCA in my book. I am not a slavish conformist, regurgitating whatever other people think should be taught. I think before I teach. David J C MacKay.
Ha!  He's right, of course.  Snarky, but right.  The results of PCA depend on the scaling of your data.  If, for example, your raw data has one dimension that is on the order of $10^2$ and another on the order of $10^6$, you may run into trouble.
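To see his point concretely, here's a tiny illustration on hypothetical two-column data (numpy & scikit-learn are used purely for illustration; this isn't part of the experiments below). The first principal component ends up pointing almost entirely along the large-scale variable, and merely changing that variable's units changes the components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
n = 1000
# Hypothetical raw data: one feature on the order of 1e2, another on the order of 1e6.
X = np.column_stack([rng.randn(n) * 1e2, rng.randn(n) * 1e6])

pca = PCA(n_components=2).fit(X)
print(pca.components_[0])             # first PC points almost entirely along the 1e6 feature
print(pca.explained_variance_ratio_)  # roughly [1.0, 0.0]: the large-scale feature dominates

# Change the units of the second feature (say, micrometers -> meters) and the
# "principal components" change completely.
pca_rescaled = PCA(n_components=2).fit(X * np.array([1.0, 1e-6]))
print(pca_rescaled.components_[0])    # now the first PC points along the other feature
print(pca_rescaled.explained_variance_ratio_)
```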

Exploring this point, I'm going to report test classification accuracy before & after applying PCA as a dimensionality reduction technique.  Since, as MacKay points out, variable normalizations are important, I tried each of the following combinations of normalization & PCA before classification:
  1. None.
  2. PCA on the raw data.  
  3. PCA on sphered data (each dimension has mean 0, variance 1).
  4. PCA on 0-to-1 normalized data (each dimension is squished to be between 0 and 1).
  5. ZCA-whitening on the raw data (a rotation & scaling that results in identity covariance).
  6. PCA on ZCA-whitened data.
Some experimental details:
  • Each random forest (RF) was trained to full depth with $100$ trees and $\sqrt{d}$ features sampled at each split.  I used MATLAB & this RF package.
  • The random forest is not sensitive to any dimension-wise normalizations, which is why I don't bother comparing RF on the raw data to RF on sphered & 0-to-1 normalized data. The performance is identical! (That's one of many reasons why we <3 random forests).
  • PCA in the above experiments is always applied as a dimensionality reduction technique: the principal components that explain 99% of the variance are kept, and the rest are thrown out (see details here).
  • ZCA is usually used as normalization (and not as dimensionality reduction). Rotation does affect the RF, and that's why experiment (5) is included.
  • PCA and ZCA require the data to have zero mean.
  • The demeaning, PCA/ZCA transformations, and classifier training were all done on the training data only, and then applied to the held-out test data (a minimal sketch of this pipeline appears right after this list).
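For concreteness, here's a minimal numpy sketch of that pipeline (my experiments were run in MATLAB with the RF package above, so treat this as an illustration of the procedure rather than the actual code): demean using training statistics only, keep the principal components that explain 99% of the variance (or ZCA-whiten), and apply the same transformation to the test data.

```python
import numpy as np

def fit_pca(X_train, var_to_keep=0.99):
    """Fit PCA on the training data; keep enough components to explain var_to_keep of the variance."""
    mu = X_train.mean(axis=0)
    cov = np.cov(X_train - mu, rowvar=False)       # d x d covariance of the demeaned training data
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]              # sort eigenvalues in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / np.sum(eigvals)
    k = int(np.searchsorted(explained, var_to_keep)) + 1
    return mu, eigvecs[:, :k]                      # training mean and the top-k principal directions

def apply_pca(X, mu, W):
    return (X - mu) @ W

def fit_zca(X_train, eps=1e-5):
    """Fit ZCA whitening on the training data: rotate, rescale, rotate back (keeps all dimensions)."""
    mu = X_train.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X_train - mu, rowvar=False))
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mu, W

def apply_zca(X, mu, W):
    return (X - mu) @ W

# Fit on the training split only, then transform both splits, e.g.:
#   mu, W = fit_pca(X_train)            # optionally sphere or 0-to-1 normalize X_train first
#   Z_train = apply_pca(X_train, mu, W)
#   Z_test  = apply_pca(X_test, mu, W)
```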
And now, the results in table form.

| Dataset    | Raw Accuracy | PCA   | Sphere + PCA | 0-to-1 + PCA | ZCA   | ZCA + PCA |
|------------|--------------|-------|--------------|--------------|-------|-----------|
| proteomics | 86.3%        | 65.4% | 83.2%        | 82.8%        | 84.3% | 82.8%     |
| dnasim     | 83.1%        | 75.6% | 73.8%        | 81.5%        | 85.8% | 86.3%     |
| isolet     | 94.2%        | 88.6% | 87.9%        | 88.6%        | 74.2% | 87.9%     |
| usps       | 93.7%        | 91.2% | 90.4%        | 90.5%        | 88.2% | 88.4%     |
| covertype  | 86.8%        | 94.5% | 93.5%        | 94.4%        | 94.6% | 94.5%     |

Observations:
  • Applying PCA to the raw data can be disastrous. The proteomics dataset has all kinds of wacky scaling issues, and it shows: a drop of more than 20 percentage points in accuracy!
  • For dnasim, choice of normalization before PCA is significant, but not so much for the other datasets. This demonstrates MacKay's point. In other words: don't just sphere like a "slavish conformist"!  Try other normalizations.
  • Sometimes rotating your data can create problems. ZCA keeps all the dimensions & the accuracy still drops for proteomics, isolet, and USPS. Probably because a bunch of the very noisy dimensions are mixed in with all the others, effectively adding noise where there was little before.
  • Try ZCA and PCA - you might get a fantastic boost in accuracy. The covertype accuracy in this post is better than every covertype accuracy Alex reported in his previous post.
  • I also ran these experiments with a 3-tree random forest, and the above trends are still clear. In other words, you can efficiently figure out which combo of normalization and PCA/ZCA is right for your dataset.
There is no simple story here.  What these experiments have taught me is (i) don't apply PCA or ZCA blindly, but (ii) do try PCA and ZCA, since they have the potential to improve performance significantly.  Validate your algorithmic choices!
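To make that validation cheap in practice, here's a hypothetical sketch of the kind of quick screen described above, using scikit-learn pipelines and a tiny 3-tree forest (load_digits is just a stand-in for your own data; my experiments used MATLAB and a different RF implementation):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_digits(return_X_y=True)                          # stand-in dataset
rf = RandomForestClassifier(n_estimators=3, random_state=0)  # tiny forest for a quick screen

# PCA(n_components=0.99) keeps the components that explain 99% of the variance.
# (scikit-learn has no built-in ZCA; see the whitening sketch earlier in the post.)
candidates = {
    "raw":          make_pipeline(rf),
    "pca":          make_pipeline(PCA(n_components=0.99), rf),
    "sphere + pca": make_pipeline(StandardScaler(), PCA(n_components=0.99), rf),
    "0-to-1 + pca": make_pipeline(MinMaxScaler(), PCA(n_components=0.99), rf),
}

for name, pipeline in candidates.items():
    scores = cross_val_score(pipeline, X, y, cv=5)           # each fold refits the whole pipeline
    print(f"{name:>12s}: {scores.mean():.3f}")
```

Whichever combination wins the quick screen can then be retrained with a full-size forest.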


Addendum: a table with dimensionality before & after PCA with various normalizations:

| Dataset    | Original Dimension | PCA | Sphere + PCA | 0-to-1 + PCA | ZCA + PCA |
|------------|--------------------|-----|--------------|--------------|-----------|
| proteomics | 109                | 3   | 57           | 24           | 65        |
| dnasim     | 403                | 2   | 1            | 3            | 13        |
| isolet     | 617                | 381 | 400          | 384          | 606       |
| usps       | 256                | 168 | 190          | 168          | 253       |
| covertype  | 54                 | 36  | 48           | 36           | 49        |

Comments:

  1. would you mind posting your code?

  2. You reduce the dimension using PCA by keeping only as many eigenvectors as needed to explain 99% of the variance -- what's the dimensionality, then, of the transformed data? How much lower is it than the dimensionality from the greedy forward feature selection in your last post?

  3. maverick: it's kind of a mess! i can post individual functions or datasets, if there's something in particular you're interested in.

    brooks: good question. i'll run some experiments later & make an addendum to the post.

  4. Piotr: On the other hand, PCA might combine several noisy redundant features into a single axis, which could potentially be beneficial. I don't think it's possible to say what effect PCA will have without reference to particular data.

    Sergey: It *seems* (very dangerous word) that linear classifiers should be affected by rotations in the feature space differently than axis-aligned thresholders (a.k.a. decision trees). Any chance you'll try the same experiments with a linear SVM?

    Replies
    1. an unregularized linear classifier will not be affected by rotations and scaling. however! when using regularization, the same regularization parameter may yield better or worse results. if you cross-validate thoroughly, i think you should be able to get nearly identical performance.

      just for fun though, i tried a multi-class regularized ridge regression classifier before and after ZCA (rotation + scaling only). the results are very close, even though i used the same default regularization parameter.
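      if you want to reproduce that kind of check, here's a rough sketch of what i mean (sklearn's RidgeClassifier on the digits data as a stand-in, not the code i actually ran):

      ```python
      import numpy as np
      from sklearn.datasets import load_digits
      from sklearn.linear_model import RidgeClassifier
      from sklearn.model_selection import train_test_split

      X, y = load_digits(return_X_y=True)   # stand-in dataset
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

      # ZCA-whiten using training statistics only (rotation + scaling, keeps all dimensions).
      mu = X_train.mean(axis=0)
      vals, vecs = np.linalg.eigh(np.cov(X_train - mu, rowvar=False))
      W = vecs @ np.diag(1.0 / np.sqrt(vals + 1e-5)) @ vecs.T

      # same regularization parameter before & after whitening
      raw = RidgeClassifier(alpha=1.0).fit(X_train, y_train)
      zca = RidgeClassifier(alpha=1.0).fit((X_train - mu) @ W, y_train)
      print("raw:", raw.score(X_test, y_test))
      print("zca:", zca.score((X_test - mu) @ W, y_test))
      ```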

  5. What was the dimensionality of those various datasets? I wonder if PCA is more helpful if your number of dimensions is large with respect to the size of your dataset??

    Replies
    1. hi amy! see the Addendum table at the bottom of the post =)

  6. I hate how the answer with these things always seems to be "do it all the ways then validate"...

    work, work, work :-)

  7. If you have a good predictor in your dataset and another variable that is highly correlated with the good predictor, both will be projected onto the same dimension, and noise is added to the good predictor. It's like blurring the "good" variable.
    I generated a toy dataset which illustrates that problem (btw. my post was inspired by yours):
    http://machine-master.blogspot.de/2012/08/pca-or-polluting-your-clever-analysis.html

  8. The results for PCA on 0-to-1 normalized data also look encouraging, and I am wondering why they weren't mentioned. That combination achieves much better dimensionality reduction with comparable accuracy (relative to ZCA+PCA).

  9. When should you use PCA on your data, and what kind of dataset is PCA suited for?

  10. Hi Sergey,

    Nice work. It seems pre-processing before PCA helps in obtaining the useful max-variance dimensions. I am curious about one thing - what happens to the performance comparison if you extract n/2 dimensions (n is the # of original dimensions) for each pre-processing case? It might actually be useful to plot the performance curves for each of these cases, while mentioning the 99% variance caps, to get the full picture. It might be that ZCA+PCA is 'steadier' in choosing dimensions (the # of dimensions extracted is consistently higher) and might perform worse if restricted to a smaller number of dimensions.

  11. Hi Sergey,

    Wonderful article! I'm curious about your accuracy estimation. Is this prediction accuracy based on a held-out test set or cross-validation, and if not, what do you think the impact of PCA and other dimensionality reduction techniques would be in that case? It seems to me that multicollinearity in the predictors should make the model less stable on repeated subsamples of the data (or when generalizing to new data), which could be one motivation for PCA beyond dimensionality reduction. I also wonder whether the noise introduced by PCA may have some positive effect on the stability of the model. I have not run any simulations to test these hypotheses, so I'd love to know your thoughts before I delve further into the problem.

  12. I'm working on a Kaggle competition dataset, and applying PCA has destroyed my accuracy. I used both princomp and prcomp in R and ran a Random Forest model. rpart and knn show equally bad results, as does the ensemble model.
