
Nice visualization! This gives me an opportunity to go off on a random tangent about PCA:

The post considers PCA from a visualization perspective, but exactly the same technique can also be viewed as a method for reducing the number of dimensions in the original dataset. [1] Now, one of the interesting questions in a dimensionality reduction task is: how do you pick a good number of dimensions (principal components) to keep, in a principled way, instead of just computing the next component and the next and the one after that until you get bored? (Stopping early works for visualizations, where you often want only the first two or three components anyway, but suppose we want more from the data than plots.)

I recently learned that there's a fascinating way to do this, presented in Bishop's paper [2] from 1999. In short: the question can be answered by recasting PCA as a Bayesian latent variable model with a hierarchical prior. (Yes, it is a bit of a mouthful to say. Yes, it is fairly mathematical, unlike the visualization.)

[1] https://en.wikipedia.org/wiki/Dimensionality_reduction

[2] https://www.microsoft.com/en-us/research/publication/bayesia...



I've always used Jackson's broken-stick method for determining the number of dimensions to retain.

http://onlinelibrary.wiley.com/doi/10.2307/1939574/abstract
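For reference, the broken-stick rule is simple to implement: keep component k while its share of the total variance exceeds the expected share of the k-th longest piece of a stick broken at random, b_k = (1/p) * sum_{i=k..p} 1/i. A minimal numpy sketch (the eigenvalues in the example are made up for illustration):

```python
import numpy as np

def broken_stick_k(eigenvalues):
    """Number of leading components whose share of variance exceeds
    the broken-stick expectation b_k = (1/p) * sum_{i=k..p} 1/i."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    p = lam.size
    shares = lam / lam.sum()
    # expected share of the k-th longest piece of a randomly broken stick
    bs = np.array([np.sum(1.0 / np.arange(k, p + 1)) / p
                   for k in range(1, p + 1)])
    keep = 0
    for observed, expected in zip(shares, bs):
        if observed > expected:
            keep += 1
        else:
            break
    return keep

broken_stick_k([10, 5, 1, 0.5, 0.5])  # keeps the first two components
```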


Frank Harrell demonstrates a way to use PCA to reduce dimensionality in his Regression Modeling Strategies. His course notes (PDF) are a good reference point for multiple regression strategies.

http://biostat.mc.vanderbilt.edu/wiki/Main/RmS


You can look at it in terms of reconstruction error.
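A sketch of what that looks like in practice, on random stand-in data: project onto the first k components (here via the truncated SVD of the centered data, which is equivalent to keeping k PCs) and measure the squared error of the reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))   # stand-in data; any matrix works
Xc = X - X.mean(axis=0)         # PCA assumes centered data

# rank-k reconstruction via the truncated SVD
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

def recon_error(k):
    Xk = (U[:, :k] * s[:k]) @ Vt[:k]
    return np.linalg.norm(Xc - Xk) ** 2   # squared Frobenius error

errors = [recon_error(k) for k in range(7)]
# errors decreases monotonically and reaches ~0 at full rank (k = 6),
# which is why "lowest reconstruction error" alone never tells you to stop
```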


Yeah, PCA gives you the eigenvalues of the PCs in descending order of variance explained, so summing and normalizing them tells you that, say, the first 3 PCs explain 93% of the variance in the data.
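For concreteness, that recipe is a couple of numpy lines (the eigenvalues below are made up for illustration):

```python
import numpy as np

# hypothetical eigenvalues (PC variances), already in descending order
eigvals = np.array([5.0, 3.0, 1.5, 0.3, 0.2])
explained = eigvals / eigvals.sum()   # fraction of variance per PC
cumulative = np.cumsum(explained)    # -> [0.5, 0.8, 0.95, 0.98, 1.0]
# here the first 3 PCs explain 95% of the variance
```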


But the question is when exactly to stop, because the reconstruction error will always keep getting lower. The Bayesian solution, IIUC, is something like "stop when the information required to store the new PCs exceeds the information gained from the reduced reconstruction error."

The only issue with this is that with tons of data there is less uncertainty in the principal components, so it will recommend keeping as many as possible, even if they only decrease the reconstruction error a tiny bit.


This is the standard approach when looking for reduced-order models in fluid mechanics.

Variations on this: i) how 'faithfully' does it represent the data, e.g. how many modes (components) are needed to resolve a particular metric, or the entire system, to a given accuracy; ii) what is the cut-off component number at which the signal is of the order of the measurement uncertainty.


Heh, I just happened to have MATLAB on my other screen performing a PCA of a CFD simulation. I got bored of watching the progress bar and decided to browse HN a bit. I guess I cannot escape...



