Cool work! Something that worries me with PCA is that it's designed to retain variance, but variance might nor be the right metric for the semantics we want to retrieve. Ditto UMAP/tSNE that retains distances in lower dimensionality... If semantics are mostly encoded as directions on subsets of dimensions, PCA and friends would be too blunt of a tool... I wonder if a better approach would be linear probes or other decoders for a wide range of the concepts one wants to retrieve, and then optimize compression while keeping those retrievals as high as possible... i.e. tune the compressor to the usecase, like MP3 or MPEG do.
In the article, you mention this approach requires no search over hyper-parameter, because the method comprises a closed-form solution with "simple" linear algebra. I agree with this, but do you not in think need to tune the L2-regularization strength? That would for me be a hyper-parameter you would need to do a CV over (or similarly).
Fair point — lam is technically a hyperparameter. In practice I used lam=1e-3 (the default in the code) across all four models without tuning, and the gap to PCA is robust enough that small variations don't change the conclusion. So more accurately: "one hyperparameter with a benign default" rather than "no hyperparameters" — you're right I overstated.
Looking at your experiment code, it seems like the retrieval experiments are done with the reconstructed vectors of dimension D rather than the compressed vectors of dimension d, which doesn't have any direct performance improvements. Later on in the post you indicate that the real advantage is that the residuals are more isotropic and therefore you can quantize the pair (p, V_resid) with less quality degradation, but I don't see any experiments actually verifying that retrieval quality holds up in this setting. Also, it's not quite clear to me how you efficiently compute cosine similarity for vectors encoded in this form. Doesn't the V_resid part of the computation require something significantly more complex than a dot product?
I agree that we don't want to reconstruct the whole vector while retrieval and it makes poly-AE toy-like at the current state non production ready. My main interest here in the just taking more recall pp in closed form. And then think about how to make it fast. In all threads I got a good intermediate thoughts about the topic which may help me to bring to closer to production form
You should benchmark the retrieval speed of each method in terms of queries per second. I suspect that the gain in bandwidth you get from slightly better compression will be defeated by decompression being much more expensive.
I think your per-axis std normalization is likely doing a big pile of the work —- it’s fairly well-known that “wrong” PCA, setting sigma=Id or just taking a square root, gives better embeddings than the un-normalized version. It would be worth showing a comparison to similarly-normalized PCA I think, if it’s not too hard?
Good catch, this is the obvious ablation I should have included. I'll re-run with per-axis normalized PCA as a separate baseline and post numbers in this thread tomorrow.
Prior: I expect some of the gap to come from normalization, but not all — the no-improvement results on isotropic datasets (§4) suggest there's structural signal the polynomial cross-terms catch that linear projection structurally can't. But that's a prediction; let me actually run it.
Just checked the normalization point. You were partially right, sqrt-normalization makes the difference x2 less. I'm updating the numbers in the post.
Interesting moment. I did a smoke test of poly-AE without whitening, and the result didn't change. I won't mention it in the post cause right now I'm not sure if it's a random effect or really a polynomial lift compensates normalization
Cool idea. But it only works when the data never changes.
could you make a streaming/incremental version? One that updates the math cheaply when new data arrives, instead of recomputing everything, or does the math fundamentally prevent it?