HN Mirror


	by pleshkov 45 days ago
	Author here — questions and pushback both welcome.

8 comments

brunosan 42 days ago

Cool work! Something that worries me with PCA is that it's designed to retain variance, but variance might not be the right metric for the semantics we want to retrieve. Ditto UMAP/t-SNE, which retain distances in lower dimensionality... If semantics are mostly encoded as directions on subsets of dimensions, PCA and friends would be too blunt a tool... I wonder if a better approach would be linear probes or other decoders for a wide range of the concepts one wants to retrieve, and then optimizing compression while keeping those retrievals as high as possible... i.e. tune the compressor to the use case, like MP3 or MPEG do.
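A toy illustration of the variance-vs-semantics worry, as a minimal sketch on synthetic data (the two-axis setup, the labels, and the `probe_accuracy` helper are all made up for illustration, not taken from the post). The label lives on a low-variance axis, so a 1-D PCA keeps the noisy high-variance axis and loses it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
labels = rng.integers(0, 2, size=n)

# Axis 0: high-variance noise unrelated to the label.
# Axis 1: low-variance axis that carries the label.
X = np.column_stack([
    rng.normal(0.0, 10.0, size=n),
    (2 * labels - 1) * 0.5 + rng.normal(0.0, 0.1, size=n),
])
Xc = X - X.mean(axis=0)

# PCA via SVD; rows of Vt are principal directions sorted by variance.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
z_pca = Xc @ Vt[0]    # the 1-D code PCA keeps
z_label = Xc[:, 1]    # the 1-D code along the label-carrying axis

def probe_accuracy(z, y):
    """Accuracy of thresholding a 1-D code at 0, taking the better polarity."""
    acc = float(np.mean((z > 0) == (y == 1)))
    return max(acc, 1.0 - acc)

acc_pca = probe_accuracy(z_pca, labels)
acc_label = probe_accuracy(z_label, labels)
print(f"top PCA component: {acc_pca:.2f}   label axis: {acc_label:.2f}")
```

Real embeddings are nowhere near this adversarial, so this only shows that the failure mode exists, not how often it bites in practice.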

Devilstro 42 days ago

In the article, you mention that this approach requires no hyper-parameter search, because the method has a closed-form solution using "simple" linear algebra. I agree with this, but do you not think you need to tune the L2-regularization strength? To me, that would be a hyper-parameter you would need to run CV over (or something similar).

pleshkov 42 days ago

Fair point — lam is technically a hyperparameter. In practice I used lam=1e-3 (the default in the code) across all four models without tuning, and the gap to PCA is robust enough that small variations don't change the conclusion. So more accurately: "one hyperparameter with a benign default" rather than "no hyperparameters" — you're right I overstated.
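For anyone who wants to check the sensitivity themselves, here is a minimal sketch of a held-out sweep over lam for a generic closed-form ridge fit. This is not the post's code: `ridge_closed_form`, the random feature matrix `F`, and the targets `Y` are hypothetical stand-ins, and only the lam=1e-3 default comes from the comment above.

```python
import numpy as np

def ridge_closed_form(F, Y, lam=1e-3):
    """Solve min_W ||F W - Y||^2 + lam * ||W||^2 via the normal equations."""
    k = F.shape[1]
    return np.linalg.solve(F.T @ F + lam * np.eye(k), F.T @ Y)

rng = np.random.default_rng(0)
F = rng.normal(size=(500, 20))                      # hypothetical features
W_true = rng.normal(size=(20, 8))
Y = F @ W_true + 0.01 * rng.normal(size=(500, 8))   # hypothetical targets

F_train, F_val = F[:400], F[400:]
Y_train, Y_val = Y[:400], Y[400:]

# Each lam costs one linear solve, so a sweep is cheap.
val_err = {}
for lam in (1e-5, 1e-3, 1e-1, 1e1, 1e3):
    W = ridge_closed_form(F_train, Y_train, lam)
    val_err[lam] = float(np.mean((F_val @ W - Y_val) ** 2))
    print(f"lam={lam:g}  val MSE={val_err[lam]:.6f}")
```

Because each value of lam is a single solve, reporting a small sweep like this alongside the default is an inexpensive way to back up the "benign default" claim.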

aesthesia 42 days ago

Looking at your experiment code, it seems like the retrieval experiments are done with the reconstructed vectors of dimension D rather than the compressed vectors of dimension d, which doesn't give any direct performance improvement. Later on in the post you indicate that the real advantage is that the residuals are more isotropic and therefore you can quantize the pair (p, V_resid) with less quality degradation, but I don't see any experiments actually verifying that retrieval quality holds up in this setting. Also, it's not quite clear to me how you efficiently compute cosine similarity for vectors encoded in this form. Doesn't the V_resid part of the computation require something significantly more complex than a dot product?
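To make the contrast with the linear baseline concrete, here is a small sketch on synthetic data (sizes and names are illustrative). For plain PCA with an orthonormal basis `W`, cosine similarity between mean-centered reconstructions equals cosine similarity between the d-dimensional codes, so scoring never needs the D-dimensional vectors. A nonlinear decoder has no such identity in general, which is the cost being asked about.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n = 64, 8, 1000
X = rng.normal(size=(n, D)) @ rng.normal(size=(D, D))   # correlated toy data
Xc = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:d].T            # (D, d) with orthonormal columns

Z = Xc @ W              # compressed codes, shape (n, d)
X_hat = Z @ W.T         # centered reconstructions, shape (n, D)

def cosine_matrix(A, B):
    """Pairwise cosine similarity between rows of A and rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

cos_codes = cosine_matrix(Z[:5], Z)          # computed in d dimensions
cos_recon = cosine_matrix(X_hat[:5], X_hat)  # computed in D dimensions
max_diff = float(np.abs(cos_codes - cos_recon).max())
print(f"max |cos_codes - cos_recon| = {max_diff:.2e}")
```

The identity relies on `W.T @ W` being the identity and on the mean not being added back; once either assumption is dropped, the two similarity matrices can differ.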

pleshkov 40 days ago

I agree that we don't want to reconstruct the whole vector during retrieval, and that this makes poly-AE toy-like in its current state, not production-ready. My main interest here was just in getting more recall percentage points in closed form, and then thinking about how to make it fast. Across all the threads I got some good intermediate thoughts on the topic, which may help me bring it closer to production form.

yorwba 42 days ago

You should benchmark the retrieval speed of each method in terms of queries per second. I suspect that the gain in bandwidth you get from slightly better compression will be defeated by decompression being much more expensive.
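A minimal sketch of what such a queries-per-second benchmark could look like, using brute-force NumPy search on random data. Everything here is a stand-in: `W` is a hypothetical linear codec (a random orthonormal projection), not poly-AE, and the sizes are arbitrary. The point is only the shape of the comparison: full-dimension search, search directly on codes, and search that decompresses the database first.

```python
import time
import numpy as np

def measure_qps(search_fn, queries, repeats=3):
    """Queries per second for one batched call, best of `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        search_fn(queries)
        best = min(best, time.perf_counter() - start)
    return len(queries) / best

def top_k(scores, k=10):
    """Unordered indices of the k highest scores in each row."""
    return np.argpartition(-scores, k, axis=1)[:, :k]

rng = np.random.default_rng(0)
n, D, d = 5000, 256, 32
db = rng.normal(size=(n, D)).astype(np.float32)
queries = rng.normal(size=(100, D)).astype(np.float32)

# Hypothetical linear codec: random orthonormal projection from D to d dims.
W, _ = np.linalg.qr(rng.normal(size=(D, d)))
W = W.astype(np.float32)
db_codes = db @ W

def search_full(q):
    return top_k(q @ db.T)

def search_compressed(q):
    return top_k((q @ W) @ db_codes.T)

def search_decompress_first(q):
    db_hat = db_codes @ W.T      # reconstruct every database vector
    return top_k(q @ db_hat.T)

qps = {
    "full": measure_qps(search_full, queries),
    "compressed": measure_qps(search_compressed, queries),
    "decompress_first": measure_qps(search_decompress_first, queries),
}
for name, value in qps.items():
    print(f"{name:>17}: {value:,.0f} queries/s")
```

Absolute numbers depend heavily on hardware, BLAS build, and batch size, so a fair comparison would run every method on the same machine and report recall next to QPS.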

dbfclark 42 days ago

I think your per-axis std normalization is likely doing a big pile of the work: it’s fairly well-known that “wrong” PCA, setting sigma=Id or just taking a square root, gives better embeddings than the un-normalized version. It would be worth showing a comparison to similarly-normalized PCA, I think, if it’s not too hard?
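One possible reading of the normalized-PCA baselines, as a hedged sketch: project onto the top d principal axes, then divide each axis by its standard deviation raised to a power. The `pca_codes` helper and its `power` parameter are names made up for this example, and mapping "sigma=Id" to full whitening and "square root" to power 0.5 is an interpretation, not something stated in the thread.

```python
import numpy as np

def pca_codes(X, d, power=0.0):
    """Project X onto its top-d principal axes, then divide each axis by std**power.

    power=0.0 -> standard PCA codes
    power=0.5 -> sqrt-normalized codes
    power=1.0 -> fully whitened codes (unit variance per axis)
    """
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    std = S[:d] / np.sqrt(len(X) - 1)    # per-axis std of the projected data
    return (Xc @ Vt[:d].T) / std ** power

rng = np.random.default_rng(0)
# Toy data with a strongly anisotropic spectrum.
X = rng.normal(size=(1000, 32)) * np.linspace(5.0, 0.1, 32)

Z_plain = pca_codes(X, d=8, power=0.0)
Z_sqrt = pca_codes(X, d=8, power=0.5)
Z_white = pca_codes(X, d=8, power=1.0)

for name, Z in [("plain", Z_plain), ("sqrt", Z_sqrt), ("white", Z_white)]:
    print(name, np.round(Z.std(axis=0, ddof=1), 2))
```

Running the same retrieval evaluation on all three variants would separate the effect of normalization from the effect of any nonlinear terms.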

pleshkov 42 days ago

Good catch, this is the obvious ablation I should have included. I'll re-run with per-axis normalized PCA as a separate baseline and post numbers in this thread tomorrow. Prior: I expect some of the gap to come from normalization, but not all — the no-improvement results on isotropic datasets (§4) suggest there's structural signal the polynomial cross-terms catch that linear projection structurally can't. But that's a prediction; let me actually run it.

pleshkov 40 days ago

Just checked the normalization point. You were partially right: sqrt-normalization makes the difference about 2x smaller. I'm updating the numbers in the post. One interesting thing: I did a smoke test of poly-AE without whitening, and the result didn't change. I won't mention it in the post because right now I'm not sure whether it's a random effect or whether the polynomial lift really compensates for normalization.

afxuh 42 days ago

Cool idea. But it only works when the data never changes. Could you make a streaming/incremental version? One that updates the math cheaply when new data arrives, instead of recomputing everything, or does the math fundamentally prevent it?
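On the "does the math prevent it" part: if the fit reduces to a ridge-style least-squares solve, as the closed-form and L2 discussion elsewhere in this thread suggests (an assumption here, not something confirmed from the post's code), then the solution depends on the data only through two accumulated matrices, and those can be updated batch by batch. A minimal sketch, with `StreamingRidge` as a hypothetical name:

```python
import numpy as np

class StreamingRidge:
    """Ridge regression that accumulates F^T F and F^T Y across batches."""

    def __init__(self, n_features, n_targets, lam=1e-3):
        self.lam = lam
        self.A = np.zeros((n_features, n_features))   # running F^T F
        self.B = np.zeros((n_features, n_targets))    # running F^T Y

    def update(self, F, Y):
        """Fold in a new batch; cost does not depend on how much data came before."""
        self.A += F.T @ F
        self.B += F.T @ Y

    def solve(self):
        k = self.A.shape[0]
        return np.linalg.solve(self.A + self.lam * np.eye(k), self.B)

rng = np.random.default_rng(0)
F = rng.normal(size=(600, 16))
Y = rng.normal(size=(600, 4))

model = StreamingRidge(n_features=16, n_targets=4, lam=1e-3)
for start in range(0, 600, 100):          # data arrives in chunks of 100 rows
    model.update(F[start:start + 100], Y[start:start + 100])
W_stream = model.solve()

# Reference: one-shot fit on all the data at once.
W_batch = np.linalg.solve(F.T @ F + 1e-3 * np.eye(16), F.T @ Y)
print("max abs difference:", float(np.abs(W_stream - W_batch).max()))
```

This only covers the regression step. Any upstream statistics that depend on the whole dataset, such as a PCA basis or per-axis normalization, would need their own incremental update, which this sketch does not address.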

stephantul 42 days ago

Really cool! I was investigating PCA on retrieval, thanks for the references.

roger_ 41 days ago

So this is PCA in kernel space?