by peteforde 643 days ago
Really cool to see my favs show up, but I honestly don't understand what we're actually looking at; the groupings seem very opaque beyond very general themes like sci-fi, startups, biographies, math, physics.

In other words, what are the clustering shapes telling us? Can we dig in based on geography, publishing date, key terms or themes?

Either way, I can't keep the site open for more than 30-40 seconds before it crashes. I suspect that's not the goal!

Is Cryptonomicon the best fiction book, or is the data wrong?

3 comments

There's a regularly recurring misconception about embeddings: that they're very well behaved in visual dimensions.

IMHO it's a category error that results from tutorials using the king - man + woman = queen example (which, funnily enough, wasn't even true for the original word2vec, if commentary I've read previously here is correct).
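To make the analogy arithmetic above concrete, here is a minimal sketch with hand-made 2-D vectors. The numbers are invented for illustration and do not come from word2vec or any real model. The `exclude_inputs` flag is a hypothetical name. It shows one way the result can depend on whether the query words themselves are allowed as answers, which may be what the commentary mentioned above refers to.

```python
import math

# Hand-made 2-D toy vectors (hypothetical, not from word2vec or any real model).
# Axis 0 loosely stands for "royalty", axis 1 for "gender".
VECTORS = {
    "king": (1.0, 0.2),
    "queen": (1.0, -0.1),
    "man": (0.2, 0.2),
    "woman": (0.2, 0.1),
    "apple": (-0.5, 0.9),
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c, exclude_inputs):
    """Return the word closest to vec(a) - vec(b) + vec(c) by cosine similarity."""
    target = tuple(x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c]))
    skip = {a, b, c} if exclude_inputs else set()
    candidates = [w for w in VECTORS if w not in skip]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))


print(analogy("king", "man", "woman", exclude_inputs=False))
print(analogy("king", "man", "woman", exclude_inputs=True))
```

With these toy numbers the nearest vector to the result is "king" itself, and "queen" only wins once the three input words are excluded from the candidates.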

Working with them a lot has me picture them more as "a multivariate function that outputs 768 numbers, and was learned by brute force" than "something that sees in 768 dimensions" --- of course, they're both true, but the second interpretation shades more than it illuminates once you're past the very first interrogatory of "so what is this calculating, exactly?"

How well behaved they are visually depends on what drives variance and what you’re hoping to see. There are certainly some nice properties in some dimensionality reductions, but if you flatten a space of faces, a query like “brown hair” is less likely to be embedded in any visually interesting way than an actual face used as the query.

Put more clearly, symmetric retrieval is easier to visualize in a dimensionality-reduced space than asymmetric retrieval.

I suspect that some form of multi-vector document embedding would be more understandable in the reduced space than this single-vector representation.

The crash was indeed not intended - my mistake! Should be fixed now.

You've got the cluster semantics spot on, to be honest. Broad genres are grouped together, with a tendency for sub-genres to be grouped locally within those.

There is no interpretation of the overall shapes or the global structure; those are more a result of a particular UMAP run than inherent in the data.
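One way to check which parts of a layout are stable is to compare nearest-neighbor sets between two layouts of the same items. The sketch below does not run UMAP. It uses two made-up 2-D layouts, where the second is the first rotated and shifted, to show that the overall shape can change while the local neighborhoods stay identical. The function names and the choice of `k` are assumptions for illustration.

```python
import math


def knn_sets(points, k):
    """For each point, return the set of indices of its k nearest neighbors."""
    result = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Break distance ties by index so the result is deterministic.
        others.sort(key=lambda j: (math.dist(p, points[j]), j))
        result.append(set(others[:k]))
    return result


def neighbor_overlap(layout_a, layout_b, k=2):
    """Mean Jaccard overlap of k-NN sets between two layouts of the same items."""
    sets_a, sets_b = knn_sets(layout_a, k), knn_sets(layout_b, k)
    scores = [len(a & b) / len(a | b) for a, b in zip(sets_a, sets_b)]
    return sum(scores) / len(scores)


# Two hypothetical 2-D layouts of the same six items: layout_b is layout_a
# rotated by 90 degrees and shifted, so it looks different on screen.
layout_a = [(0, 0), (1, 0), (0, 2), (10, 10), (11, 10), (10, 12)]
layout_b = [(-y + 5, x - 3) for x, y in layout_a]

print(neighbor_overlap(layout_a, layout_b))
```

An overlap of 1.0 means every item keeps the same neighbors in both layouts. Real runs with different seeds would likely score lower, and the items whose neighbors change are the ones whose placement deserves less trust.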

Would love to provide different views on it and go more in depth next, thanks for the suggestion.

IMO, evolution over time is a great place to start.

> Either way, I can't keep the site open for more than 30-40 seconds before it crashes.

Yup, probably was about to happen to me too, had I not closed it.

CPU fan almost launched off the troposphere about 30 seconds in.

Probably a cluttered bunch of heavily unoptimized ReactJS modules in there (no offense to OP, I know it probably sped up development by 10x at least)

Nope, hug of death it seems:

Failed to load module script: Expected a JavaScript module script but the server responded with a MIME type of "text/html". Strict MIME type checking is enforced for module scripts per HTML spec.

Ad infinitum for a list of a couple .js files with repeating names.
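A common cause of this error, though not confirmed for this site, is a server answering a missing script path with an HTML fallback page and a 200 status. A minimal sketch for checking a deployed build follows. The function names are hypothetical, and the set of accepted types is only a subset of what browsers treat as JavaScript MIME types.

```python
import urllib.request

# A subset of the MIME types browsers accept for JavaScript module scripts.
JS_MIME_TYPES = {
    "text/javascript",
    "application/javascript",
    "application/x-javascript",
    "text/ecmascript",
    "application/ecmascript",
}


def mime_essence(content_type):
    """Strip parameters such as '; charset=utf-8' and normalize case."""
    return content_type.split(";", 1)[0].strip().lower()


def is_js_mime(content_type):
    """Return True if the Content-Type header value names a JavaScript type."""
    return mime_essence(content_type) in JS_MIME_TYPES


def find_bad_scripts(urls):
    """Fetch each URL and return (url, content_type) pairs a browser would reject.

    Note: urlopen raises HTTPError for 4xx/5xx responses; this check targets
    the case where the server returns 200 with the wrong Content-Type.
    """
    bad = []
    for url in urls:
        with urllib.request.urlopen(url) as response:
            content_type = response.headers.get("Content-Type", "")
        if not is_js_mime(content_type):
            bad.append((url, content_type))
    return bad
```

Calling `find_bad_scripts` with the script URLs listed in the built `index.html` would return the ones served as `text/html`, which is the situation the error message describes.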

Guess we'll have to come back in a day or two to experience it in its full glory :).

Hey, thanks for reporting - this is fixed now. I messed up the static build and some browsers freaked out. By law of showing things publicly, I of course only tested in a browser that didn't. Hope you can give it another chance!