Hacker News
by projectramo 2786 days ago
Hi mlucy,

I agree with what you're saying here. I just wonder how it would work in practice.

So imagine I have this monster text or image, and I want to know if it looks like another text or image.

I send each to Basilica, it gives me back two vectors and I compare the vectors.

I use the cosine of the angle between the vectors as a similarity score, and let's say it comes out to be 0.6.
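To make the comparison step concrete, here is a minimal sketch of a cosine similarity score in plain Python. The two vectors are made-up stand-ins for whatever the embedding service returns; nothing here is Basilica's actual API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical stand-ins for the two embedding vectors (real ones are much longer).
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 1.0, 0.0]
score = cosine_similarity(vec_a, vec_b)
```

The score ranges from -1 to 1, and it depends only on the direction of the vectors, not their length.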

However, I think this is too low, and I want to tweak my algorithm.

At this point, doesn't the question of how the vector was generated come to the front? Did you get rid of common words, how did you treat stems, and so on? Or what biases did you introduce in training?

Furthermore, these questions come up right away, and they seem fundamental to whatever the main practice is.

In other words, can I even experiment or get started without knowing how the word2vec model works?

1 comments

You're definitely right that you sometimes need to know the exact details of how an embedding is produced, especially if you're doing cutting-edge work. That's one of the things we really need to improve documentation-wise. I'd like to have a page for each embedding that talks about how it's generated, what to watch out for while working with it, etc.

I'm going to narrow in on the question of how to go about tweaking a model that uses an embedding, since I think it's a really interesting topic.

To use your first example, let's say you're doing the image similarity task. You probably wouldn't be computing the cosine distance on the embeddings directly. You'd probably normalize and then do PCA to reduce the number of dimensions to 200 or so.
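A rough sketch of that pipeline with NumPy, in case it helps. The assumptions are mine: "normalize" is taken to mean scaling each embedding to unit L2 norm, PCA is done with a plain SVD, and the embeddings are random stand-in data rather than real image embeddings.

```python
import numpy as np

def normalize_rows(x):
    """Scale each row to unit L2 norm (all-zero rows are left as zeros)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)

def pca_reduce(x, n_components):
    """Project the rows of x onto the top principal components, via SVD."""
    mean = x.mean(axis=0)
    centered = x - mean
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, mean, components

def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    unit = normalize_rows(x)
    return unit @ unit.T

# Made-up data: 300 "images", each with a 512-dimensional embedding.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 512))

reduced, mean, components = pca_reduce(normalize_rows(embeddings), 200)
sims = cosine_similarity_matrix(reduced)  # sims[i, j] compares image i with image j
```

To score a new image, you'd normalize its embedding, subtract `mean`, and multiply by `components.T` before comparing. The normalization choice and the number of components are the knobs to fiddle with.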

If you weren't getting good results, you'd have a few options. You could fiddle with the normalization and PCA steps, which can have a big effect. You could also include other handcrafted features alongside the embedding. But let's say you have a fundamental problem, like your similarity score is paying too much attention to the background of your images rather than the foreground.

There are two major approaches to solving that sort of problem with embeddings: preprocessing or postprocessing. You could preprocess the images before embedding them to de-emphasize the backgrounds (e.g. by cropping more tightly to what you care about). You could also postprocess the embeddings. For example, you could label which of your images have similar backgrounds, and instead of naive PCA you could extract components that maximally explain variance while having minimal predictive power for background.
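Here is one simple way the postprocessing idea could be sketched. It is a linear stand-in, not necessarily the method described above: it projects out the directions along which the mean embedding differs between background groups, then runs ordinary PCA on what is left. The labels, data, and function names are all hypothetical.

```python
import numpy as np

def remove_group_mean_directions(x, groups):
    """Project out the directions along which the group means differ."""
    x = np.asarray(x, dtype=float)
    groups = np.asarray(groups)
    centered = x - x.mean(axis=0)
    # One row per background group: offset of its mean from the overall mean.
    offsets = np.stack(
        [centered[groups == g].mean(axis=0) for g in np.unique(groups)]
    )
    # Orthonormal basis for the span of those offsets.
    _, s, vt = np.linalg.svd(offsets, full_matrices=False)
    basis = vt[s > 1e-10 * s.max()]
    return centered - (centered @ basis.T) @ basis

def pca_project(x, n_components):
    """Project the rows of x onto the top principal components, via SVD."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Made-up data: embeddings = "foreground" noise + a large per-background offset.
rng = np.random.default_rng(0)
n, d = 300, 64
background = rng.integers(0, 3, size=n)  # hypothetical background labels
bg_offsets = rng.normal(scale=5.0, size=(3, d))
embeddings = rng.normal(size=(n, d)) + bg_offsets[background]

cleaned = remove_group_mean_directions(embeddings, background)
reduced = pca_project(cleaned, 20)
```

After this step every background group has the same mean embedding, so the top components can no longer be dominated by those mean differences. It only removes linear, mean-level background signal; anything subtler would need a stronger method.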

I definitely agree you should add more documentation on how the word vector models are generated. You may also want to offer a set of models for the user to choose from. For example, Wikipedia is a good data source for general language use cases, but for something more technical, such as finance, SEC filings are a better one.