
by ta20211004_1 1142 days ago

> My best guess is that the embedding represents the meaning of the natural language string it was generated from, such that strings with "similar" embeddings have similar meaning. But that's just speculation.

Yeah, you've got it. A mapping from words to vectors such that semantic similarity between words is reflected in mathematical similarity between vectors.

An idea of how you might train this thing: let's say the words "king" and "queen" are being embedded. In your training data there are lots of examples where "king" and "queen" are interchangeable. For example, in the sentence "The ____ is dead, long live the ____", either word is appropriate in either slot, so each time we see an example like this we nudge "king" and "queen" a little closer together in some sense. However, you also find phrases where they are not interchangeable, such as "The first born male will one day be ____". So when you see those examples you nudge "king" a little closer in some sense to other words which appropriately complete the sentence (which does not include "queen" in this case).

In this way, repeated over a giant training set with thousands of words, concepts like "male/female" and "royalty", "person/object" and tons of others end up getting reflected in the relationships between the vectors.

These vectors are then useful representations of words to ML models.
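To make the "nudge a little closer" idea concrete, here is a toy sketch. It is not how any real embedding model is trained; the `nudge` helper, the learning rate and the random starting vectors are all made up for illustration. It only shows that repeatedly pulling two vectors toward each other shrinks the distance between them.

```python
import math
import random

def nudge(vec, target, rate=0.1):
    """Move `vec` a small step toward `target` (hypothetical stand-in for a training update)."""
    return [v + rate * (t - v) for v, t in zip(vec, target)]

random.seed(0)
# Arbitrary random starting embeddings; a real model would learn these from text.
king = [random.gauss(0, 1) for _ in range(8)]
queen = [random.gauss(0, 1) for _ in range(8)]

before = math.dist(king, queen)
for _ in range(20):
    # Each time both words fit the same slot, pull them toward each other a bit.
    king, queen = nudge(king, queen), nudge(queen, king)
after = math.dist(king, queen)

print(f"distance before: {before:.3f}, after: {after:.3f}")
```

A real training procedure would also push apart words that do not fit the same contexts; otherwise everything would collapse to a single point.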

3 comments

atq2119 1142 days ago

Right, makes sense. But then what do you actually do with a database?

Starting with: what do you store in it?

Maybe sentence/vector pairs. But what does that give you? What do you do with that data algorithmically? What's the equivalent of a SELECT statement? What's the application that benefits an end user? That part still seems rather hazy.


ndriscoll 1142 days ago

I haven't worked in this space, but from what I gather, the idea would be something along the lines of the following:

An autoencoder is a model that takes a high dimensional input, distills it down to a low dimensional middle layer, and then tries to rebuild the high dimensional input again. You train the model to minimize reconstruction error, and the point is then that you can run an input on just the first half to get a low-dimensional representation that captures the "essence" of the thing (in the "latent space"). In this representation, images that are similar should have similar "essences", so their latent vectors should be near to each other.

The low dimensional representation must do a good job capturing the "essence" of your things, otherwise your reconstruction error would be large. The lower the dimension you manage to use while still managing to reconstruct your things, the better a job it must do at making those parameters really encode the salient features of your thing without wasting any information. So similar things should be encoded similarly.
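A minimal sketch of the encode/decode idea, assuming NumPy is available. Instead of training a neural network, it uses the best linear encoder/decoder pair, which can be computed directly with an SVD (this is just PCA standing in for a real autoencoder). The toy data and the names `encode`, `decode` and `W` are assumptions for illustration.

```python
import numpy as np

# Toy data: 50 points in 3-D that all lie on one line, so a single number
# per point is enough to describe each of them.
t = np.linspace(-1.0, 1.0, 50)
X = np.outer(t, [1.0, 2.0, 3.0]) + np.array([5.0, 0.0, -2.0])

mean = X.mean(axis=0)
# Rows of Vt are the principal directions of the centered data.
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
W = Vt[:1].T  # shape (3, 1): keep only the top direction

def encode(x):
    """High-dimensional input -> low-dimensional latent code."""
    return (x - mean) @ W

def decode(z):
    """Latent code -> reconstruction in the original space."""
    return z @ W.T + mean

Z = encode(X)  # shape (50, 1): the "essence" of each point
reconstruction_error = np.abs(decode(Z) - X).max()
print(Z.shape, reconstruction_error)
```

Because the toy points really are one-dimensional, the reconstruction error is essentially zero, and points that are close in the original space get close latent codes.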

So imagine you've got a database of images, and you have a table of all of the low dimensional encoded vectors. You want to do a reverse image search. The user sends you an image, you run the encoder on it to get the latent representation, and then you want to essentially run "SELECT ei.image_id FROM encoded_images ei ORDER BY distance(encode(input_image), ei.encoding) LIMIT 10".

So you want a database that supports indexes that let you efficiently run vector similarity queries/nearest neighbor search, i.e. that support an efficient "ORDER BY distance(_, indexed_column)". Since the whole process was fuzzy anyway, you may actually want to support an approximate "ORDER BY distance" for speed.
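For intuition, the "ORDER BY distance ... LIMIT n" query can be written as a brute-force scan. This is only a sketch with made-up three-dimensional encodings and hypothetical image ids; a vector database would replace the full scan with an index (often an approximate one) so it does not have to compare against every row.

```python
import math

def nearest(query, encoded_images, limit=10):
    """Brute-force version of ORDER BY distance(query, encoding) LIMIT n."""
    ranked = sorted(
        encoded_images.items(),
        key=lambda item: math.dist(query, item[1]),  # Euclidean distance
    )
    return [image_id for image_id, _ in ranked[:limit]]

# Hypothetical table of image_id -> latent vector.
encoded_images = {
    "cat_1": (0.9, 0.1, 0.0),
    "cat_2": (0.8, 0.2, 0.1),
    "car_1": (0.0, 0.9, 0.4),
    "tree_1": (0.1, 0.0, 0.95),
}

# Pretend this is encode(input_image) for an uploaded cat photo.
query = (0.88, 0.1, 0.0)
result = nearest(query, encoded_images, limit=2)
print(result)
```

The scan costs one distance computation per stored vector, which is exactly the part an index is meant to avoid.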

In practice apparently the encoding might be taking the output of the first or nth layer in a deep network or something rather than specifically using an autoencoder. Or you may have some other way to hash/encode things to produce a latent representation that you want to do distance searches on. And of course images could instead be documents or whatever you want to run similarity searches on.


qorrect 1142 days ago

What a great explanation thank you.


browsewhilepoop 1142 days ago

Often the use case is search. Ex. You have a basic text search engine to find musicians on your site which does some string matching and basic tokenization and so on. But you want to be able to surface similar types of musicians on search too.

In that case you might store vectors representing a user based on some features you've selected, or a word embedding of their common genres/tags.

To actually search this thing, you need something to compare against. You could directly use the word embeddings of the search query. You could also do a search against your existing method, and then use the top results from that as a seed to search your vectors.

Since everything's a vector, you can also ask questions like "what musician is similar to Tom AND Sally" by looking for vectors near T+S. T-S could represent like Tom but not like Sally, etc.
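A toy sketch of the T+S and T-S idea, using invented three-dimensional "genre" vectors (jazz, metal, folk). The musicians other than Tom and Sally, the numbers and the `most_similar` helper are all hypothetical; real embeddings have hundreds of dimensions and will not behave this cleanly.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

# Made-up vectors: (jazz, metal, folk)
musicians = {
    "Tom": (1.0, 0.0, 0.2),
    "Sally": (0.0, 1.0, 0.2),
    "Ana": (0.7, 0.7, 0.1),  # mixes jazz and metal
    "Ben": (0.9, 0.0, 0.1),  # mostly jazz
    "Cy": (0.0, 0.1, 1.0),   # mostly folk
}

def most_similar(target, exclude):
    """Name of the stored musician whose vector points most nearly along `target`."""
    candidates = {n: v for n, v in musicians.items() if n not in exclude}
    return max(candidates, key=lambda n: cosine(target, candidates[n]))

T, S = musicians["Tom"], musicians["Sally"]
like_both = most_similar(add(T, S), exclude={"Tom", "Sally"})
like_tom_not_sally = most_similar(sub(T, S), exclude={"Tom", "Sally"})
print(like_both, like_tom_not_sally)
```

With these numbers, T+S lands nearest the musician who mixes both styles, and T-S lands nearest the one who shares Tom's style but none of Sally's.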

So the answer to what do you store is, what will be your seed to search against?


cubefox 1142 days ago

Wow, that's interesting. So we can do vector arithmetic and the results make sense as a form of embedding/concept logic. Addition seems to work like "and" / set intersection. Subtraction works like set difference ("T\S", also literally written "T-S", i.e. "without"), which logically says "T, but not S", or in terms of predicate calculus, "T(x) & not S(x)".

Perhaps there is also some unary vector operation which directly corresponds to negation (not Sally)? Perhaps multiplying the vector by -1? Or would (-1)S rather pick out "the opposite of Sally in conceptual space" instead of "not / anyone but, Sally"? And what about logical disjunction (union)? One could go further here, and ask whether there is an analog to logical quantifiers. Then there is of course the question whether there is anything in vector logic which would correspond to relations, binary predicates, like R(x, y), not just unary ones, etc.

(Sorry for rambling, I'm thinking out loud here.)


frabcus 1142 days ago

The vectors are usually (if you use OpenAI API anyway) unit in length, and so you can imagine them on the surface of a hypersphere.

You measure the cosine distance between documents, or between search queries and documents. (Cosine is fast; there are other distance metrics.)

The vector database queries will do things like given one embedding (document or query) find the nearest embeddings (documents). Or given two embeddings (e.g. a query and a context) with a weight for each one, find the ones that triangulate to being near both.
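A small sketch of both kinds of query, with made-up three-dimensional vectors. For unit-length vectors the cosine similarity is just the dot product. The `weighted_query` helper is one assumed way to "triangulate" between two embeddings (a weighted sum pushed back onto the unit sphere), not a description of what any particular vector database does.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """For unit vectors, cosine similarity is the dot product."""
    return 1.0 - dot(a, b)

def weighted_query(query, context, w_query=0.7, w_context=0.3):
    """Blend two unit vectors and project the result back onto the unit sphere."""
    blended = tuple(w_query * q + w_context * c for q, c in zip(query, context))
    return normalize(blended)

# Hypothetical document embeddings.
docs = {
    "doc_a": normalize((1.0, 0.1, 0.0)),
    "doc_b": normalize((0.6, 0.6, 0.0)),
    "doc_c": normalize((0.0, 0.1, 1.0)),
}
query = normalize((1.0, 0.0, 0.0))
context = normalize((0.0, 1.0, 0.0))

nearest_to_query = min(docs, key=lambda d: cosine_distance(query, docs[d]))
both = weighted_query(query, context, 0.5, 0.5)
nearest_to_both = min(docs, key=lambda d: cosine_distance(both, docs[d]))
print(nearest_to_query, nearest_to_both)
```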


morgango 1142 days ago

Simple answer - you normally store text in it, but with the state of neural networks these days most things can be vectorized and searched.

So, coming myself from a database background but working in search, the SELECT statement (and joins) probably aren't the best way to get your head wrapped around things. I would think of the vector as a unique key for a record, and only using a LIKE statement for all my queries, but one that will return a probability of a match instead of an actual match.

A great use case is to think about similarity, where we want the things that are closest to what we want to see, but there isn't an exact match.

For example, a user gives me a sentence that says, "How long do I have to be with the company before I get a 401K match?" My vector store has a bunch of vectors, including "A new employee will be eligible for 401K after 6 months." and "The 401K program is run by <MEGACORP X>."

I would like to be able to see that the first vector is a closer match to the user sentence than the second, and by how much. I would also like to do this without having to change my code much based on the structure of the text. Luckily, there is a very simple algorithm for doing this (cosine similarity) that doesn't change regardless of the sentence structure or the question answered. Also, it doesn't matter what kind of question/answer you do as long as it can be vectorized, so you could even give me a vector representing an image and I can give you an image that is most similar.

Here is the most interesting thing about vectors -- with very little effort they turn the English language into a programming language.

Instead of typing "SELECT document_id, document_name, document_body FROM documents WHERE (document_body LIKE '%401K%' AND document_body LIKE '%match%' AND document_body LIKE '%existing employee%')" I can just ask, "How long do I have to be with the company before I get a 401K match?" and I will get back a result and a match probability. How I change my text will change the matches, and can do so in ways that are profound and unexpected. Note that the SQL query I gave would not return any values because I didn't have any documents that had the term "existing" in them. Building the correct SQL query could be quite complex in comparison to just using the text.

This is pretty great for long-tailed search, q&a, image search, recommendations, classification, etc.

BTW, I am biased, I work for Elastic (makers of Elasticsearch) and we have been doing traditional search forever, and vector/hybrid search for the last few years.


esafak 1142 days ago

The use case is a specific type of search:

* https://en.wikipedia.org/wiki/Semantic_search

* https://en.wikipedia.org/wiki/Similarity_search


manytree8 1142 days ago

Great explanation, thank you!


therealdrag0 1142 days ago

How is each dimension maintained to have a sticky meaning among scenarios?


dsubburam 1142 days ago

Because the model used to compute the embeddings is the same across scenarios. You can infer meaning for each dimension by checking which inputs get embeddings that have large values for the dimension.

If the inputs are images, you may find that some dimension scores e.g. how much blue there is in the image. Often it's not that simple, though: there could be multiple dimensions that relate to how blue the image is, especially if the embedding dimensionality is large, which it does tend to be these days. In that case you could reduce the embedding dimensionality first using PCA, and see what input images correspond to high/low values of the first principal component, etc.


ptaken 1142 days ago

Dimensions themselves do not carry any meaning; what matters are the neighbors, which maintain a sense of similarity. Think of it like a very complex point cloud. Applying an n-dimensional rotation leads to the same point cloud content-wise.
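The rotation point can be checked on a tiny example. This sketch uses an invented two-dimensional "point cloud" (the words and coordinates are made up) and an arbitrary rotation angle: every coordinate changes, yet each point keeps the same nearest neighbor because rotations preserve distances.

```python
import math

def rotate(point, angle):
    """Rotate a 2-D point counter-clockwise by `angle` radians."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def nearest_neighbor(name, cloud):
    """Name of the closest other point in the cloud."""
    others = (n for n in cloud if n != name)
    return min(others, key=lambda n: math.dist(cloud[name], cloud[n]))

# Hypothetical 2-D embeddings.
cloud = {
    "king": (2.0, 1.0),
    "queen": (2.2, 1.3),
    "apple": (-1.0, 0.5),
    "pear": (-1.2, 0.1),
}
rotated = {name: rotate(p, math.radians(73)) for name, p in cloud.items()}

same_neighbors = all(
    nearest_neighbor(n, cloud) == nearest_neighbor(n, rotated) for n in cloud
)
print(same_neighbors)
```

So after the rotation the individual axes mean something different, but any query phrased in terms of distances or neighbors gives the same answer.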

As for the number of dimensions, in a sense they are a training variable just as the content itself. The more dimensions you utilize for your embeddings, the more complex your relations can be during clustering. Too many dimensions can easily lead to overfitting, however, and too few dimensions usually cannot accurately represent the training corpus.


esafak 1142 days ago

All the embeddings (vectors) are usually generated at the same time, and regenerated periodically. Does this answer your question?
