by gk1 1653 days ago

Not much in common. A "vector" in Postgres is a tokenized and normalized array of words. So 'a fat cat sat on a mat and ate a fat rat' becomes 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'. This just makes keyword searches a bit easier. (Source: https://www.postgresql.org/docs/9.4/datatype-textsearch.html)
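As a rough illustration of the example above, here is a minimal Python sketch that mimics only what that one example shows: split on whitespace, drop duplicates, sort. The function name `to_simple_tsvector` is hypothetical, and this is not how Postgres implements it; the real `to_tsvector` also applies a text search configuration (stemming, stop-word removal), which this sketch ignores.

```python
def to_simple_tsvector(text: str) -> list[str]:
    """Toy stand-in for a keyword 'vector': sorted, de-duplicated tokens.

    Assumption: tokens are separated by whitespace and lowercased.
    No stemming or stop-word handling is done here.
    """
    tokens = text.lower().split()
    return sorted(set(tokens))


print(to_simple_tsvector("a fat cat sat on a mat and ate a fat rat"))
# ['a', 'and', 'ate', 'cat', 'fat', 'mat', 'on', 'rat', 'sat']
```

A keyword search then reduces to checking whether the query's tokens appear in that list, which is why it cannot match a synonym like "kitten".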

A "vector" in vector search solutions is a dense vector generated by a transformer model. A sentence like 'a fat cat sat on a mat and ate a fat rat' put through a model becomes an array of floating-point numbers like [0.183, -0.774, ...], with hundreds of values (often 768). The point is that this vector is positioned in a 768-dimensional space in close proximity to semantically similar sentences. So then searching by semantic meaning is a simple (well...) exercise in measuring the distance between your query and the surrounding vectors.
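To make "measuring the distance" concrete, here is a small sketch of brute-force nearest-neighbor search using cosine similarity. The 3-dimensional vectors are made up for illustration (a real model would produce hundreds of dimensions), and the names `cosine_similarity`, `nearest`, and `index` are hypothetical. Production systems typically use approximate indexes rather than scanning every vector like this.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query: list[float], index: dict[str, list[float]], k: int = 1) -> list[str]:
    """Return the k sentences whose vectors are most similar to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked[:k]]


# Made-up embeddings: the first two sentences point in a similar direction.
index = {
    "a fat cat sat on a mat": [0.9, 0.1, 0.0],
    "a plump kitten rested on a rug": [0.8, 0.2, 0.1],
    "quarterly revenue grew": [0.0, 0.1, 0.9],
}

query_vector = [0.85, 0.15, 0.05]  # pretend embedding of "chubby cat on carpet"
print(nearest(query_vector, index, k=2))
```

With these toy values, the two cat sentences rank above the revenue sentence even though the query shares no keywords with the second one.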

We have a whole course on the topic, coincidentally also on the front page of HN today: https://www.pinecone.io/learn/nlp

3 comments

rm999 1653 days ago

>A "vector" in vector search solutions is a dense vector generated by a transformer model.

Just a clarification that vectors are much broader than text vectors generated by transformer models. A more common application has been recommender models built on matrix factorization and other similar approaches. word2vec was another popular way to generate vectors, dating back to 2013. Vectors are a very general approach with many benefits regardless of how they were generated. That is what makes these vector search libraries so exciting.

I wrote an article about the power of matrix factorization vectors for music recommendations back in 2016: https://tech.iheart.com/mapping-the-world-of-music-using-mac...

We also discussed how we used convolutional neural networks (deep networks, but not transformers) to build vectors on the acoustic content of music: https://tech.iheart.com/mapping-the-world-of-music-using-mac...

forgingahead 1653 days ago

Thanks for the nice and clear summary!

At what point does it make sense to switch to using vector search solutions (compared to keyword searches)? Obviously Google et al need it, but for regular apps, maybe academic repositories and so on, is there a threshold at which we can start the discussion to switch?

Or phrased another way, when do we know we should be looking at vector search solutions to enhance the search in our application?

gk1 1653 days ago

When your users start complaining that your search sucks. :)

Either because your catalog has grown to the point where a traditional keyword search makes it harder to find relevant items, or because users are increasingly expecting their apps to just "know what they mean" (like Google, Spotify, Amazon, and Netflix do).

forgingahead 1653 days ago

I run a super niche academic database - there is often that expectation that "search should know what I mean", yet there isn't enough data (in my opinion) to make any semantic search meaningful. There are about 23k data objects that can interlink; 50% of those belong to one specific data type, and the rest are split across other types.

So we've stuck with simple keyword search with filters to drill into specific categories within results, all pretty vanilla on a relational database structure.

Just wondering if there is a "volume" heuristic to this - I'd like to explore this more but realistically sometimes the academic user-base has big dreams with severe practical limitations.

jamesbriggs 1653 days ago

You can train a good sentence transformer on ~10K sentences using TSDAE (an unsupervised training approach), covered in Chapter 7 here: https://www.pinecone.io/learn/unsupervised-training-sentence...

unbanned 1653 days ago

Why 768? What are these dimensions?

laingc 1653 days ago

It comes from the dimensionality of the hidden state of popular NLP models. The number 768 in particular is the hidden size of BERT-base, the smaller of the two original BERT models (BERT-large uses 1024).

unbanned 1653 days ago

So it's arbitrary then
