
by drdeca 698 days ago

Very cool!

Do I understand correctly that this works by splitting each line into words, and using the embedding for each word?

I wonder whether it might be feasible to search by the semantics of longer sequences of text, using some language model (one of the smaller ones, like GPT-2 small or something?), so that if you were searching for “die”, then “kick the bucket” and “buy the farm” could also match somehow. Though I’m not sure what vector you would take the dot product with when there is a sequence of tokens, each with associated key vectors for each head at each layer, rather than a single vector associated with a word. Maybe one of the encoder-decoder models rather than the decoder-only models?

Though, for things like grep, one probably wants things to be very fast and as lightweight as feasible, which I imagine is much more the case with word vectors (as you have here) than it would be using a whole transformer model to produce the vectors.

Maybe if one wanted to catch words that aren’t separated correctly, one could detect if the line isn’t comprised of well-separated words, and if so, find all words that appear as a substring of that line? Though maybe that would be too slow?
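A rough sketch of that substring idea, in case it helps make the cost concrete. The function name `words_in_line`, the `min_len`/`max_len` bounds, and the tiny vocabulary are all made up for illustration; a real tool would check against the embedding model's vocabulary. It tries every substring within the length bounds, so the work grows with line length times `max_len`, which is probably where the "too slow" worry comes from.

```python
def words_in_line(line, vocabulary, min_len=3, max_len=20):
    """Return every vocabulary word that occurs as a substring of `line`.

    Hypothetical helper: brute-force scan over all substrings whose
    length is between min_len and max_len (inclusive).
    """
    lowered = line.lower()
    found = []
    for start in range(len(lowered)):
        # Longest substring we are willing to try from this position.
        last_end = min(len(lowered), start + max_len)
        for end in range(start + min_len, last_end + 1):
            candidate = lowered[start:end]
            if candidate in vocabulary:
                found.append(candidate)
    return found


# Toy vocabulary; overlapping words ("buck" inside "bucket") are both reported.
vocab = {"kick", "the", "bucket", "buck"}
print(words_in_line("kickthebucket", vocab))
```

Overlapping hits are kept on purpose here; picking one segmentation of the line would be a separate step.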

1 comments

throwawaydummy 698 days ago

I wanna meet the person who greps die, kick the bucket and buy the farm lol

Are models like Mistral there yet in terms of tokens-per-second generation speed to run a grep over millions of files?


ignoramous 698 days ago

Mistral has published large language models, not embedding models? sgrep uses Google's Word2Vec to generate embeddings of the corpus and perform similarity searches on it, given a user query.
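A minimal sketch of what word-level similarity search looks like in general, not sgrep's actual code. The three-dimensional `TOY_VECTORS` are invented numbers standing in for real Word2Vec embeddings, and `matching_lines` and its `threshold` are hypothetical names. A line matches if any of its words is close enough to the query word by cosine similarity.

```python
import math

# Made-up 3-d vectors for illustration only; real Word2Vec vectors
# have hundreds of dimensions and come from a trained model.
TOY_VECTORS = {
    "die": [0.9, 0.1, 0.0],
    "death": [0.8, 0.2, 0.1],
    "farm": [0.1, 0.9, 0.2],
    "bucket": [0.0, 0.3, 0.9],
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def matching_lines(query, lines, vectors, threshold=0.9):
    """Return lines containing at least one word similar to `query`."""
    query_vec = vectors.get(query)
    if query_vec is None:
        return []  # query word is not in the vocabulary
    hits = []
    for line in lines:
        # Words without a vector are skipped; best stays 0.0 if none remain.
        best = max(
            (cosine(query_vec, vectors[w]) for w in line.lower().split() if w in vectors),
            default=0.0,
        )
        if best >= threshold:
            hits.append(line)
    return hits


lines = ["a sudden death", "buy the farm", "kick the bucket"]
print(matching_lines("die", lines, TOY_VECTORS))
```

With these toy numbers only the line containing "death" matches; the idioms do not, since each of their words is compared to the query on its own.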


throwawaydummy 698 days ago

No, I got that. I asked because wouldn’t embeddings generated by fine-tuned transformer-based LLMs be more context-aware? Idk much about the internals, so apologies if this was a dumb thing to say.


ignoramous 697 days ago

Embeddings come in handy to augment LLMs [0], but as you suspect, some people try LLMs themselves as outright embedding models, with varying degrees of success: https://www.reddit.com/r/LocalLLaMA/comments/12y3stx/embeddi... / https://huggingface.co/spaces/mteb/leaderboard

[0] https://simonwillison.net/2023/Oct/23/embeddings/
