|
Very cool! Do I understand correctly that this works by splitting each line into words, and using the embedding for each word? I wonder whether it might be feasible to search by semantics of longer sequences of text, using some language model (like, one of the smaller ones, like GPT2-small or something?). Like, so that if you were searching for “die”, then “kick the bucket” and “buy the farm”, could also match somehow? Though, I’m not sure what vector you would use to do the dot product with, when there is a sequence of tokens, each with associated key vectors for each head at each layer, rather than a single vector associated with a word..
Maybe one of the encoder-decoder models rather than the decoder only models? Though, for things like grep, one probably wants things to be very fast and as lightweight as feasible, which I imagine is much more the case with word vectors (as you have here) than it would be using a whole transformer model to produce the vectors. Maybe if one wanted to catch words that aren’t separated correctly, one could detect if the line isn’t comprised of well-separated words, and if so, find all words that appear as a substring of that line? Though maybe that would be too slow? |
Are models like mistral there yet in terms of token per second generation to run a grep over millions of files?