I found that grep actually outperformed vector search for many queries. The only thing I was missing was when I didn't know how exactly to phrase something (the exact keyword to use).
Do keyword search systems have workarounds for this? My own idea was for each keyword to generate a list of neighbor keywords in semantic space. I figured with such a dataset, I'd get something approximating vector search for free.
I made some attempts at that (found neighbors by their proximity in text), but I ended up with a lot of noise (words that often go together without having the same meaning). So I'd probably have to use actual embeddings instead.
More generally, any suggestions for full-text indexing? Elasticsearch seems like overkill. I built my own keyword search in Python (simple tf-idf) which was surprisingly easy. (Long-term project is to have an offline copy of a useful/interesting subset of the internet. Acquiring the datasets is also an open question. Common Crawl is mostly random blogs and forum arguments...)
> The only thing I was missing was when I didn't know how exactly to phrase something (the exact keyword to use).
I think that's the only things GUI (or TUI) directories have over CLI. I remember having Wikipedia locally (english texts, back in 2010) and the portals were surprisingly useful. They act like the semantic space in case you can't find an article for your exact word. So Literature > Fiction > Fantasy > Epic Fantasy will probably land you somewhere close to "The Lord of The Rings".
Do you know of any way to build a fast index you can run grep against? Would love to have something as instantaneous as "Everything" on windows for full text on Linux so I can just dump everything in a directory
As others have said, ripgrep et al are faster than regular grep. You would probably also get much faster results with an alias that excludes directories you don't expect results in (I.e. I don't normally grep in /var at all).
I have seen some recommendations for recoll, but I haven't used it so can't comment. Anecdotally, I normally just use ripgrep in my home directory (it's almost always in ~ if I don't remember where it is). It's fast enough as long as my homedir is local (I.e. not on NFS).
The point of vector search is to support semantic search. It makes sense that grep will outperform if you're just looking for verbatim occurrences of a string.
Because they're two different tools for two different tasks. If you expect to always know the exact phrase than, yes, grep will be better. But if you search a semantically similar phrase you will get nothing
You wouldn't use a chatbot for the same query you'd use normal search tools for (and on a side note your answer would be much more useful with an example of what those tools would be, it's not really actionable). A vague natural language question over data whose structure you haven't fully understood using terms that might be inexact is not as likely to provide good results with normal search tools as with an llm based tool.
For some of us, for some types of data, huge amounts of RAM, or even privacy concerns, or even the occasional hallucinated answer, is an easier pill to swallow.
A recent example, maybe not the best example but recent, was the query "What do the three headed dog from the Harry Potter books and the cat from Alien have in common"
Do keyword search systems have workarounds for this? My own idea was for each keyword to generate a list of neighbor keywords in semantic space. I figured with such a dataset, I'd get something approximating vector search for free.
I made some attempts at that (found neighbors by their proximity in text), but I ended up with a lot of noise (words that often go together without having the same meaning). So I'd probably have to use actual embeddings instead.
More generally, any suggestions for full-text indexing? Elasticsearch seems like overkill. I built my own keyword search in Python (simple tf-idf) which was surprisingly easy. (Long-term project is to have an offline copy of a useful/interesting subset of the internet. Acquiring the datasets is also an open question. Common Crawl is mostly random blogs and forum arguments...)