| HN Mirror

Hacker News


	by bastien2 748 days ago
	You don't. You use a full-text indexer and normal search tools. A chatbot is only going to decrease the integrity of query results.

5 comments

andai 748 days ago

I found that grep actually outperformed vector search for many queries. The only thing I was missing was when I didn't know how exactly to phrase something (the exact keyword to use).

Do keyword search systems have workarounds for this? My own idea was for each keyword to generate a list of neighbor keywords in semantic space. I figured with such a dataset, I'd get something approximating vector search for free.

I made some attempts at that (found neighbors by their proximity in text), but I ended up with a lot of noise (words that often go together without having the same meaning). So I'd probably have to use actual embeddings instead.
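For illustration, here is a minimal sketch of the "neighbors by proximity in text" idea, assuming a plain co-occurrence window over tokenized documents. The function names (`cooccurrence_neighbors`, `expand_query`) and the window size are hypothetical, not the commenter's actual code. It also shows the noise problem: frequent words like "the" end up as top neighbors.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cooccurrence_neighbors(docs, window=2, top_k=3):
    """Map each word to the words most often seen within `window` tokens of it."""
    counts = defaultdict(Counter)
    for doc in docs:
        tokens = tokenize(doc)
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            nearby = tokens[start:i] + tokens[i + 1:i + 1 + window]
            for other in nearby:
                if other != word:
                    counts[word][other] += 1
    return {w: [n for n, _ in c.most_common(top_k)] for w, c in counts.items()}


def expand_query(query, neighbors):
    """Return the query terms followed by their neighbor terms, without duplicates."""
    terms = tokenize(query)
    expanded = list(terms)
    for term in terms:
        for neighbor in neighbors.get(term, []):
            if neighbor not in expanded:
                expanded.append(neighbor)
    return expanded


docs = ["the quick car", "the fast car"]
neighbors = cooccurrence_neighbors(docs)
print(expand_query("car", neighbors))  # ['car', 'the', 'quick', 'fast']
```

The top neighbor of "car" here is "the", which is exactly the kind of co-occurrence noise described above. Filtering stop words helps a little, but embeddings would likely be the more robust source of neighbors.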

More generally, any suggestions for full-text indexing? Elasticsearch seems like overkill. I built my own keyword search in Python (simple tf-idf) which was surprisingly easy. (Long-term project is to have an offline copy of a useful/interesting subset of the internet. Acquiring the datasets is also an open question. Common Crawl is mostly random blogs and forum arguments...)
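A simple tf-idf keyword search along these lines might look roughly like the sketch below. This is not the commenter's implementation; the class name `TfIdfIndex`, the tokenizer, and the smoothed idf formula are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TfIdfIndex:
    """Tiny in-memory tf-idf index (illustrative only)."""

    def __init__(self, docs):
        self.docs = docs
        self.term_counts = [Counter(tokenize(d)) for d in docs]
        # Document frequency: in how many documents each term appears.
        df = Counter()
        for counts in self.term_counts:
            df.update(counts.keys())
        n = len(docs)
        # Smoothed inverse document frequency (one common variant).
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def search(self, query, top_k=5):
        """Return up to `top_k` (document, score) pairs, best match first."""
        terms = tokenize(query)
        scored = []
        for i, counts in enumerate(self.term_counts):
            length = sum(counts.values()) or 1
            score = sum((counts[t] / length) * self.idf.get(t, 0.0) for t in terms)
            if score > 0:
                scored.append((score, i))
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [(self.docs[i], score) for score, i in scored[:top_k]]


index = TfIdfIndex([
    "grep searches text quickly",
    "vector search uses embeddings",
    "cooking pasta at home",
])
for doc, score in index.search("embeddings search"):
    print(f"{score:.3f}  {doc}")
```

Note that exact-token matching means "search" does not match "searches" here, which is the phrasing problem mentioned earlier in this comment; stemming would be one possible workaround.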

skydhash 748 days ago

> The only thing I was missing was when I didn't know how exactly to phrase something (the exact keyword to use).

I think that's the only thing GUI (or TUI) directories have over CLI. I remember having Wikipedia locally (English text, back in 2010) and the portals were surprisingly useful. They act like the semantic space in case you can't find an article for your exact word. So Literature > Fiction > Fantasy > Epic Fantasy will probably land you somewhere close to "The Lord of The Rings".

ravetcofx 748 days ago

Do you know of any way to build a fast index you can run grep against? Would love to have something as instantaneous as "Everything" on Windows for full text on Linux so I can just dump everything in a directory.

semi-extrinsic 748 days ago

Have you tried the more modern solutions like ripgrep, ack, etc.?

Or for something more comprehensive (to also search PDF, docx, etc.) there is ripgrep-all:

https://github.com/phiresky/ripgrep-all

everforward 747 days ago

As others have said, ripgrep et al. are faster than regular grep. You would probably also get much faster results with an alias that excludes directories you don't expect results in (i.e. I don't normally grep in /var at all).

I have seen some recommendations for recoll, but I haven't used it so can't comment. Anecdotally, I normally just use ripgrep in my home directory (it's almost always in ~ if I don't remember where it is). It's fast enough as long as my homedir is local (i.e. not on NFS).

jononor 748 days ago

Tracker is an open source project for that. It has been around for some 10+ years now. https://tracker.gnome.org/overview/

haiku2077 747 days ago

Try ripgrep.

j0hnyl 748 days ago

The point of vector search is to support semantic search. It makes sense that grep will outperform if you're just looking for verbatim occurrences of a string.

3abiton 747 days ago

A combination of both could help!
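One possible way to combine the two is to run both searches and merge their ranked results. The sketch below uses reciprocal rank fusion, which is my choice of technique for illustration and not something named in this thread. The document IDs and the two input rankings are made up, and `k=60` is just a commonly used default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by both searches rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by document ID for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["a", "b", "c"]  # hypothetical grep / tf-idf ranking
vector_hits = ["b", "d", "a"]   # hypothetical embedding ranking
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))  # ['b', 'a', 'd', 'c']
```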

SkyPuncher 748 days ago

Most developers are going to outperform vector search. We “get” how computers do lookups so we build our queries appropriately.

Vector search is amazing for using layman concepts.

yreg 748 days ago

> decrease the integrity of query results

What does that even mean? When you know the exact keywords, you use full-text search.

When you don't know them then other tools can be helpful.

eviks 748 days ago

It means you'd use the same tool since it's more convenient, and get worse results in one tool vs. the other.

Capricorn2481 748 days ago

Because they're two different tools for two different tasks. If you expect to always know the exact phrase then, yes, grep will be better. But if you search for a semantically similar phrase, you will get nothing.

vikramkr 748 days ago

You wouldn't use a chatbot for the same query you'd use normal search tools for (and on a side note, your answer would be much more useful with an example of what those tools would be; it's not really actionable as is). A vague natural-language question over data whose structure you haven't fully understood, using terms that might be inexact, is not as likely to provide good results with normal search tools as with an LLM-based tool.

skydhash 748 days ago

> your answer would be much more useful with an example of what those tools would be

Paperless, DevonThink, even Calibre (the ebook manager) can do it.

You only need a day or two to categorize the documents. No need for huge amounts of RAM, or privacy concerns, or hallucinated answers.

dotancohen 748 days ago

> You only need a day or two

For some of us, for some types of data, huge amounts of RAM, or even privacy concerns, or even the occasional hallucinated answer, is an easier pill to swallow.

A recent example, maybe not the best example but recent, was the query "What do the three-headed dog from the Harry Potter books and the cat from Alien have in common?"

brudgers 748 days ago

They are fictional.

xeromal 748 days ago

I never want to categorize stuff. I want it done for me.

ajsnigrutin 748 days ago

Another (ugly but works nice): https://www.recoll.org/pics/index.html

Open source, local, yada yada, almost zero configuration (just add folders, run indexer, wait).

rahimnathwani 748 days ago

Paperless-ngx set up using docker compose is good for this use case.

barrenko 744 days ago

Hi bastien,

Could you expand on the answer? Thanks!