by akoumjian 1342 days ago
Broad full text search is incredibly hard to do well. I've had to build, maintain, and improve multiple search systems. The difficulty depends largely on the kind of search problem you are solving. A lot of the complexity comes down to your answers to questions like these:

  - Does the searcher already know the result they are looking for? (If yes, much easier)
  - Are there subjective and objective qualities of the results which should alter the search score, sometimes separate from the text being indexed? (If yes, much harder)
  - What is the quality of the text being indexed? (If end-user provided, this will vary widely)

Ultimately, building good search is often a struggle to provide the best possible results despite the gap between searcher intent and incomplete document evaluation criteria. People rarely notice when search is working really well, but they definitely notice, and complain, when it's working poorly.
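
To make the second question above concrete, here is a minimal sketch of a score that mixes text relevance with a separate quality signal. Everything in it is an assumption for illustration: the `Doc` fields, the [0, 1] normalization, the linear blend, and the `quality_weight` default are hypothetical, not something any particular search engine prescribes.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    text_score: float  # relevance from the text index, assumed normalized to [0, 1]
    quality: float     # separate signal (e.g. ratings), assumed normalized to [0, 1]


def blended_score(doc: Doc, quality_weight: float = 0.3) -> float:
    """Linear blend of text relevance and a non-text quality signal (hypothetical)."""
    return (1.0 - quality_weight) * doc.text_score + quality_weight * doc.quality


def rank(docs: list[Doc], quality_weight: float = 0.3) -> list[Doc]:
    """Sort documents from highest to lowest blended score."""
    return sorted(docs, key=lambda d: blended_score(d, quality_weight), reverse=True)


docs = [
    Doc("keyword-stuffed page", text_score=0.9, quality=0.1),
    Doc("well-reviewed page", text_score=0.7, quality=0.9),
]
ranked = rank(docs)
```

With these made-up numbers the well-reviewed page ranks first even though its text match is weaker; with `quality_weight=0.0` the order flips. Choosing that weight, and deciding which signals deserve one, is where much of the difficulty lives.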

How do you classify the difference (in problem/scope, etc) of type ahead versus full blown search? It feels like these systems can be grown almost completely differently -- you could hack together completely in-browser "search" with datalists[0] and just prune it actively (and fool most users, depending on how varied searches were).

I do wonder how much deep search really matters when people only really expect to look at the first page.

[0]: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/da...

"Type ahead" or "autocomplete" is absolutely a different type of problem, and often simpler. This generally falls into the use case where the searcher already knows the specific item they are looking for. Often the results are objects owned by or known to the user in question, or you are searching through a very limited and relatively static set of documents and topics. Reference documentation for software often falls into this category.

In my experience, you don't have to spend a lot of time thinking about scoring and relevancy for these types of search. Generally you only want to allow a small edit distance when matching, just enough to handle misspellings.
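
As a rough illustration of that idea, here is a small type-ahead sketch that compares the query against the same-length prefix of each item and keeps anything within a small Levenshtein distance. The function names, the `max_edits` and `limit` parameters, and the sample items are all hypothetical. Comparing against a fixed-length prefix is a simplification that handles substituted characters better than dropped or extra ones.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suggest(query: str, items: list[str], max_edits: int = 1, limit: int = 5) -> list[str]:
    """Return items whose prefix is within max_edits of the query, closest first."""
    q = query.lower()
    scored = []
    for item in items:
        prefix = item.lower()[: len(q)]  # same-length prefix of the candidate
        distance = edit_distance(q, prefix)
        if distance <= max_edits:
            scored.append((distance, item))
    scored.sort()  # by distance, then alphabetically
    return [item for _, item in scored[:limit]]


items = ["datalist", "database", "dataclass", "details", "dialog"]
```

Here `suggest("dat", items)` returns `["database", "dataclass", "datalist", "details"]`, and the misspelled `suggest("dstal", items)` still returns `["datalist"]`. For a small, mostly static set of items, a linear scan like this is often enough.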

This is so vastly different when you have a corpus of millions of documents about an encyclopedia's worth of topics.

> I do wonder how much deep search really matters when people only really expect to look at the first page.

Getting the first page to have the best quality and relevancy is much more difficult if the user is searching through something like scientific papers or stock video footage. The challenge is bridging the distance between ideas and expectations.