by PaulHoule 1736 days ago

Most search engines are pretty bad because their developers don't do any work to improve relevance.

This methodology works

https://ccc.inaoep.mx/~villasen/bib/AN%20OVERVIEW%20OF%20EVA...

and I used it to tune up the relevance of a search engine for patents to the point where users could immediately perceive that it worked better than other products.

After I worked on that, I wound up talking to the developers and/or marketing people for many enterprise search engines, and few of them, if any, did any kind of formal benchmarking of relevance.

People at one firm told me that they used to go to TREC conferences because they thought it got them visibility, but they decided it didn't, so they quit going.

A message I got repeatedly was that these firms thought that the people who bought the search engines didn't care much about relevance, but they did care about there being 200 or more plug-ins to import data from various sources.

In principle the tuning is unique to the text corpus. One reason is the balancing act between a search engine that prefers small documents (they have spiky vectors that look more like query vectors) and one that prefers large documents (they have so many words that they match everything). Different corpora have different distributions of document sizes, not to mention different distributions of the words that appear.
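
To make the small-versus-large document trade-off concrete, here is a minimal sketch using the standard Okapi BM25 formula. BM25 is my choice for illustration and is not necessarily the ranking function used in the patent search engine; the corpus statistics are hypothetical. The `b` parameter controls how strongly scores are normalized by document length, which is the kind of knob that would need tuning per corpus.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Okapi BM25 contribution of one query term to one document's score."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b = 0 ignores document length; b = 1 normalizes fully by it.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

# Hypothetical corpus statistics, chosen only for illustration.
N_DOCS, DOC_FREQ, AVG_LEN = 10_000, 100, 500

def compare(b):
    """Score a short and a long document for the same term at a given b."""
    short_doc = bm25_term_score(tf=2, doc_len=100, avg_doc_len=AVG_LEN,
                                n_docs=N_DOCS, doc_freq=DOC_FREQ, b=b)
    long_doc = bm25_term_score(tf=10, doc_len=5000, avg_doc_len=AVG_LEN,
                               n_docs=N_DOCS, doc_freq=DOC_FREQ, b=b)
    return short_doc, long_doc

for b in (0.0, 0.75, 1.0):
    short_doc, long_doc = compare(b)
    winner = "short" if short_doc > long_doc else "long"
    print(f"b={b:.2f}  short={short_doc:.3f}  long={long_doc:.3f}  prefers {winner}")
```

With no length normalization the long document wins on raw term count, and with full normalization the short one wins. Where the right setting lies in between depends on the length distribution of the corpus.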

Few organizations are willing to do the work to tune up a search engine (you have to judge the relevance of 10,000+ document hits), but I've had the experience that you can beat the pants off the defaults even using a generic tuning. For instance, that patent search engine was tuned up against the GOV2 corpus instead of a patent corpus. A small patent corpus showed us we were on the right track, however.
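
For a sense of what formal benchmarking involves once the relevance judgments exist, here is a minimal sketch that scores two ranking configurations against the same judgments. It assumes mean average precision (MAP), a metric commonly used in TREC-style evaluation; the linked overview may emphasize other metrics. The query IDs, document IDs, and both runs are made up for illustration.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision for one query given a ranked list and judged-relevant IDs."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant_ids)

def mean_average_precision(runs, judgments):
    """Mean of per-query average precision over all queries in the run."""
    scores = [average_precision(ranking, judgments.get(query, set()))
              for query, ranking in runs.items()]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical relevance judgments: query -> set of relevant document IDs.
judgments = {"q1": {"d1", "d3"}, "q2": {"d5"}}

# Hypothetical ranked results from two configurations of the same engine.
default_run = {"q1": ["d2", "d1", "d3"], "q2": ["d4", "d5"]}
tuned_run = {"q1": ["d1", "d3", "d2"], "q2": ["d5", "d4"]}

map_default = mean_average_precision(default_run, judgments)
map_tuned = mean_average_precision(tuned_run, judgments)
print(f"default MAP={map_default:.4f}  tuned MAP={map_tuned:.4f}")
```

Tuning then amounts to changing ranking parameters, re-running the queries, and keeping the settings that raise the score on the judged set.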