by jsrfded 5702 days ago
We have our own crawl/index/serve technology end-to-end: a 3 billion page web crawl, a machine-learning-trained ranker, and the slashtag vertical features. BOSS gives us an additional 20-40B pages for very long-tail queries, so we fall back to /yahoo if we don't have any results of our own.

We're auto-firing slashtags for certain regular queries now, e.g. [cure for headaches] will auto-fire /health, [industrial design colleges] will auto-fire /colleges. We're doing this initially for health, lyrics, colleges, autos, hotels, recipes, and personal finance.
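To make the auto-firing idea concrete, here is a minimal sketch. The post doesn't say how queries get matched to slashtags, so the substring matching and the `AUTO_FIRE_TRIGGERS` table below are assumptions made up for illustration, not the actual mechanism.

```python
from typing import Optional

# Hypothetical trigger phrases per slashtag (illustrative only; the real
# system may well use a trained classifier instead of fixed phrases).
AUTO_FIRE_TRIGGERS = {
    "/health": ("cure for", "symptoms of"),
    "/colleges": ("colleges", "universities"),
    "/lyrics": ("lyrics",),
}


def auto_fire_slashtag(query: str) -> Optional[str]:
    """Return the first slashtag whose trigger phrase occurs in the query."""
    normalized = query.lower()
    for slashtag, triggers in AUTO_FIRE_TRIGGERS.items():
        if any(trigger in normalized for trigger in triggers):
            return slashtag
    return None  # no vertical applies; run a plain web search
```

With this table, [cure for headaches] maps to /health and [industrial design colleges] maps to /colleges, matching the two examples above.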

Getting the crap from sites like ehow out of the results and restricting queries in spammy categories to a curated set of high-quality sites really cleans up the results there.

1 comment

Hmm, /lyrics in particular doesn't seem to do what I personally want, though I'm not sure how it could easily be fixed. What I want in order of preference is usually: 1. the official lyrics page, if any; 2. lyrics from a fan site, if one exists; 3. lyrics from one of the big ad-filled lyrics sites, like lyricsmode.com, only as a last resort (I tend to put them in the same category as ehow/etc.).

But it seems the /lyrics slashtag explicitly gives me #3, and actively excludes any results from the #1 or #2 categories that would normally come up.

For example, the ideal result for the search [pearl jam spin the black circle], imo, is the official page, http://pearljam.com/song/spin-black-circle. Without /lyrics this is the #4 result, which is decent. But when I add /lyrics, the official lyrics page gets excluded!
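Roughly, the behavior I'd prefer is a re-ranking by source tier rather than a hard filter. A minimal sketch, assuming the engine already knows which domains are official, fan, or big ad-filled lyrics sites (the domain sets, the function names, and the lyricsmode and fan-site URLs in the usage below are hypothetical):

```python
from urllib.parse import urlparse


def source_tier(url, official, fan, aggregators):
    """Lower tier = more preferred: official, fan site, big lyrics site, other."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    if host in official:
        return 0
    if host in fan:
        return 1
    if host in aggregators:
        return 2
    return 3


def rerank_lyrics(urls, official, fan, aggregators):
    """Stable sort by tier, so the original ranking is kept within each tier."""
    return sorted(urls, key=lambda u: source_tier(u, official, fan, aggregators))


# Usage with assumed domain sets; only pearljam.com's URL is from the example above.
results = [
    "http://www.lyricsmode.com/some-song",        # hypothetical path
    "http://fansite.example/spin-the-black-circle",  # hypothetical fan site
    "http://pearljam.com/song/spin-black-circle",
]
ranked = rerank_lyrics(
    results,
    official={"pearljam.com"},
    fan={"fansite.example"},
    aggregators={"lyricsmode.com"},
)
```

Here the official page comes first and the lyricsmode.com result last, and nothing is dropped from the result set.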