by userbinator 3007 days ago
Don't forget(!) Google's horrible forgetfulness, and its way of banning you if you try hard enough to extract anything useful from it:

https://news.ycombinator.com/item?id=16153840

For me, the niche is repair information and in particular, identifying IC part numbers and finding datasheets. Searching "service manual" now invariably brings up useless user's manuals, and searching too many times for IC part numbers gets you CAPTCHA-banned.

(Somewhat understandably, part numbers tend to look like semirandom bot-queries, but it's still a horrible experience to be called a bot just because you're actually after more information than the average user.)

Keyword-based would be a great step forward(!), but something like "grep for the Web" would be ideal. I remember many decades ago learning how to use boolean operators and such, since nearly all search engines of the time provided such functionality. Now the mainstream ones, which have a big enough index to be effective, have also removed much of that functionality and try very hard to keep you from using what remains. For another example, try using "site:" searches multiple times with Google --- another way to get rapidly banned.

2 comments

When you find domains that contain useful information, crawl and index them manually.
Indeed, the best solution.
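
For what it's worth, a rough sketch of what "index them manually" could look like once the pages are already saved locally. This is only an illustration, not anyone's actual tool: the `TinyIndex` class, its method names, and the example.com URLs are all made up, and fetching/crawling is left out entirely. It gives you boolean AND over tokens plus a grep-style regex search over the stored text.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index over pages you have already fetched yourself (hypothetical)."""

    def __init__(self):
        self.pages = {}                   # url -> raw page text
        self.postings = defaultdict(set)  # lowercase token -> set of urls

    def add(self, url, text):
        """Store the text and record every alphanumeric token it contains."""
        self.pages[url] = text
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[token].add(url)

    def search_all(self, *terms):
        """Boolean AND: sorted urls whose text contains every term as a token."""
        sets = [self.postings.get(t.lower(), set()) for t in terms]
        return sorted(set.intersection(*sets)) if sets else []

    def grep(self, pattern):
        """Case-insensitive regex search over the stored text, like grep on local files."""
        rx = re.compile(pattern, re.IGNORECASE)
        return sorted(url for url, text in self.pages.items() if rx.search(text))


# Placeholder data; in practice the text would come from your own crawl.
index = TinyIndex()
index.add("https://example.com/a", "LM317 adjustable regulator datasheet")
index.add("https://example.com/b", "LM7805 service manual")
print(index.search_all("lm317", "datasheet"))
print(index.grep(r"LM78\d\d"))
```

The regex pass scans every stored page, so it only makes sense for a small personal index; the token lookup is what you would lean on as the collection grows.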

Interestingly enough, I can find web crawling as a service and search engine as a service separately, but not both together?

You just described the Bing/Yahoo BOSS API
All right, I forgot those.

However, they are quite pricey. Maybe a solution you can host yourself would be a nicer alternative.

> something like "grep for the Web" would be ideal

A couple of these (e.g., Blekko) popped up 5-10 years ago. I don't think any made it far.

Some of them got bought, like Blekko.