by deusu 3796 days ago
The crawler works from a single IP. The User-Agent is fixed to the robot's UA. Cookies are ignored entirely. The search engine runs on just 2 servers: one for crawling/indexing and one for the webserver/queries. The crawler is a root server with a 1 Gbit/s connection hosted in a datacenter. The webserver sits here at home with 200 Mbit/s downstream and 20 Mbit/s upstream.
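For anyone curious what "fixed UA, no cookies" looks like in practice, here is a rough sketch using only Python's standard library. It is not the actual crawler code; the `USER_AGENT` string and the `build_request`/`fetch` names are made up for illustration.

```python
import urllib.request

# Hypothetical UA string; the real robot's UA is not given in this thread.
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot)"


def build_request(url: str) -> urllib.request.Request:
    """Build a GET request that always carries the same User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Fetch a page without any cookie handling.

    The default opener has no HTTPCookieProcessor installed, so
    Set-Cookie headers from the server are never stored or sent back.
    """
    opener = urllib.request.build_opener()
    with opener.open(build_request(url), timeout=timeout) as response:
        return response.read()
```

Because no cookie jar is attached, every request looks like a first visit, which matches the "cookies are ignored" behaviour described above.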

I use the Alexa top-1-million sites as the seed list for the crawler. The errors that do appear during the crawl are either sites that have an outage or, more likely, simply dead links. Oh, and URLs that turn out to be blocked by robots.txt. There are a lot of sites out there which block anything but Google and Bing from crawling them.
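To illustrate the "everything but Google and Bing" pattern, here is a small sketch with Python's `urllib.robotparser`. The robots.txt content, the `ExampleBot` name, and the `is_allowed` helper are invented for the example and are not taken from any real site or from the crawler itself.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt of the kind described: two named bots are
# allowed everywhere, every other robot is blocked from the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("Googlebot", "bingbot", "ExampleBot"):
        print(bot, is_allowed(ROBOTS_TXT, bot, "https://example.com/page"))
```

A crawler that honours robots.txt would log such a URL as blocked and move on, which is how these end up counted among the crawl errors.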

Cloudflare is not an option for me. It would let Cloudflare know what my users are searching for. VERY big no-no. :)

I can filter out 99% of automated queries. Luckily they are still pretty dumb at the moment and give me enough fixed clues to identify them.

I like your idea of keeping the API free with a very low request-rate. That could work. I would have to find a way that they can't just generate many API-keys though. Using captchas for API-key requests won't stop them from doing that.
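A very low per-key request rate could be sketched as a token bucket per API key, roughly like this. The class name, parameters, and numbers are assumptions for illustration only, and this does nothing about the many-keys problem mentioned above: it only limits what a single key can do.

```python
import time


class ApiRateLimiter:
    """Token bucket per API key (a sketch, not production code)."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_sec   # tokens refilled per second
        self.burst = burst         # maximum tokens a key can hold
        self.clock = clock         # injectable so the limiter can be tested
        self._buckets = {}         # api_key -> (tokens, last_seen_time)

    def allow(self, api_key: str) -> bool:
        """Return True and spend one token if the key may make a request."""
        now = self.clock()
        tokens, last = self._buckets.get(api_key, (float(self.burst), now))
        # Refill according to the time elapsed, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[api_key] = (tokens - 1.0, now)
            return True
        self._buckets[api_key] = (tokens, now)
        return False
```

For example, `ApiRateLimiter(rate_per_sec=1 / 60, burst=5)` would let each key make 5 requests at once and then about one per minute.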

I posted a "Show HN" about a year ago. It brought in about 1500 extra visitors that day and got up to 9th place on the HN homepage.

The new web design is already done. I have a German site too, https://deusu.de, which actually gets 90% of the traffic. That site already has the new design.

1 comment

1Gbps for the crawler totally explains it: I can see that doing 600 URLs/sec. xD
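Back-of-the-envelope version of that, assuming the link is fully saturated and ignoring protocol overhead (both are simplifications, and the function name is just for illustration):

```python
def avg_bytes_per_url(link_bits_per_sec: float, urls_per_sec: float) -> float:
    """Average bytes available per URL if the link is fully used."""
    bytes_per_sec = link_bits_per_sec / 8
    return bytes_per_sec / urls_per_sec


if __name__ == "__main__":
    budget = avg_bytes_per_url(1e9, 600)
    print(f"{budget / 1000:.0f} kB per URL")
```

So at 1 Gbit/s and 600 URLs/sec there is room for roughly 208 kB per URL on average.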

I didn't think of CloudFlare being able to see the traffic... and wow, I never even processed that aspect of their service. But of course...

How good is Google's "[ ] I'm not a robot" checkbox thingy at weeding out bots? And perhaps you could use multiple captcha systems...? (Or are actual people tasked to do signups?!)

I shudder to think of such an idea, but linking API keys to <popular login-with/connect-with-this-site API> may be an alternative. (One thing that comes to mind: if someone authenticates using Reddit, which they can do without releasing any account info, you could check their (public, but unfakeable) karma counts and use that as a measure of confidence, in addition to the standard account age metric used everywhere.)
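A rough sketch of what such a confidence measure might look like, assuming the karma and account age have already been obtained somehow. The caps (1000 karma, 365 days) and the "weaker signal wins" rule are arbitrary choices for illustration, not anything Reddit or anyone else prescribes.

```python
def confidence(karma: int, account_age_days: int,
               karma_cap: int = 1000, age_cap_days: int = 365) -> float:
    """Return a score in [0, 1] from karma and account age.

    Each signal is clamped to its cap and scaled to [0, 1]; the weaker
    of the two decides, so a brand-new account with lots of karma (or
    an old account with none) still scores low.
    """
    karma_part = min(max(karma, 0), karma_cap) / karma_cap
    age_part = min(max(account_age_days, 0), age_cap_days) / age_cap_days
    return min(karma_part, age_part)
```

The score could then decide whether an API key is issued at all, or how generous its request rate is.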

The new design is nice :D

And if it's been a year (!), another Show HN sometime would certainly be fine.