by tgamba 2997 days ago
I worked at Alexa, then affiliated with the Internet Archive, on exactly this, in the late 90s. We built our own server farm to crawl, process and store the data. We had 30TB of storage, holding three "snapshots" of the web, and thought we were pretty hot stuff. That would sit on your desktop today.

Crawling was the easy part. We had two processes of up to 40 threads each bringing the data down. Even this we had to throttle because we would use the bandwidth for the entire office, then based in the Presidio.
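
To make the throttling idea concrete, here is a minimal sketch of a bandwidth cap shared by many downloader threads. It is a generic token bucket and not a reconstruction of Alexa's crawler; the class name `TokenBucket`, its parameters, and the injectable `clock` are all assumptions made for illustration.

```python
import threading
import time


class TokenBucket:
    """Hypothetical shared byte budget for crawler threads (illustrative only)."""

    def __init__(self, rate_bytes_per_sec, capacity_bytes, clock=time.monotonic):
        self.rate = rate_bytes_per_sec    # refill speed, in bytes per second
        self.capacity = capacity_bytes    # largest burst allowed
        self.tokens = capacity_bytes      # start with a full bucket
        self.clock = clock                # injectable so the logic can be tested
        self.last = clock()
        self.lock = threading.Lock()      # one bucket is shared by all threads

    def try_consume(self, nbytes):
        """Return True if nbytes may be downloaded now, False if over budget."""
        with self.lock:
            now = self.clock()
            # Refill in proportion to elapsed time, never above capacity.
            refill = (now - self.last) * self.rate
            self.tokens = min(self.capacity, self.tokens + refill)
            self.last = now
            if nbytes <= self.tokens:
                self.tokens -= nbytes
                return True
            return False
```

Each worker thread would call `try_consume` before reading the next chunk and sleep briefly when it returns False, so total throughput stays near the configured rate no matter how many threads are running.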

Processing the data was the bottleneck. Parsing, extracting and pushing to the database took months sometimes and the system broke down frequently. I was online 24/7 maintaining this system and it put me off working for startups forever.

All of the software, from the crawler to the parsers to the database system, was built in-house; there was nothing out there to handle data at that scale at the time.

Our biggest concerns at that time were getting the cleanest data possible without duplicate pages, and being able to retrieve that data as fast as possible for real-time analysis. The engineers at Alexa produced some remarkable solutions to these problems.
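
For readers wondering what duplicate removal looks like at its simplest, here is a sketch that drops exact duplicates by hashing normalized page content. It is not what Alexa built, and it only catches pages that are identical after whitespace and case normalization; the function names `content_fingerprint` and `drop_duplicates` are assumptions made for illustration.

```python
import hashlib
import re


def content_fingerprint(html: str) -> str:
    """Hash a page after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", html).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(pages):
    """Yield (url, html) pairs whose fingerprint has not been seen before."""
    seen = set()
    for url, html in pages:
        fingerprint = content_fingerprint(html)
        if fingerprint not in seen:
            seen.add(fingerprint)
            yield url, html
```

Near-duplicates (the same article with a different ad or timestamp) need a fuzzier technique such as shingling, which this sketch does not attempt.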

Alexa's plugin gave us real-time information on what people were actually looking at, and combining that with the crawl data, we could have built PageRank. Alexa could have been Google, but went in another direction. We were acquired by Amazon in 1999.

To do this today would be an entirely different problem. Given the dynamic nature of the web, single-page apps, and the orders-of-magnitude growth in scale, only the largest companies could begin from scratch with it.

However, you could build a simple system at home that could probably yield a few billion pages, process those, get user logs from some big routing point, and build a mini-Google.

2 comments

“Alexa could have been Google, but went in another direction.”

Curious, why did they go in another direction?

Amazon's stated reason for acquiring Alexa was to use Alexa's technology to build a recommendation engine. Search was never a priority for Alexa itself. We were acquired for $100 million, so it was take the money and run.
Impressive, thanks for the wonderful read.