by notJim 5261 days ago
I thought Google's real innovation was their technique of using the interconnectedness of the web to determine the true value of content. So rather than only looking at the content of a page, they also look at the content from incoming links to that page. What package out there implements the algorithms for this, and is well-documented and trivial enough to use that a 14-year-old can understand them?

As far as I can tell, this article says 1) Shucks, hardware sure is cheap these days! and 2) There sure is a lot of software out there that you can mash together! Those things make it easier to start a company, but they don't provide the essential insights that make that company truly revolutionary.

5 comments

I don't think the point is that the breakthrough idea of today is within the means of some real fourteen-year-old. The breakthrough idea of today is something that today's concepts, economics, and best practices are NOT well-suited to handle; otherwise it wouldn't be much of a breakthrough. The amazing thing is how quickly something has gone from the realm of obsessed genius to the realm of the mundane. It goes back to Whitehead's observation that "Civilization advances by extending the number of important operations which we can perform without thinking about them."

"Without thinking" is an exaggeration for some of the items in the post, but consider the problem of storing 200GB of data. "Um... on a hard drive?" "And how will you finance that?" "Gee, maybe with the money in my wallet right now? When do these questions get hard?" Shucks, hardware sure is cheap these days! Problems simply go from being challenges to not requiring any thought at all. The exponential increase in the power of affordable hardware may not be surprising, but to me it seems worth thinking about even though it's been normal and predictable my whole life.

I've said this before, but I'll try to sum it up as succinctly as possible:

Google's innovation was threefold: better search algorithms (PageRank), which did use the implicit data from the interconnectedness of the web to judge the relevancy and rank of search results; revolutionary data center ops (using commodity hardware with heavy reliance on automation); and state-of-the-art software engineering (sharding, MapReduce, etc.). The last two enabled the first to run efficiently on a rather small set of hardware and to scale up speed just by adding more hardware. The end result was better results, delivered faster, and at lower cost to Google.

This led to a much better product for the end users (better/faster) and allowed them to acquire a huge portion of search market share quickly. And the low cost of operations meant that they could better take advantage of advertising (lower cost per search means that even lower revenue per search can be profitable).
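
To make the "sharding, MapReduce" part a bit more concrete, here is a rough single-process sketch in Python of the map / shuffle-by-shard / reduce pattern, used to count incoming links per page. Everything in it (the function names, the three-shard default, the CRC32-based shard choice, the four-page toy graph) is an assumption for illustration only; a real system spreads the shards across many machines and is far more involved than this.

```python
from collections import defaultdict
import zlib


def map_links(page, targets):
    """Map step: emit (target, 1) for every outgoing link on a page."""
    for target in targets:
        yield target, 1


def shard_for(key, num_shards):
    """Pick a shard deterministically from the key (hypothetical scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def run_job(records, num_shards=3):
    # Shuffle step: route each emitted pair to the shard that owns its key.
    shards = [defaultdict(list) for _ in range(num_shards)]
    for page, targets in records:
        for key, value in map_links(page, targets):
            shards[shard_for(key, num_shards)][key].append(value)

    # Reduce step: each shard sums the values for its own keys independently.
    result = {}
    for shard in shards:
        for key, values in shard.items():
            result[key] = sum(values)
    return result


# Toy link graph: (page, pages it links to). Purely made-up data.
records = [("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"]), ("d", ["c"])]
inlink_counts = run_job(records)
```

Because each key lives on exactly one shard, the reduce step for one shard never needs data from another, which is the property that lets this kind of job speed up when more hardware is added.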

What package out there implements the algorithms for this, and is well-documented and trivial enough to use that a 14-year-old can understand them?

Nutch[1].

Nutch doesn't deal with modern web spam particularly well, but I'd say it matched early Google pretty well. Specifically, it implements PageRank, has a reliable web crawler and a web-scale data store.

[1] http://nutch.apache.org/about.html

Wow yeah, that actually looks like it would do the job. There's a part of me now that wants to implement a spam classifier on top of Nutch to see how good of a web crawler I can create… thanks for the link!

even if you had had the same brilliant insights into the graph structure of the web when they did, you most likely would have failed because it was prohibitively expensive (the cost in the article is probably underestimated by orders of magnitude). it's simply a fact that:

1) getting the data, 2) computing the eigenvector of a large matrix, and 3) serving that data to users weren't cheap in 1998. they're comparatively dirt cheap today.
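
For a sense of what step 2 looks like at toy scale, here is a minimal Python sketch of power iteration over a link graph, which approximates the dominant eigenvector that PageRank is built on. The function name, the handling of pages with no outgoing links, the tolerance, and the four-page graph are assumptions made for illustration; the 0.85 damping factor is the commonly cited default. This is not Google's or Nutch's actual code.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=100):
    """Approximate PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        # Rank held by pages with no outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # Every page passes an equal share of its rank to each page it links to.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share

        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break

    return rank


# Made-up four-page web: "c" has the most incoming links.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

On four pages this finishes instantly. The expensive part in 1998 was running the same loop over the link graph of the whole web, after paying to crawl and store it.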

not to diss larry and sergey's impressive achievement - they were brilliant and they pulled it off - but i think back then the game was so costly that a lot of brilliant people never made it to the starting line. it's cool to see that it's become a much more level playing field now. i'm curious what cool stuff we missed out on because of people who didn't make it to the starting line!

Agreed, the notion of PageRank and doing search properly in a time when it wasn't even on the radar is completely missing from this article.

The real message is that servers are cheap, which arrives only after a long, vague buildup and is hardly novel information.