by papaf 5951 days ago
Two things surprise me about this article - probably because I've misunderstood it and don't see the big picture.

One is that there are master and slave databases and searches are done off the master - I've always seen them done off the slaves in other systems. The other is that they state that using MD5 doesn't allow for horizontal scaling. One of the qualities of MD5 is that all bits have an equal probability of being 0/1. Surely the last 1 or 2 bits can be used to indicate which server is holding the data?

2 comments

Searches are likely done off slaves - I suspect that is not presented properly because of the oversimplification of the diagram.

You can just use a few bits from an MD5 hash to pick the server, as long as you know up front how many servers you're going to have. The problem is that if you later want to add or remove a server, you need to come up with a new scheme and move most of the data around so that it ends up on the right server (which would take days/weeks).
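To put a rough number on that, here is a minimal sketch, assuming the server is picked as the MD5 value modulo the server count (with 4 servers that is exactly the last 2 bits of the hash). The key names and the helpers `shard_for` and `fraction_moved` are made up for illustration, not taken from the article.

```python
import hashlib

def shard_for(key: str, num_servers: int) -> int:
    """Pick a server index from the MD5 of the key (hypothetical scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # For a power-of-two server count this is the same as taking the low bits.
    return int.from_bytes(digest, "big") % num_servers

def fraction_moved(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose server changes when the server count changes."""
    moved = sum(1 for k in keys if shard_for(k, old_n) != shard_for(k, new_n))
    return moved / len(keys)

# Made-up keys, purely for illustration.
keys = [f"user:{i}" for i in range(10_000)]
moved = fraction_moved(keys, 4, 5)
print(f"4 -> 5 servers moves {moved:.0%} of the keys")
```

Going from 4 to 5 servers this way should relocate roughly 4 out of 5 keys, since a key only stays put when its hash gives the same remainder for both server counts.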

The more scalable/flexible solution is to use a consistent hashing algorithm (check out some of the papers on Chord) so that adding or removing a server doesn't require you to move as much data around.
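For comparison, a minimal sketch of the ring idea behind consistent hashing. This is not Chord itself (no routing or finger tables), just the key placement part. The `HashRing` class, the server names and the choice of 100 points per server are all assumptions for illustration.

```python
import bisect
import hashlib

def ring_point(value: str) -> int:
    """Map a string to a position on the ring using MD5."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")

class HashRing:
    """Toy consistent-hash ring; each server owns several points on the ring."""

    def __init__(self, servers, replicas: int = 100):
        self._replicas = replicas
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> server name
        for server in servers:
            self.add(server)

    def add(self, server: str) -> None:
        for i in range(self._replicas):
            point = ring_point(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def lookup(self, key: str) -> str:
        # A key belongs to the first server point clockwise from its position,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._points, ring_point(key)) % len(self._points)
        return self._owner[self._points[idx]]

# Made-up keys and server names, purely for illustration.
keys = [f"user:{i}" for i in range(10_000)]
ring = HashRing(["db1", "db2", "db3", "db4"])
before = {k: ring.lookup(k) for k in keys}

ring.add("db5")
after = {k: ring.lookup(k) for k in keys}

changed = [k for k in keys if before[k] != after[k]]
print(f"adding db5 moved {len(changed) / len(keys):.0%} of the keys")
```

Here only the keys that land on the new server move, so adding a fifth server should relocate somewhere around a fifth of the data instead of most of it. Everything else stays where it was.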

The search machine is its own database. It feeds its data from the masters for consistency, but the searches themselves run against the search database.

I think the MD5 thing was covered well below.