| HN Mirror

Sorry.. I forgot to respond to the last part of your message:

> I don't think it's an obvious decision

Certainly, which is why I even bother with discussions like this, in the hopes of making the decision clearer (if not obvious) to me in the future.

A response to my footnoted (in my other comment) comment pointed out how oversimplified my understanding of distributed databases was. Well, I knew it was an oversimplification, but not in which way.

There's plenty of computer science research from the 70s and 80s covering these topics, but it's both tough to translate into practical considerations and woefully out of date (e.g. it doesn't account for SSDs or cheap commodity hardware).

> But I think there are a lot of arguments for using something like Hadoop even before it's strictly necessary.

Well, philosophically, I would disagree with such an assertion on the grounds of premature optimization, absent the "strictly".

I would advocate for switching from scaling "up" (aka "vertically", larger single machines) to scaling "out" (aka "horizontally" or distributed) around the point of cost parity, not at the point it is no longer possible to scale up a single machine (unless that point can reasonably be expected to occur first, I suppose).
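
To make the cost-parity rule a bit more concrete, here's a rough sketch in Python. Everything in it is an assumption for illustration: the function names, the shape of the two cost curves, and all the numbers are made up, not real hardware or hosting quotes. The idea is just to walk up a list of load levels and pick the first one where scaling out costs no more than scaling up, or where no single machine is available at all.

```python
from typing import Callable, Optional, Sequence


def switch_point(
    loads: Sequence[int],
    up_cost: Callable[[int], Optional[float]],
    out_cost: Callable[[int], float],
) -> Optional[int]:
    """Return the first load at which scaling out becomes the sensible choice.

    up_cost returns None when no single machine can handle the load.
    Returns None if scaling up stays cheaper for every load given.
    """
    for load in loads:
        single = up_cost(load)
        # Switch at cost parity, or when scaling up is no longer possible,
        # whichever comes first.
        if single is None or out_cost(load) <= single:
            return load
    return None


def example_up_cost(load: int) -> Optional[float]:
    # Hypothetical curve: bigger single machines get disproportionately
    # expensive, and nothing above 64 units of load can be bought at all.
    if load > 64:
        return None
    return 100.0 * load ** 1.5


def example_out_cost(load: int) -> float:
    # Hypothetical curve: a fixed overhead for running a cluster at all
    # (coordination, operations) plus a flat price per unit of load.
    return 10000.0 + 500.0 * load


if __name__ == "__main__":
    loads = [1, 2, 4, 8, 16, 32, 64, 128]
    print(switch_point(loads, example_up_cost, example_out_cost))
```

With these placeholder curves the single machine stays cheaper through 32 units and the crossover lands at 64, before the "can't buy a bigger box" limit is hit. Real curves would of course have to come from actual quotes and include operating costs.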

> I think part of the disconnect is that we have different backgrounds, so we both look at different things and think "oh, that's easy" vs "oh, thats like a pain."

That would account for any overestimation of how difficult it is to work with hardware or how complex Hadoop is to set up, administer, or use. Those are just initial conditions and may well unduly influence decision making that has far longer-lasting consequences.

However, I'd like to think I'm not often guilty of the latter overestimation when discussing solutions (and even advocating single-server), as I tend to assume that it can at least become easy enough for anyone out there, so long as the technology is popular enough (like Hadoop) or traditional/mature enough (like the tools in the original comment, or PBS) that plenty of documentation and/or experts exist.

My background also includes having seen, first-hand, over decades, various attempts at distributed processing and databases in practice, with varying degrees of success. This has included early "universal" filesystems like AFS, "sharding" MySQL to give it "web scale" performance [1], Glustre and its ilk, some NoSQLs, and of course Hadoop.

If anything, I'd say that, with most popular, new technologies, especially ones predicated on performance or scale, "it's a pain" is not the knee-jerk skeptical reaction my experience has ingrained in me. Rather, it's more like "sure, it's easy now, but you'll pay." TANSTAAFL.

[1] This worked well enough, but it had a high up-front engineering cost and a high ongoing operating cost: a large number of medium-small servers, plus app servers that were larger than otherwise needed so they could do DB operations that could no longer be done inside the database, because each one had incomplete data. Due to effort overestimation, it was unthinkable to move from a VPS to a colo so as to get a medium-large single DB server with enough attached storage to break the "web scale" I/O bottleneck for years to come.