by lukegb 4042 days ago
Inspired by https://twitter.com/garybernhardt/status/600783770925420546
2 comments

Can someone explain in a bit more detail what this is about? Is the 'joke' that running computations on data in RAM is faster than... what? Reading it from disk?
The subtext is that running a fancy distributed system is more exciting and beneficial for one's resume than simply buying a massive bloody server and putting Postgres on it, and that people are making tech decisions on this basis.
This of course ignores that it's much easier to get your hands on a cluster of average machines than one massive bloody server, and all the non-performance-oriented benefits of running a cluster (availability etc.).

It's much easier to request that a client provision 20 of their standard machines, or to get them from AWS. People don't like custom hardware, and for good reason.

Yes. Just yesterday in some other story everyone was arguing for the cloud because who wants to maintain their own hardware? This morning the hue and cry is "slap more RAM in that puppy".

I just specced out a 6TB Dell server. Price? It is already at $600K, and I haven't fully specced it out yet (just processor, memory, drive). Maybe that memory requirement is high (though it is about what I would need); 1TB is somewhat over $200K.

For the right situation that sort of thing maybe makes sense, though I'm SOL if I need high availability (a power outage, a flaky internet connection, a RAM chip going bad, etc. leaves me dead in the water).

So I would need to stand up a million or more in equipment in several places, or just use AWS and suffer the scorn of someone saying 'you could have put that in RAM'. Yes. Yes I could have.

> everyone was arguing for the cloud because who wants to maintain their own hardware?

Well, I keep arguing against that, because you still get 90%+ of the maintenance work, plus some new maintenance work you didn't have before, to avoid some relatively minor hardware maintenance. And you can get most of the benefits of non-cloud deployment with managed hosting where you never have to touch the hardware yourself.

I work both on "cloud only" setups and on physical hardware sitting in racks I manage, and you know what? The operational effort for the cloud setup is far higher even considering it costs me 1.5 hours in just travel time (combined both ways) every time I need to visit the data centre.

For starters, while servers fail and require manual maintenance, those failures are rare compared to the litany of issues I have to protect against in cloud setups because they happen often enough to be a problem. (The majority of the servers I deal with have uptimes in the multi-year range; the average server failure rate is low enough that maintenance cost per server is a single-digit percentage of server and hosting costs.) Secondly, I have to fight against all kinds of design issues with the specific cloud providers that are often sub-optimal and require extra effort (e.g. I lose flexibility to pick the best hardware configurations).

Cloud services have their place, but far too many people just assume they're going to be cheaper, and proceed to spend three times as much as it'd cost them to just buy or lease some hardware, or rent managed hosting services.

Even if you don't want to maintain your own hardware, AWS is almost never cost effective if you keep instances alive more than 6-8 hours of the day in general. Your mileage may vary, of course.

"The cloud" is basically a new non-standard OS to learn.

I am reasonably happy configuring an old school Linux box. Heroku is much more of a pain in the arse to deploy to in my experience, despite much of the work being done for you already. Debugging deployment issues is particularly painful.

Depends on what you mean by massive bloody server. You can get a server with a terabyte of RAM for a price that's insignificant compared to the cost of developing software to run on a cluster.
> You can get a server with a terabyte of RAM for a price that's insignificant compared to the cost of developing software to run on a cluster.

This assumes that a) You're in the valley where average developer salary is $10k a month or more, b) You're a large company paying developers that salary.

There are lots of other places where a) Developers are cheaper, or b) You're a cash strapped startup whose developers are the founder(s) working for free.

The comparison still holds, because if you buy a cluster with X amount of RAM the price will be roughly the same as a single server with X amount of RAM. Except that for some large X there won't be any off-the-shelf servers you can buy with that amount of RAM (let's say 2000GB), but let's be honest here: 99% of companies' needs are under that X, especially if we're talking about startups.
You're assuming there are people capable of building such systems who are ignorant of the fact they can earn that money anywhere in the world.
Amazon offers some bloody huge servers... 32 cores, 256GB RAM, and 48TB of HDD space: d2.8xlarge.
That is $4k a MONTH for 256GB of RAM.

If you could do the same job on a fleet of 8-16GB servers, you can get a lot more CPU for a lot fewer dollars. It depends on whether you really need everything on one machine or not (as of course nothing will beat a single machine for memory locality).

Not true, 16x16GB costs as much as 1x256GB on Amazon. The issue here is that Amazon is hilariously expensive in general. Hetzner will rent you a 256GB server for €460 per month. Or you can buy one from Dell for $5000. These are not high numbers; in 1990 you paid more than that for a "cheap" home computer. For the price of a floppy drive back then you can now get a 32GB server.
rackspace, onmetal-memory[1]: 512 GB, $1650/mo (3.22 $/gb/mo)

softlayer, dual Xeon 2000 Series: 512GB, $1,823.00/mo (3.56 $/gb/mo)

these are on-demand prices. pre-pay, or use a term discount, and it's cheaper.

Build it yourself: You can build a Dell or similar on a 2-Xeon-proc (E5 series); your main limit is getting good prices on 16x 32GB DIMMs. But let's say you can buy the RAM for ~$6500; then it's just dependent on the rest of your kit, so let's say $10,000 flat for the whole server. That's $277.78/mo over 36 months, but you still need network infrastructure, and you might want a new one in 12 months, but you get the general idea.

[1] - http://www.rackspace.com/en-us/cloud/servers/onmetal

and fwiw, costs at amazon will scale linearly with resources. the 1 beefy box with 256GB RAM costs about as much as 16 boxes with 16GB of RAM each.
If you're running a Windows system, licensing costs will be smaller when scaling up than when scaling out, so there is that to bear in mind.
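The per-GB figures in this comment are plain divisions, so they are easy to re-check. A minimal sketch in Python, using only the prices quoted above; the function names are made up for illustration, and the self-built row assumes the 16x 32GB (512GB) box amortized straight-line over 36 months with nothing else (network, power, staff) counted:

```python
# Illustrative helpers; the names are hypothetical, the prices are the ones quoted in the thread.

def cost_per_gb_month(monthly_cost: float, ram_gb: float) -> float:
    """Monthly cost divided by RAM size, in $/GB/month."""
    return monthly_cost / ram_gb


def amortized_monthly(purchase_price: float, months: int) -> float:
    """Straight-line amortization of a purchase price (hardware only)."""
    return purchase_price / months


offers = {
    "rackspace onmetal-memory": (1650.00, 512),
    "softlayer dual Xeon": (1823.00, 512),
    "self-built, 36 months": (amortized_monthly(10_000, 36), 512),
}

for name, (monthly, ram_gb) in offers.items():
    per_gb = cost_per_gb_month(monthly, ram_gb)
    print(f"{name}: ${monthly:,.2f}/mo, {per_gb:.2f} $/GB/mo")
```

On these assumptions the self-built box comes out well under a dollar per GB per month, which is the gap the comment is pointing at; it is not a full cost-of-ownership comparison.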
It may be more exciting, but don't those people know about the CAP theorem?
You should post that if it hasn't been posted already, it's a much better way to make the case than the current link.
Done: https://news.ycombinator.com/item?id=9582060

I originally saw it on HN, but almost two years ago. Old comments: https://news.ycombinator.com/item?id=6398650

There is no point deploying a heavy, complex (and usually pretty slow due to the overheads involved) distributed database, when you could just buy a server with x TB of RAM, load any SQL database on it, and run your queries in a fraction of the time. If your data is so large that it can't fit in the RAM of a single machine, then distributed databases make more sense (since loading data off disk is very slow, modulo SSD).
For some problems SQL would already be way too much overhead.
Could you give a concrete example?

If your working set is small, say 1TB -- and so fits in RAM -- for what kind of problems would using SQL be so much of an overhead that you need a different approach? And what would that approach be?

I suppose you could have a massive set of linear equations that you might be able to fit into 1TB of RAM, but would be difficult to work with as tables in Postgres?

Take a graph, you could use an SQL database to store it and do your graph analysis using SQL, or, alternatively, you could convert your graph to an extremely compact in-memory format and then do your analysis on that. Much better efficiency for the same size problem, bonus: you can now analyze much larger graphs with the same hardware.
Or maybe a bit of both: http://stackoverflow.com/questions/27967093/how-to-aggregate...
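To make the "extremely compact in-memory format" idea a bit more tangible, here is one possible sketch, assuming a compressed sparse row (CSR) layout: the whole graph lives in two flat integer arrays instead of rows in a table. This is not necessarily what the commenter had in mind; the function names and the toy edge list are invented for illustration.

```python
from array import array
from collections import deque


def build_csr(num_nodes, edges):
    """Pack a directed edge list into two flat integer arrays (CSR layout)."""
    # The neighbours of node v are targets[offsets[v]:offsets[v + 1]].
    offsets = array("q", [0] * (num_nodes + 1))
    for src, _ in edges:
        offsets[src + 1] += 1          # count out-degree of each node
    for v in range(num_nodes):
        offsets[v + 1] += offsets[v]   # prefix sum turns counts into offsets
    targets = array("q", [0] * len(edges))
    cursor = offsets[:num_nodes]       # slice copy: next free slot per node
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def bfs_distances(offsets, targets, start):
    """Hop counts from `start`; -1 marks unreachable nodes."""
    dist = array("q", [-1] * (len(offsets) - 1))
    dist[start] = 0
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in targets[offsets[v]:offsets[v + 1]]:
            if dist[w] == -1:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


# Toy graph with 6 nodes; node 5 has no edges at all.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
offsets, targets = build_csr(6, edges)
print(list(bfs_distances(offsets, targets, 0)))  # prints [0, 1, 1, 2, 3, -1]
```

With 8-byte integers this costs roughly 8 bytes per edge plus 8 bytes per node, with no per-row or index overhead, which is the kind of saving the comment is describing.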

I appreciate you taking the time to answer -- and I get that there's a reason for why we have graph databases. But I really meant something more concrete, as in here's a real-world example that isn't feasible to do on machine X with postgresql, but easy(ish) with a proper graph structure/db -- rather than "not all data structures are easy to map to database tables in a space-efficient manner".

Data that fits in RAM doesn't need any "Big Data" solutions.
I believe it's more "no, you don't need a Hadoop cluster of 20 machines, your data fits in the RAM of one machine".
People will build gigantic compute clusters with expensive storage backends when their entire dataset fits in memory.

If it fits in memory, it's going to be orders of magnitude faster to work with than on any other infrastructure you can build.

So the trick is, you take their "big data problem" and hand them a server where everything can be hot in memory and their problem no longer exists.

Right, RAM is an order of magnitude faster than disk, so calculations will be performed very quickly. Big data usually implies clusters of servers because the data won't fit on one server (even on the disk).
Big Data usually implies 'big dollars', not necessarily a large amount of data. Simply use some inefficient algorithm and a datastore with sufficient overhead and you're in Big Data territory.

Re-do the same thing using an optimal algorithm operating on a compact data structure and you make it look easy, fast and cheap. Of course you're not going to make nearly as much money.

Most people who think they have "big data" problems actually don't have "big" data at all.
He's selling himself short.