| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jamwt 5475 days ago

Without going too much into specifics and picking on individual databases (which I could do, boy do I have the scars...)

When you hit a certain traffic level, scalability, latency and robustness become far more important than single-node ops/s. I need to be able to add nodes and repair failed nodes while under load--I need the 99.9% latency mark to stay ~100ms while doing so. I don't really care how many bajillions of ops a second your database can do in some concocted scenario, b/c you're not going to do that many in the real world anyway (trust me, we tried). The disk subsystem is going to give you a few hundred, maybe a few thousand if you're lucky, IOPS, then your latency will spike to hell and your phone will wake you up at night.

Maybe in the world where 99% of ops are reads, you will put up impressive numbers, but now you're just showing you are pretty good at using the disk cache. That's a relatively easy problem.

The riak guys seem to get all this better than most: http://blog.basho.com/2011/05/11/Lies-Damn-Lies-And-NoSQL/

So, to give you a short answer to your direct question:

Use SLC SSDs + md + RAID-0. Have at least 5 nodes. Use bitcask, but realize that your keys will need to fit in memory. Also, realize that really small values aren't a great fit for Riak in some ways b/c the overhead per value is at least a few hundred bytes.

Also, it's important to note this is where I'm at right now, but maybe not where you (generally) are at. Riak may not make you happy at server #1, but it will make you pretty happy at server 10 and server 100.

Riak's sweet spot is people with scaling pains. If you only need a server or two to try some stuff, and you don't have any users yet, you might cause yourself more headaches than you need. Sometimes you don't need a locomotive, you need a motorcycle.

(These guys have a pretty great motorcycle: http://rethinkdb.com/ )

2 comments

grncdr 5475 days ago

For everything you've said about Riak re: stability and happy scalability, that's exactly why I was excited to include it in this test. I would like to prevent myself/my team from acquiring too many of those scars, so please elaborate on what caused them ;) Especially if those scars came from Cassandra or HBase!

The test hasn't been a "concocted scenario", it's measuring the performance[1] of a prototype implementations for what will be an essential piece of our infrastructure and process (bulk loads of large numbers of small records, very read heavy after the initial load). Riak's write performance was completely adequate, just nowhere near what we got out-of-the-box with the bulk insert operations available in Cassandra and HBase. I asked on #riak channel on freenode and got told to use protocol buffers (which we already were), I'd really appreciate advice beyond this.

> Also, realize that really small values aren't a great fit for Riak in some ways b/c the overhead per value is at least a few hundred bytes.

This is pretty much what I've chalked it up to. It's unfortunate because that is the use case for which we currently need to provide a solution for right now, and once we've got some of our data in one distributed data store, it's convenient (and considered less risky) to use that same technology for the next project. (This is really a culture thing though, it's taking us months to get the necessary buy-in and approval for a postgres 8.2 -> 9.0 upgrade rolled out for a different product, where we know it would solve a specific issue we have).

[1] We've been running our tests on a 4 node cluster, each node has an 8 core 2.8ghz xeon, 32gb of ram, and a woefully inadequate disk: the machines were repurposed from a system that required them to have redundancy and didn't require write performance, so the drives are RAID1. We also need to make recommendations to IT for their hardware purchase plan after our testing.

link

jamwt 5475 days ago

> The test hasn't been a "concocted scenario"

Btw, b/c I can totally see why you'd read it that way, that particular barb wasn't directed at you, more directed at some of the public benchmarks touted by (non-distributed) NoSQL database systems.

Re: cassandra, please see my reply to the sibling on this thread.

Also, feel free to email me jamie@bu.mp if I can answer any specific questions for you with things we ran into with various database systems.

link

StavrosK 5475 days ago

Can you make a blog post about the issues instead? It would greatly benefit everyone.

link

jamwt 5474 days ago

I'd love to, but there are things politically complicated about all this in the very small startup world. So maybe one day, but I can't do it right now.

link

StavrosK 5474 days ago

Ah, that's too bad. Thanks anyway.

link

slysf 5475 days ago

Have you looked into using ets storage instead of bitcask storage? Millions of keys at ~12bytes value would fit just fine in memory without the bitcask overhead and as long as your N value is > 1 you shouldn't have to worry about data loss unless your whole cluster loses power.

link

bretthoerner 5475 days ago

Was your experience with Cassandra different? Happy at server 10 and 100, that is?

link

jamwt 5475 days ago

TBH, we didn't seriously pursue Cassandra when we considered distributed database systems b/c the vast majority of "back-reference checks" we did on the YC network and other area startups was "stay away."

We got some very frank advice from some people whose opinions on databases I take very seriously to stay away, including reports from within FB.

Having said that, I cannot claim to have firsthand proven or disproven anything about Cassandra.

link

ericflo 5475 days ago

Facebook for a long time didn't use the vastly (and I mean vastly) improved open source version of Cassandra, instead opting for their internal fork. Instead of choosing to do so, I believe they have now switched to HBase, mainly for its easier consistency model. So I would take their advice with a grain of salt, because it's probably based on their experiences with an old fork.

There are a few people (YC companies even, alas) who are very vocally negative about Cassandra, but I also saw some of those same people ignoring direct advice given to them in #cassandra on IRC, and then turning around and bashing it when it didn't work as planned. Simply following the advice could have made for a completely different story.

I suppose the lesson to learn is that you need to develop software in a way that simply won't allow developers to shoot themselves in the foot, because people never want to blame themselves for doing it, they blame the gun.

link

ahik 5475 days ago

Eric, Jonathan Gray said it very clear at his talk at BerlinBuzz: Facebook is now using HBase instead of Cassandra. http://berlinbuzzwords.de/content/realtime-big-data-facebook... You can find a lot of info about the FB process to choose HBase in favor of Cassandra. This one for example: http://facility9.com/2010/11/18/facebook-messaging-hbase-com...

link

ericflo 5474 days ago

That's exactly what I'm talking about: that facility9 blog post explaining why they chose HBase had many factual errors about Cassandra when it was posted, and had to be revised after several respected people in the space contacted the author.

link

espeed 5475 days ago

Quora's decisions to not use Cassandra and Adam's answer regarding it lead me to the same conclusion (http://www.quora.com/Quora-Infrastructure/Why-does-Quora-use...). Evidently few from Facebook are advocating it.

link

dehora 5474 days ago

Quora on MySQL failed outright when AWS EBS failed, companies on AWS using Cassandra like SimpleGeo and NetFlix did not. To their credit, Facebook were clear enough on their reasons for using HBase over MySQL and Cassandra, such as wanting to double down on their current Hadoop system/knowledge and having easily obtainable ordering guarantees on messages. It's also clear they've invested in making HBase good enough.

At large loads and footprints, imvho, Riak, Cassandra and HBase present viable options. But there are some factors to consider that don't seem to get mentioned in the pop tech press

- What are you able to operate in production?

- What are you able/willing to debug and patch?

- What hardware options do you have?

- What are your workloads?

- Which variable of C.A.P, when you lose it, most damages your business?

- Will your company's choices be evaluated in the press?

- Does your board/investors have capital tied up in business's that are using something else?

- What architecture tradeoffs and styles sit well with you?

- What kind of data access and consumption patterns make you money?

- Can you pay for help?

The right choice is context sensitive, and I'm fairly sure for this class of systems at this point in time, there's no free lunch. That means you have to do the legwork for yourself and make your own choices and commitments; doing what you heard worked for someone else is a cargo cult.

link

espeed 5474 days ago

Or maybe it's wisdom. "A fool learns from his mistakes, but the truly wise learn from the mistakes of others." -- Otto von Bismarck

link