Hacker News new | ask | show | jobs
by nsanch 5126 days ago
I don't know if it's the _most_ important feature, but I wouldn't build a serious site on top of anything that didn't have some sort of built-in sharding story.

With postgres you have to roll your own. If you want to bridge the gap from postgres to mongo, I think that's where you have to start.

4 comments

"but I wouldn't build a serious site on top of anything that didn't have some sort of built-in sharding story."

There are many serious sites that don't need sharding.

Fair, my statement was overly broad. Sites that are read-only or store blob data in something like S3 can often avoid sharding for quite a while and rely on machines to just get bigger over time.

That said, if your site grows in some way you didn't originally anticipate and you get to a point where you need to shard, but can only do so by changing data stores, then it's sad.

> That said, if your site grows in some way you didn't originally anticipate and you get to a point where you need to shard, but can only do so by changing data stores, then it's sad.

I don't agree. Almost every site that grows in ways that weren't anticipated (or at a scale that wasn't anticipated) will have to make technical changes. If you don't need to make any changes it's almost certainly the case that you originally over-engineered. If you're optimizing for cases that you don't anticipate, I don't know what to call it other than over-engineering. Facebook started simple and only made Cassandra when they needed to, Google didn't have BigTable when they started, etc etc.

" Sites that are read-only or store blob data in something like S3 can often avoid sharding"

Still too broad. Sorry, this is a pet peeve of mine, where tech people assume that everyone else has the same issues as them. For example, I worked on an ecommerce site that made over a mil a year. They had less than 10k products, and will never need sharding. They are not read-only, they have people updating their products on a daily basis through the site.

Sorry, those were meant to be examples, not an exhaustive list.
Let me flip it around. You said, "I wouldn't build a serious site on top of anything that didn't have some sort of built-in sharding story".

I think the opposite. Most serious sites will never need sharding.

My definition of seriousness includes some relatively large scale. Your definition of seriousness appears to mean any site that is important to the person or business running it. Is that a fair assessment?

I think your definition is better, and I should've said "a potentially-large site" or something like that.

What do you mean by "serious"? That's broadly stated as well.
You're joking, right ?

Every large site on the internet has talked about strategies around sharding. At some point you are surely going to hit the physical limitations of one database on one server.

"That said, if your site grows in some way you didn't originally anticipate and you get to a point where you need to shard, but can only do so by changing data stores, then it's sad."

I think you're being too absolute. For instance, Instagram used sharding in postgres, and they didn't have to throw anything away or dedicate any huge engineering team to solve it.

They had to put engineering effort into it. With Mongo, you don't.
With Mongo, you don't.

Bullshit.

The sharding impl in MongoDB still[1] crumbles pitifully[2] under load. Regardless of sharding MongoDB still halts the world[3] under write-load.

Their map/reduce impl is a joke[4][5].

If you had done the slightest research you'd know that every single aspect that you need to scale out Mongo is either broken by design or so immature that you can't rely on it.

MongoDB may be fine as long as your working set fits into RAM on a single server. If you plan to go beyond that then you'd better start with a more stable foundation - or brace yourself for some serious pain.

[1] https://groups.google.com/group/mongodb-user/browse_thread/t...

[2] http://highscalability.com/blog/2010/10/15/troubles-with-sha...

[3] http://2.bp.blogspot.com/_VHQJkYQ5-dY/TUO3RAn8SNI/AAAAAAAABq...

[4] http://stackoverflow.com/a/3951871

[5] http://steveeichert.com/2010/03/31/data-analysis-using-mongo...

[1] looks like an example where the data didn't fit in RAM. Mongo works best when data fits in RAM or if you use SSD's. Yes, it's sub-optimal.

[2] is from a year and a half ago. It doesn't belong in a sentence that includes the word "still." I work at foursquare, btw. Those outages happened on my first and second days at the company. I wasn't so keen on mongo then either. We've gotten much better at administering it. Basically all our data is in mongo, and it has its flaws, but I'm still glad we use it.

[3] is also from a year and a half ago. Mongo 2.2 will have a per-database write lock, which is at least progress, even though it's obviously not enough. Since 2.0 (or 1.8?) it's also gotten better at yielding during long writes.

I have no experience with their mapreduce impl and can't speak to it.

Well that's just factually incorrect. MongoDB now has a per-database write lock and will have a per-collection write lock in the next version. So your halt under write-load statement is incorrect.

The map reduce implementation is quite new sure. But it is getting better and you can always link it up with Hadoop.

At the very least provide links that aren't nearly 2 years old.

I attended a talk by Instagram post-buyout. That's where I got the impression that sharding was not a huge obstacle for them (though it was significant). Keep in mind their entire data management team was 2 people I think.

My point was that sharding is not an absolute "have it or not". Some features require major engineering efforts to get anywhere at all, but sharding is not one of them.

But if you think the overall effort is less with MongoDB then go for it.

I would say most sites will do just fine without sharding. You can get very far by just scaling up with a more expensive database server and caching the most common read operations. Some of the largest websites in the world do not need to do more than this.
Hi, disclaimer - I work for ScaleBase, giving a true automated transparent sharding, so I live and breath sharding for 4 years now...

The main problem is user/session concurrency. On one machine - it kills at some (near) point. A DB is doing much more for every write then reads (look at my blog here: http://database-scalability.blogspot.com/2012/05/were-in-big...). The limit is here and now, even 100 heavy writing sessions will choke the MySQL (or any SQL DB...) on any hardware.

Catch 22: Scale-out to repl slaves with R/W splitting? This can lower read load on the master DB, but read load can be better lowered by caching. The problem is writes and small supporting transactional reads, and slaves won't help. Distributing data (sharding?) is the only way to distribute write intensive load, and it also helps reads by putting them on smaller chunks, and parallelizing them is a sweet sweet bonus :)

As I see around (hundreds of medium-large sites) - there's no other way...

And one final word about the cloud: "one DB machine" is limited to a rather limited non-powerful virtualized compute and I/O space... In the cloud limits are here and now! Cloud is all about elasticity and scale-out.

Hope I helped! Doron

Like ? I don't know of ANY decent sized website that uses a single database server.
An example is stack overflow. They've been able to scale up instead of scaling out.

http://highscalability.com/blog/2009/8/5/stack-overflow-arch... is the link I can find right now.

I'm not sure why you said "a single database server". No one said that. We're talking about sharding across databases, and the vast majority of sites don't need that.

Having a master/slave setup is more than a single server, but it's NOT the same as sharding.

On top of my head I can think of two: Wikipedia and leboncoin.fr (one of the largest websites in france).

EDIT: Both run read-only slaves. I was only talking about no sharding.

Postgres-XC is also in beta. Basically it's sharded PostgreSQL with full RI enforcement between shards, and seamless query integration. I assume they are working off the 9.2 codebase (hence the beta being the same time as Pg 9.2) but maybe it's only 9.1 (i.e. no JSON).

Postgres-XC is probably the most exciting PostgreSQL-related project out there. It promises full write-extensibility across the cluster without sacrificing consistency.

> I don't know if it's the _most_ important feature, but I wouldn't build a serious site on top of anything that didn't have some sort of built-in sharding story.

You know people say this, but in practice I find that simple hash bucketing with a redundant pair works surprisingly well, particularly in the cloud. Yes it isn't fancy, but it is trivial to manage and debug, and you can do a lot of optimizations given such a clear cut set of partitioning rules.

Your problems have to get really big before a more sophisticated mechanism really pays off in terms of avoiding headaches, and often the more sophisticated mechanisms actually cause more headaches before you get there.

"With postgres you have to roll your own [sharding]"

but it will be trivial if you do not do joins.