by nsanch 5126 days ago
Unless he's planning to build sharding on postgres too, I think he's missing the point.
5 comments

Should be pretty trivial for him to shard collections across multiple databases with the same level of intelligence as MongoDB's automated sharding simply using the primary key, since MongoDB doesn't join.
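To make the "simply using the primary key" idea concrete, here is a minimal Python sketch of routing a document to a shard by hashing its key. The shard names and the `shard_for` helper are made up for illustration, and this is plain modulo hashing, not a reproduction of what MongoDB does internally.

```python
import hashlib

# Hypothetical shard names; in a real setup these would be connection strings.
SHARDS = ["pg-shard-0", "pg-shard-1", "pg-shard-2", "pg-shard-3"]

def shard_for(primary_key: str, shards=SHARDS) -> str:
    """Pick a shard from a stable hash of the primary key."""
    # hashlib gives the same answer in every process, unlike Python's
    # built-in hash(), which is salted per process for strings.
    digest = hashlib.sha1(primary_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[bucket]
```

The known cost of plain modulo routing is that changing the number of shards remaps most keys, so data has to move when the cluster is resized.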

Not sure how sharding is the point of MongoDB, though - in most of the universe, sharding is a database architecture/schema-level thing, not a database-server level thing, and for good reason - it's pretty darn hard for a database server to shard effectively without some knowledge of the app layer (and which keys are likely to become hot).

Personally, if I really wanted a flexible-schema "document based" database, I'd have implemented this using the FriendFeed K/V + Index model ( http://backchannel.org/blog/friendfeed-schemaless-mysql ) plus Postgres's HStore functionality, storing K/V per document in an HStore rather than in one giant K/V table like FriendFeed. That way I wouldn't need to use V8 and JSON parsing to run queries, and the mythic MongoDB-style "sharding" would be just as easy (just distribute the document -> hstore table across shards keyed on ID again).
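A rough sketch of that layout, in Python with in-memory dicts rather than a real database: `docs` stands in for a table with one hstore-like column of string keys and values per document, and `index` stands in for the separate index tables of the FriendFeed model. The `DocStore` class and its method names are hypothetical.

```python
from collections import defaultdict

class DocStore:
    def __init__(self, indexed_keys):
        self.docs = {}                      # doc_id -> {key: value}, one K/V map per document
        self.indexed_keys = set(indexed_keys)
        self.index = defaultdict(set)       # (key, value) -> ids of matching documents

    def put(self, doc_id, attrs):
        # Drop index entries that point at the previous version of the document.
        old = self.docs.get(doc_id, {})
        for key in old.keys() & self.indexed_keys:
            self.index[(key, old[key])].discard(doc_id)
        # Values are stored as strings, as they would be in an hstore.
        new = {k: str(v) for k, v in attrs.items()}
        self.docs[doc_id] = new
        for key in new.keys() & self.indexed_keys:
            self.index[(key, new[key])].add(doc_id)

    def find(self, key, value):
        value = str(value)
        if key in self.indexed_keys:
            # Indexed lookup: no scan over the documents.
            return sorted(self.index.get((key, value), ()))
        # Unindexed keys fall back to scanning every document.
        return sorted(i for i, d in self.docs.items() if d.get(key) == value)
```

Since every lookup starts from the document ID or an index entry that yields IDs, the whole store can be split across shards keyed on ID.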

I disagree that sharding can ever be a trivial problem if you're going to try to tackle moving data between shards while staying online. I'm not saying it's impossible, just that it's not trivial.
MongoDB's relatively simple approach involves continuing to use the old shard as an authoritative source (and committing updates to it) while shipping data in the background, then pushing the additional changes across and marking the new shard as "master."
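The hand-off described above can be sketched in Python, with in-memory dicts standing in for shards. `Shard`, `migrate_chunk`, and the `concurrent_writes` parameter are hypothetical names; this only illustrates the sequence as described in this comment and is not MongoDB's code.

```python
class Shard:
    """In-memory stand-in for one shard (hypothetical)."""
    def __init__(self, name):
        self.name = name
        self.rows = {}

def migrate_chunk(keys, source, target, concurrent_writes=()):
    """Move `keys` from source to target; returns the name of the new owner.

    `concurrent_writes` simulates (key, value) writes to the chunk that
    arrive while the background copy is running.
    """
    # Phase 1: background copy. The source is still authoritative.
    for key in keys:
        if key in source.rows:
            target.rows[key] = source.rows[key]

    # Writes during the copy are committed to the source and remembered.
    pending = []
    for key, value in concurrent_writes:
        source.rows[key] = value
        pending.append((key, value))

    # Phase 2: push the remembered changes across so the target catches up.
    for key, value in pending:
        target.rows[key] = value

    # Phase 3: flip ownership and clear the moved keys off the old shard.
    for key in keys:
        source.rows.pop(key, None)
    return target.name
```

The hard part in practice is that phase 2 has to finish faster than new writes arrive, which this single-threaded sketch does not model.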

Such an approach wouldn't be horribly difficult to implement in SQL using a copy table and write triggers - almost identically to how SoundCloud's Large Hadron Migrator allows writes to occur over a MySQL InnoDB table that's locked for migration (but even simpler because the table schema can't conflict afterwards).
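A small runnable sketch of the copy-table-plus-triggers idea, using Python's built-in sqlite3 module only so that it runs without a server. The table names (`docs`, `docs_copy`), the trigger names, and the `migrate_with_triggers` function are all assumptions for illustration, and the copy table lives in the same database here, where a real migration would target another shard. This is not Large Hadron Migrator's code.

```python
import sqlite3

def migrate_with_triggers():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO docs VALUES (?, ?)",
                     [(1, "a"), (2, "b"), (3, "c")])

    # Create the copy table and mirror every later write into it.
    conn.executescript("""
    CREATE TABLE docs_copy (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TRIGGER docs_ins AFTER INSERT ON docs BEGIN
        INSERT OR REPLACE INTO docs_copy (id, body) VALUES (NEW.id, NEW.body);
    END;
    CREATE TRIGGER docs_upd AFTER UPDATE ON docs BEGIN
        INSERT OR REPLACE INTO docs_copy (id, body) VALUES (NEW.id, NEW.body);
    END;
    CREATE TRIGGER docs_del AFTER DELETE ON docs BEGIN
        DELETE FROM docs_copy WHERE id = OLD.id;
    END;
    """)

    # Application writes keep landing on the live table during the migration.
    conn.execute("UPDATE docs SET body = 'b2' WHERE id = 2")
    conn.execute("INSERT INTO docs VALUES (4, 'd')")
    conn.execute("DELETE FROM docs WHERE id = 3")

    # Backfill: copy the old rows, skipping any the triggers already wrote.
    conn.execute("INSERT OR IGNORE INTO docs_copy SELECT id, body FROM docs")

    return conn.execute("SELECT id, body FROM docs_copy ORDER BY id").fetchall()
```

Using `INSERT OR IGNORE` for the backfill means a row that was already updated through a trigger is never overwritten with a stale value.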

The entire problem is admittedly nontrivial, since if the application happens to be writing data to the shard under migration too quickly (or the shard being migrated to dies), the server can end up in a situation where the new shard is never able to catch up and become a master. However, the easy solution (give up and retry later) is Good Enough for most situations (and is pretty much how MongoDB works).

"Unless he's planning to build sharding on postgres too, I think he's missing the point."

NoSQL seems to have three general claims (I'm not saying whether these are correct or not): (1) ease of administration in some cases; (2) different data model; (3) better performance or availability in some situations.

The author is clearly addressing the second, and you are clearly talking about the third.

Good point.
You have a good point. I must admit that after many years of using and loving PostgreSQL I have never had to scale it out: that is outside my experience.

On the other hand, old-fashioned MongoDB master/slave configurations and, more recently, replica sets have been easy for me to set up on the few occasions when I had to use a scaled-out MongoDB setup.

While sharding is an important aspect of MongoDB, I don't consider it the most important feature.
I don't know if it's the _most_ important feature, but I wouldn't build a serious site on top of anything that didn't have some sort of built-in sharding story.

With postgres you have to roll your own. If you want to bridge the gap from postgres to mongo, I think that's where you have to start.

"but I wouldn't build a serious site on top of anything that didn't have some sort of built-in sharding story."

There are many serious sites that don't need sharding.

Fair, my statement was overly broad. Sites that are read-only or store blob data in something like S3 can often avoid sharding for quite a while and rely on machines to just get bigger over time.

That said, if your site grows in some way you didn't originally anticipate and you get to a point where you need to shard, but can only do so by changing data stores, then it's sad.

> That said, if your site grows in some way you didn't originally anticipate and you get to a point where you need to shard, but can only do so by changing data stores, then it's sad.

I don't agree. Almost every site that grows in ways that weren't anticipated (or at a scale that wasn't anticipated) will have to make technical changes. If you don't need to make any changes it's almost certainly the case that you originally over-engineered. If you're optimizing for cases that you don't anticipate, I don't know what to call it other than over-engineering. Facebook started simple and only made Cassandra when they needed to, Google didn't have BigTable when they started, etc etc.

" Sites that are read-only or store blob data in something like S3 can often avoid sharding"

Still too broad. Sorry, this is a pet peeve of mine, where tech people assume that everyone else has the same issues as them. For example, I worked on an ecommerce site that made over a mil a year. They had less than 10k products, and will never need sharding. They are not read-only, they have people updating their products on a daily basis through the site.

Sorry, those were meant to be examples, not an exhaustive list.
"That said, if your site grows in some way you didn't originally anticipate and you get to a point where you need to shard, but can only do so by changing data stores, then it's sad."

I think you're being too absolute. For instance, Instagram used sharding in postgres, and they didn't have to throw anything away or dedicate any huge engineering team to solve it.

They had to put engineering effort into it. With Mongo, you don't.
I would say most sites will do just fine without sharding. You can get very far by just scaling up with a more expensive database server and caching the most common read operations. Some of the largest websites in the world do not need to do more than this.
Hi, disclaimer - I work for ScaleBase, which provides truly automated, transparent sharding, so I have lived and breathed sharding for 4 years now...

The main problem is user/session concurrency. On one machine, it becomes a killer at some (nearby) point. A DB does much more work for every write than for a read (look at my blog here: http://database-scalability.blogspot.com/2012/05/were-in-big...). The limit is here and now: even 100 heavy writing sessions will choke MySQL (or any SQL DB...) on any hardware.

Catch 22: Scale-out to repl slaves with R/W splitting? This can lower read load on the master DB, but read load can be better lowered by caching. The problem is writes and small supporting transactional reads, and slaves won't help. Distributing data (sharding?) is the only way to distribute write intensive load, and it also helps reads by putting them on smaller chunks, and parallelizing them is a sweet sweet bonus :)

From what I see around (hundreds of medium-to-large sites), there's no other way...

And one final word about the cloud: "one DB machine" is confined to a rather modest virtualized compute and I/O space... In the cloud, the limits are here and now! Cloud is all about elasticity and scale-out.

Hope I helped! Doron

Like? I don't know of ANY decent-sized website that uses a single database server.
Postgres-XC is also in beta. Basically it's sharded PostgreSQL with full RI enforcement between shards, and seamless query integration. I assume they are working off the 9.2 codebase (hence the beta being the same time as Pg 9.2) but maybe it's only 9.1 (i.e. no JSON).

Postgres-XC is probably the most exciting PostgreSQL-related project out there. It promises full write-extensibility across the cluster without sacrificing consistency.

> I don't know if it's the _most_ important feature, but I wouldn't build a serious site on top of anything that didn't have some sort of built-in sharding story.

You know people say this, but in practice I find that simple hash bucketing with a redundant pair works surprisingly well, particularly in the cloud. Yes it isn't fancy, but it is trivial to manage and debug, and you can do a lot of optimizations given such a clear cut set of partitioning rules.
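For what it's worth, "simple hash bucketing with a redundant pair" can be sketched in a few lines of Python. The bucket count, the host names, and the `bucket_for` / `hosts_for` helpers are all made up for illustration; this is one possible reading of the idea, not a description of any particular deployment.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical fixed bucket count, chosen up front

# Each bucket is owned by a redundant pair: (primary, replica).
PAIRS = {b: (f"db{b}-a", f"db{b}-b") for b in range(NUM_BUCKETS)}

def bucket_for(key: str) -> int:
    """Map a key to a bucket with a stable checksum."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS

def hosts_for(key: str, down=frozenset()):
    """Return the live hosts of the pair that owns `key`, primary first."""
    return [host for host in PAIRS[bucket_for(key)] if host not in down]
```

Because the rule is a pure function of the key, debugging "where does this row live?" is a one-liner, which is most of the appeal.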

Your problems have to get really big before a more sophisticated mechanism really pays off in terms of avoiding headaches, and often the more sophisticated mechanisms actually cause more headaches before you get there.

"With postgres you have to roll your own [sharding]"

But it will be trivial if you do not do joins.

It also seems to be the flakiest feature.
That doesn't mean that others don't find it to be an extremely important feature.
Easily enough done on Postgres-XC, and you still get full ACID compliance across shards as well!