by monkey26 4515 days ago
This article caught my interest as I've been reading up on Cassandra. But some previous research had me thinking that Cassandra works best with under 1TB per node. Is SQL still better when you have really large nodes (16-32TB) and only really want to scale out for more storage?

I'm currently humming along happily with Postgres, but some of Cassandra's distributed features and its availability look really nice.

3 comments

Cassandra 2.0 can handle 5TB per node easily, 10TB with some care. Best to scale out, not up.

That said, if someone else has already made the hardware choice for you, you can always run multiple C* nodes on a single machine. I know several production clusters that fit this description.
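
To put rough numbers on that, here is a back-of-the-envelope sketch in Python of how many C* instances a 16-32TB machine would need at the 5TB and 10TB per-node figures above. The function name is my own, and the calculation is an assumption-laden simplification: it ignores replication, compaction headroom, and CPU/RAM limits, so treat it as a starting point rather than a sizing rule.

```python
import math


def instances_needed(machine_tb: float, per_instance_tb: float) -> int:
    """Hypothetical helper: instances required so that no single
    instance holds more than per_instance_tb of the machine's data."""
    if machine_tb <= 0 or per_instance_tb <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(machine_tb / per_instance_tb)


if __name__ == "__main__":
    # Machine sizes come from the question; per-instance sizes from the
    # "5TB easily, 10TB with some care" figures in this comment.
    for machine_tb in (16, 32):
        for per_instance_tb in (5, 10):
            n = instances_needed(machine_tb, per_instance_tb)
            print(f"{machine_tb}TB machine at {per_instance_tb}TB/instance: {n} instances")
```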

PostgreSQL can handle petabytes easily. However, if you need to query a petabyte of data, then you need to rethink your solution. PrestoDB + Hive + Hadoop may be what you need.
So can you put petabytes on one server? Or can PostgreSQL shard?
It's much more about desired usage patterns than amount of storage. Cassandra and RDBMSs differ quite a lot in how you replicate, their consistency guarantees, which read and write patterns perform well, how you handle recovery, etc. If you intend to bring anything to scale, it helps to understand the strengths and weaknesses of the underlying architecture.
We run at about 1TB a node and it works well (high-write-load things like metrics and telemetry data). But we also use SQL server where appropriate (i.e. transactional account stuff).

I am a fan of using the right tool for the right job, provided you have the team to support it.