Progress in performance and scalability with CockroachDB (cockroachlabs.com)
105 points by awoods187 2766 days ago
9 comments

I've just released a small product using CockroachDB; in retrospect it was probably my favourite technical decision. Previously I'd used it as a toy and tested deployment strats but was skeptical (new tech and all that), but now that it's ticking along in the wild I'm very impressed across the board.
What does your infrastructure look like? Do you deploy it in a single datacenter or even in the same rack on a couple of servers?
Are you using enterprise? If not, how are you handling backups?
A few questions:

1) >631851 tpmC

How many servers are needed to achieve this throughput?

2) >4 terabytes of unreplicated, frequently accessed data

4TB unreplicated data? Does that mean if a single node goes down you'll lose data (EDIT: I meant losing availability, not data)? That kinda ruins the whole point of having a distributed database.

3) If I'm reading the KV benchmarks correctly, it takes 5 nodes to achieve 100k tpm. That's 20k tpm per node. That's 333 tps per node. This is a 95% point read benchmark. Why is the tps (333 tps) so low? Is that normal?

4) How does CockroachDB compare to other distributed databases such as TiDB, FoundationDB, ScyllaDB?
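
The per-node arithmetic in question 3 can be checked with a short sketch. The function name is hypothetical; the only inputs are the 100k figure and the 5 nodes quoted in the question. It assumes throughput divides evenly across nodes, which a real cluster will not do exactly.

```python
def per_node_tps(cluster_tpm: float, nodes: int) -> float:
    """Convert a cluster-wide transactions-per-minute figure
    into transactions per second per node (even split assumed)."""
    per_node_tpm = cluster_tpm / nodes  # 100k tpm / 5 nodes = 20k tpm per node
    return per_node_tpm / 60            # 60 seconds per minute


if __name__ == "__main__":
    # Reading the graph as transactions per minute, as the question does.
    print(f"{per_node_tps(100_000, 5):.0f} tps per node")
```

If the axis is really transactions per second, as the author's reply further down says, the same split gives 20k tps per node rather than 333.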

ScyllaDB is still the fastest at a key/value workload with per-query consistency settings and quorum reads/writes across multiple regions. If you need high-performance and low-latency, ScyllaDB wins. They are close to v3.0 which will have global secondary indexes and materialized views to improve data model flexibility. FoundationDB is also key/value but much lower-level and well proven for reliability. Don't have much experience with it and the latest release just introduced multi-regional capabilities, but the general tooling and documentation is still rough and it would take more effort to build a higher-level querying layer or client library.

TiDB is interesting, but missing more features from MySQL than CRDB is missing from PostgreSQL, so it's effective if you want sharding on mysql but will need a few more releases before it gets polished. Vitess and Citus are good options if you just want sharding on top of existing mysql or postgres with full query support within a shard. There's also Yugabyte, which is a multi-modal Redis/Cassandra/SQL offering with multi-regional capabilities.

CRDB is a great product with some of the easiest operations (although key management is a nightmare that they do not have a good plan for). It's fast enough for point-lookups and makes it easy to distribute and replicate your data across zones and regions. All nodes are part of a single cluster so read and write latencies will be high for global deployment, with the enterprise version having a workaround for local regional reads using pinned covering indexes. That works, but further lowers write performance.

It also has trouble with large transactions and the middle ground between OLTP and OLAP with heavy joins. Good choice if you need easy scalability and SQL interface over performance and complex queries.

Hi! I work for PingCAP, the company behind TiDB and come from previously working on MySQL.

The gap of features missing is documented here: https://www.pingcap.com/docs/sql/mysql-compatibility/

I would rate compatibility as actually pretty good: all but one SQL mode is supported (which is a feat in itself), and most of the SQL functions are supported.

There are some exceptions though, some of which are addressable (missing functions) and some of which are not (often a property of being an optimistic system).

We try to be as transparent as possible on this, which might be part of the reason why you feel there is a lot missing?

If you have specific examples, I would be happy to clarify. We also have a course for MySQL DBAs, designed to make adoption easier: https://www.pingcap.com/tidb-academy

Good to see the progress, I was looking at the roadmap page: https://github.com/pingcap/docs/blob/master/ROADMAP.md

Views and CTEs are probably the biggest missing pieces now.

The technical design for views was recently completed, and I expect to see them added soon :-)

Window functions & CTEs are only very recent features in MySQL 8.0 (TiDB is 5.7 compatible). Nonetheless, they are important for HTAP workloads, and I'm looking forward to seeing them too.

I'm the author of the post.

1. Between 90 and 135 16 vCPU nodes depending on cloud hardware.

2. The cluster replicates this data three ways across all three nodes (so the cluster actually contains 12+ TB of data), ensuring high availability. We intentionally reported the unreplicated number for clarity and comparison to the TPC-C spec.

3. Our graph is mislabeled. It should read transactions per second `tps`. Nice catch!

4. We can't comment on other database performance as they haven't released any TPC-C numbers.
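
The figures in answers 1 and 2 can be combined in a quick back-of-the-envelope sketch. The function names are hypothetical; the inputs are the 631851 tpmC and 4 TB quoted in the questions plus the node range and three-way replication stated above. The even per-node split is an assumption for illustration only.

```python
def replicated_size_tb(unreplicated_tb: float, replication_factor: int) -> float:
    """Total data stored across the cluster once every range is replicated."""
    return unreplicated_tb * replication_factor


def tpmc_per_node(total_tpmc: float, nodes: int) -> float:
    """Average tpmC per node, assuming load is spread evenly."""
    return total_tpmc / nodes


if __name__ == "__main__":
    # 4 TB unreplicated, replicated three ways
    print(f"{replicated_size_tb(4, 3):.0f} TB stored in total")

    # 631851 tpmC over the stated range of 90 to 135 nodes
    for nodes in (90, 135):
        print(f"{nodes} nodes: about {tpmc_per_node(631_851, nodes):.0f} tpmC per node")
```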

“Between 90 and 135 16 vCPU nodes depending on cloud hardware” How many nodes did you use in the CRDB 2.0 TPC-C 10k benchmark? Could I say that the "5x increment" is under the same hardware conditions? Thanks!
This is a crazy multiple! Anyone from the Cockroach team up for sharing what the key innovations were that are driving the improved performance?
I'm the author. We've introduced transactional write pipelining (covered in a forthcoming blog post), load-aware rebalancing, and completed general performance tuning which all contribute to our improved performance numbers.
I was wondering, quite unrelated to the article, if anyone knows if CockroachDB would be suited for small databases (and comparably modest computing/memory resources). I very much like its distributed properties, but only have a simple table of usernames and corresponding cryptographic material. Is CRDB easy to run and manage?
We have clients using CRDB in pretty constrained environments, and they use it primarily because of the easy administration. I think you'll find it easier to use than a MySQL or Postgres, for example.
I would expect your use case would be better served by just using Postgres. However, if you do need to scale to the point where you'd need to distribute your database and take advantage of CRDB's capabilities, it uses the Postgres protocol, so you most likely can just migrate your data and use the same code.
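As a rough sketch of the "same code" point: because both databases speak the Postgres wire protocol, a migration can in the simplest case come down to changing the connection string. The helper, host names, user, and database below are hypothetical. The ports are the usual defaults (5432 for Postgres, 26257 for CockroachDB), and a Postgres driver such as psycopg2 would be expected to accept either string. SQL feature differences are not covered here.

```python
from urllib.parse import quote, urlencode


def pg_dsn(user: str, host: str, port: int, database: str, sslmode: str = "require") -> str:
    """Build a libpq-style connection URL (hypothetical helper for illustration)."""
    query = urlencode({"sslmode": sslmode})
    return f"postgresql://{quote(user)}@{host}:{port}/{quote(database)}?{query}"


if __name__ == "__main__":
    # Hypothetical hosts; only the host and port differ between the two targets.
    print(pg_dsn("app", "pg.internal", 5432, "users"))
    print(pg_dsn("app", "crdb.internal", 26257, "users"))
```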
Sqlite would be my first recommendation, unless you need client/server access.
I think the GP's stated need for replication would preclude SQLite unless one's willing to write one's own replication system.
Where's the stated need for replication?
"I very much like its distributed properties"
I’m currently evaluating this as an alternative to Vitess + Percona MySQL. But strict serializability has its limitations.
What do you mean by "strict serializability has its limitations"? Do you need something stricter, or do you need something weaker for some reason?
In the same way single threaded has limitations.
Why as an alternative to vitess? Are you hitting limits with vitess?
Because I believe the overhead is not worth it. I’d rather a database be cloud native and have things like sharding and scaling, and the coordination of that, built in. I grok the CockroachDB way of doing things much more than the Vitess way. And, shocking fact: most companies aren’t Google in size and don’t really need a Vitess or CockroachDB, but get swept up in the Cloud, Cloud, Cloud! craze. Something built in, versus the seemingly brittle nature of vtablets, vtgates, etc., is valuable to me.
Now we just need some benchmarks on a reasonable size dataset like 100TB and up
Not an easy benchmark..
One would imagine they are testing with much larger datasets internally.
I'm totally new to cockroach so I have 2 questions..

1. Is there a managed service of this db where it auto scales, does geo replication etc all by itself?

2. Is there any really good book on cockroachdb?

We released a managed version at the end of October, with auto-scaling, geo-replication, etc. --> https://www.cockroachlabs.com/product/managed/

Not sure that there are any books on it yet.

The managed service doesn't autoscale, it's provisioned capacity by cores. We just did a call about it.
Our managed service is currently provisioned by cores. We automatically add nodes to your cluster based on your usage. You can also request to add more nodes if you anticipate spikes.
Thank you..
At what point is it cost effective to run CRDB vs PostgreSQL?
Someone should create a fork with a work-safe name. CockroachDB brings connotations of cockroaches, which are known for eating almost everything and living almost everywhere.
you know what's ironic? that in every thread there's one of you people complaining about the name - you're all just like cockroaches! no matter how successful cockroachdb becomes, no matter how technically impressive the product becomes, the naysayers never die.

can you imagine 20 years ago someone complaining that google wasn't safe for work because it had a silly name?

newsflash dummy: stop saying/thinking/repeating stupid things like this and it'll stop being the case that everyone is so conservative that silly names are inadmissible.

That attitude does not fly with your boss.

Why limit the adoption for no good reason by choosing a weird name intentionally?

do you not understand perpetuation? "doesn't fly with my boss" ---> "won't fly when I'm boss". also how about having a conversation on the merits? does that fly with your boss? does with mine.
There's also the irony in their name - anticensor.
That's the idea though — your data will be hard to kill.
In their presentation they have it abbreviated as CRDB, so that might be a viable direction without pain.
> known by eating almost everything and living almost everywhere

Good attributes for a DB though?

I have to admit, the name does set me on edge slightly. shudder