Hacker News new | ask | show | jobs
by dantiberian 1207 days ago
How much of the latency reduction was due to the data services layer with request coalescing and how much was from ScyllaDB? Did the schema and other factors stay the same when migrating to ScyllaDB?
1 comments

> We performed automated data validation by sending a small percentage of reads to both databases and comparing results, and everything looked great. The cluster held up well with full production traffic, whereas Cassandra was suffering increasingly frequent latency issues.

Cassandra was still suffering even with the new data services.

>We elect to engage in some meme-driven engineering and rewrite the data migrator in Rust

Could you guys elaborate more on the data migrator by any chance?

The spark migrator was needlessly complicated in pursuit of being general. It is also much slower.

Ours is quite simple! We take the token range (int64 min -> int64 max), and sub-divide that into N equally sized ranges within the range that span the entire range. We use sqllite to store a database of "work to do" this contains the token ranges, some stats about work done so far, status, and last pagination token (this lets us pause/resume the migration.) We run N many tokio tasks that then just pluck work from the sqllite database, and update the database as work progresses, occasionally snapshotting the pagination token. Each task will insert to the destination database with a configurable amount of concurrency, however will only work on one token range at a time. Each task gets its own "connection" that is not shared with the other tasks to minimize contention. Distributing this workload across machines is very simple. We have a command called "sunder" which just subdivides the sqllite database into N many "shards" that contain 1/N the rows of the original database. We can then just copy these shards to other machines and start the migrator on them. We just spun up 5 chonky VMs on GCP and sundered the database, copied the binary and ran it in screen.

Since we already have the types of the rows described in Rust, we simply just plug the relevant types into the migrator. Everything just works, and you're using the exact same code that's reading/writing the data in the live service to also migrate the data (which further reduces the possibility of bugs). Since the migrator is built on-top of our data service library, we also get all the metrics/telemetry built in as well, so we can monitor the performance of the migration in great detail.

Thanks for the detailed reply, really appreciate it! This sounds super interesting :)
Yes, I'd love to read more about how it improves on the existing migrator.