That 400TB in the image is a large database! I'm guessing that's not the largest in the PlanetScale fleet either. Very impressive, and a reminder that you're strongly differentiated against some of the recent database upstarts in terms of battle-tested, mission-critical scale. Out of curiosity, how many of these large clusters are using your true managed 'as a service' offering, or are they mostly in the bring-your-own-cloud mode? Do you offer zero-downtime migrations from bring your own cloud to true as a service?
That particular cluster has grown significantly since the post was written, and yes there are now quite a few others that are challenging it for the "largest" claim. :-)
These larger ones are fully using the PlanetScale SaaS, but they are using Managed -- meaning that there are resources dedicated to and owned by them. You can read more about that here: https://planetscale.com/docs/vitess/managed
> you can run an initial VDiff, and then resume that one as you get closer to the cutover point.
VDiff (v2) only compares the source and destination at a specific point in time with resume only comparing rows with PK higher than the last one compared before it was paused. I assume this means:
1. VDiff doesn't catch updates to rows with PK lower than the point it was paused which could have become corrupt, and
2. VDiff doesn't continuously validate CDC changes, meaning (unless you enforce extra downtime to run / resume a VDiff) you can never be 100% sure your data is valid before SwitchTraffic
I'm curious if this is something customers even care about, or is point-in-time data validation sufficient to catch any issues that could occur during migrations?
You are correct about resuming. If you do an initial VDiff and then resume that same VDiff say 1 month later it would only diff rows with a higher PK value.
But there's also nothing stopping you from doing a new VDiff to cover all data at that later point in time.
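To make the resume-vs-new distinction concrete, here is a toy model of the behavior described above. It is not Vitess code: `diff_rows` and the dict-based "tables" are hypothetical, and the only thing it illustrates is that a resumed diff skips PKs at or below the last one compared, while a fresh diff revisits everything.

```python
# Toy model only: NOT Vitess code. diff_rows and the dict "tables" are
# hypothetical stand-ins used to illustrate resume vs. a fresh diff.

def diff_rows(source, target, after_pk=None):
    """Compare rows in PK order, optionally skipping PKs <= after_pk.

    Returns (mismatched_pks, last_pk_compared).
    """
    mismatched = []
    last_pk = after_pk
    for pk in sorted(set(source) | set(target)):
        if after_pk is not None and pk <= after_pk:
            continue  # a resumed diff never revisits these rows
        if source.get(pk) != target.get(pk):
            mismatched.append(pk)
        last_pk = pk
    return mismatched, last_pk


source = {1: "a", 2: "b", 3: "c"}
target = {1: "a", 2: "b", 3: "c"}

# Initial diff: everything matches; remember the last PK compared.
first_mismatches, checkpoint = diff_rows(source, target)

# Later: a new row lands on both sides, and row 2 diverges on the target.
source[4] = "d"
target[4] = "d"
target[2] = "corrupt"

# Resuming only looks at PKs above the checkpoint, so row 2 is missed.
resumed_mismatches, _ = diff_rows(source, target, after_pk=checkpoint)

# A brand-new diff covers all rows and catches it.
fresh_mismatches, _ = diff_rows(source, target)
```

In this sketch the resumed run reports nothing while the fresh run flags PK 2, which is the trade-off being discussed: resuming is cheaper, a new diff is complete as of the moment it runs.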
"But there's also nothing stopping you from doing a new VDiff to cover all data at that later point in time." --- isn't this just pushing the same issue forward in time? How is data consistency maintained if a customer reverts back to original while having served a few request from new one already?
It's open source. If you really want to know these things, I would encourage you to look at the code and read the documentation. As noted in the blog post, reverse vreplication is setup when you switch. You can switch back and forth and nothing is lost.
"isn't this just pushing the same issue forward in time?" I don't understand what you are trying to say here. You can only compare the two sides / databases at the same logical point in time. While you are doing this comparison at that point in time, the timeline continues to progress. Unless you want to stop the world and prevent writes for the full duration of the diff (which can be days or even weeks).
I think it's still the same issue where data modified after the VDiff point in time isn't validated before SwitchTraffic. I'm mostly curious how vitess users handle this case, or if any users even care about about this case in the first place?
Is there no demand for continuous data validation similar to what TiDB offers?
Do people who care about 100% correct data validation just accept the downtime required to run a full VDiff before SwitchTraffic?
Enterprise-grade NVMe SSDs typically cost around $150/TB. For an RF of 3, this comes to around 400 x 3 x 150 = 180K USD. With a minimum 5-year lifecycle for these enterprise SSDs, we are looking at 36K USD/year.
Going through their pricing (https://planetscale.com/pricing?engine=vitess&cluster=M-5120...), for just 15TB of storage with RF=3, the pricing comes to around 24,000 USD/MONTH, not year. Adjusted for 400TB and per year, this becomes about 7.7 million USD. Of course, you also get a lot more, but the difference is just insane.
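For anyone who wants to check the arithmetic, a quick sketch using only the figures quoted in this comment. The linear scaling from 15TB to 400TB is the commenter's assumption, not a real quote, and the variable names are just for illustration.

```python
# Figures are the ones quoted in the comment; none are official prices.
TB_STORED = 400
REPLICATION_FACTOR = 3
SSD_USD_PER_TB = 150
SSD_LIFETIME_YEARS = 5

# Raw drive cost, amortized over the assumed drive lifetime.
raw_ssd_total = TB_STORED * REPLICATION_FACTOR * SSD_USD_PER_TB
raw_ssd_per_year = raw_ssd_total / SSD_LIFETIME_YEARS

# Managed price, scaled linearly from the 15TB figure (an assumption).
QUOTED_TB = 15
QUOTED_USD_PER_MONTH = 24_000
managed_per_year = QUOTED_USD_PER_MONTH * 12 * TB_STORED / QUOTED_TB

ratio = managed_per_year / raw_ssd_per_year

print(f"raw SSDs: ${raw_ssd_total:,} total, ${raw_ssd_per_year:,.0f}/year")
print(f"managed (linear scaling): ${managed_per_year:,.0f}/year")
print(f"ratio: about {ratio:.0f}x")
```

That works out to 180K USD of drives (36K/year) against roughly 7.68M USD/year under the linear-scaling assumption.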
That comparison doesn't make any sense at all, and you can't excuse it by tossing out "Of course, you also get a lot more". This is like evaluating the price of wheels by buying entire cars. You wouldn't get dozens of these servers just for capacity, you'd get a custom quote.
That said, at $24K a month you could pay off an entire server like that from Dell in 4 months, despite Dell charging something stupid like $2000/TB.
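A rough check of that payoff estimate, under one reading of it: the "server like that" is taken to mean 15TB at RF=3 (45TB of drives) priced at $2000/TB. That interpretation is an assumption, and the names below are illustrative.

```python
# Assumed reading: 15TB usable x RF 3 = 45TB of drives at $2000/TB.
QUOTED_TB = 15
REPLICATION_FACTOR = 3
DELL_USD_PER_TB = 2_000
MANAGED_USD_PER_MONTH = 24_000

hardware_cost = QUOTED_TB * REPLICATION_FACTOR * DELL_USD_PER_TB
months_to_pay_off = hardware_cost / MANAGED_USD_PER_MONTH

print(f"hardware: ${hardware_cost:,}, paid off in {months_to_pay_off} months")
```

Under that reading the hardware is about 90K USD, or a little under 4 months of the quoted monthly price.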
Your numbers are basically fine for what you're measuring, if you round up to factor in actually having servers to put the storage drives into. So 40-50k instead of 36k.
The issue is your budget is for 400TB of data but minimal requests per second. That's a valid thing to consider, but it's extremely apples and oranges to a fleet of 75 high-powered servers.
To put it a different way, their prices are pretty high but the calculation of powerful servers costing 40x as much as raw storage isn't "insane".