by nopurpose 71 days ago
> Last year, we rolled out a new service that changed how data is placed across Magic Pocket. The change reduced write amplification for background writes, so each write triggered fewer backend storage operations. But it also had an unintended side effect: fragmentation increased, pushing storage overhead higher. Most of that growth came from a small number of severely under-filled volumes that consumed a disproportionate share of raw capacity

I thought big corps with huge infrastructure bills meticulously modeled changes like this against the production data they have, so that the exact effect on every metric they care about is known upfront. Turns out they are like me: deploy and see what breaks.

1 comment

Author here :) We did have high-level metrics and expectations for how this change would behave, but a couple of factors that were happening in parallel made it much harder to reason about in practice.

Data in these systems moves slowly and with a lot of inertia, so the effects show up gradually and can lag behind the change itself. On top of that, the impact wasn’t uniform. Most of the overhead came from a small subset of volumes, so it took time to isolate what was actually driving the increase. These systems are hard to test at scale!
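
To make the "small subset of volumes" point concrete, here is a toy sketch of the arithmetic. It is not Magic Pocket's actual tooling: the `Volume` model, the function names, and the fleet numbers are all made up for illustration, and it assumes each volume reserves a fixed amount of raw capacity no matter how full it is.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    # Hypothetical model: a volume reserves capacity_bytes of raw storage
    # regardless of how much live data it currently holds.
    capacity_bytes: int
    live_bytes: int


def storage_overhead(volumes):
    """Raw capacity consumed divided by live data stored (1.0 would be ideal)."""
    raw = sum(v.capacity_bytes for v in volumes)
    live = sum(v.live_bytes for v in volumes)
    return raw / live


def excess_share(volumes, fill_threshold):
    """Fraction of all wasted capacity held by volumes filled below fill_threshold."""
    def wasted(v):
        return v.capacity_bytes - v.live_bytes

    total = sum(wasted(v) for v in volumes)
    under = sum(
        wasted(v)
        for v in volumes
        if v.live_bytes / v.capacity_bytes < fill_threshold
    )
    return under / total


# Made-up fleet: 95 volumes at 90% fill plus 5 volumes at 10% fill.
fleet = [Volume(100, 90)] * 95 + [Volume(100, 10)] * 5

print(round(storage_overhead(fleet), 3))   # 1.163
print(round(excess_share(fleet, 0.5), 3))  # 0.321
```

In this invented fleet, 5% of the volumes account for about 32% of the wasted capacity, which is why a fleet-wide average can hide the real driver until you break the overhead down per volume.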