Hacker News new | ask | show | jobs
by manyturtles 434 days ago
About a decade and a half ago I worked on a large data migration project at a FAANG. Multi-exabyte scale, many clusters across many countries. Once everyone was moved the old storage platform wasn't completely empty, because the number of migrations was large and users were (naturally) more focused on ensuring their data was in place and available on the target platform rather than ensuring every last thing was deleted on the legacy platform. We weren't initially concerned about it because it would all get deleted when we turned down the old setup.

As we were gearing up to declare victory and start turning down the several dozen legacy storage clusters someone mused that given some users were subject to litigation holds -- not allowed to delete any data -- that at least some of the leftover data on the old system might be subject to litigation hold, and we'd need to figure that out before we could delete it or incur legal risk. IIRC the leftover 'junk' data amounted to a few dozen petabytes spread across multiple clusters around the world, in different jurisdictions. We spent several months talking with the lawyers figuring that out. It was an interesting dance, because on the one hand we were quite confident that there was unlikely to be anything in the leftovers which was both meaningful and not migrated to the new platform, while on the other hand explaining that it wasn't practical to just "go and look" through a few dozen PB of data. I recall we ended up somewhere in between, coming up with ways to distinguish categories of data like caches and working data from various pipelines. It added over six months to the project, but was quite an interesting problem to work through that hadn't occurred to any of us earlier on, as we were thinking entirely in technical terms about infrastructure migration.

1 comments

That does sound very interesting. Any insights on what would you do differently if you had to do it again? Any way to accelerate things now that you know the pain or do you think it's quite unavoidable and "legal Time"?
Doing it in parallel, which of course is only an option if you know about it. And templating that into a general approach and having legal folks sign off on that.