Sorry, I should have been more specific: I meant block deduplication, or any form of deduplication at a level lower than the entire file. File-level deduplication can only get you so far, depending on the use case. XetHub does block deduplication, whereas I am implementing data-level deduplication. It is slower at recreating dataset snapshots (though that work can be parallelized and delegated), but it saves disk space when changes are small but frequent, and it can be tied to collaborative features: showing diffs, commenting on them, and reverting or editing changes where needed, all while pointing clearly to specific commits. It could also potentially support forking data or cumulative changes.
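To make the file-level vs. block-level distinction concrete, here is a minimal sketch of fixed-size block deduplication. It is only an illustration, not how XetHub or my own tool actually works: `BLOCK_SIZE`, `store_file`, and `restore_file` are made-up names, the block size is tiny so the example is readable, and a plain dict stands in for the block store.

```python
import hashlib

# Hypothetical block size, kept tiny so the example is easy to follow.
BLOCK_SIZE = 4


def store_file(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks, keep each unique block once,
    and return the list of block hashes needed to rebuild the file."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # only stored if not seen before
        recipe.append(digest)
    return recipe


def restore_file(recipe: list, store: dict) -> bytes:
    """Rebuild a file version from its list of block hashes."""
    return b"".join(store[digest] for digest in recipe)


store = {}
v1 = b"aaaabbbbccccdddd"
v2 = b"aaaabbbbXXXXdddd"  # same file with one block changed

recipe_v1 = store_file(v1, store)
recipe_v2 = store_file(v2, store)

# File-level dedup would keep both versions in full (32 bytes),
# because the two files differ. Block-level keeps only unique blocks.
stored_bytes = sum(len(block) for block in store.values())
print(len(store), stored_bytes)
```

Two versions that differ in one block share the other three, so only one extra block is stored for the second version. Fixed-size blocks are the simplest variant; an insertion that shifts the rest of the file changes every following block, which is why real systems often use content-defined chunking instead.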

Yes, I meant either checking out other branches locally or, in the general case, pointing to another branch to tell any services to make data from that branch available wherever it's consumed. I am assuming that each incoming new file is then added to data pipelines, possibly just a few. It sounds like you are in the sweet spot: you have the speed you want and, given infrequent changes, you are fine with the versions taking up terabytes on Azure, since they are mostly new data.