by LaserToy 1354 days ago
Can it be used for large and fast changing datasets?

Example: 100 TB, with writes every 10 minutes.

Or 1 TB of Parquet, 40% of which is rewritten daily.

2 comments

DVC is expressly for tracking artifacts that are files on disk, and only by comparing their MD5 hashes. So it can definitely track the parquet files, but you are not going to get row or field diffs or anything like that.
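
To make the whole-file point concrete, here is a minimal Python sketch. It is not DVC's actual code: `file_md5` is a hypothetical helper, and a small text file stands in for a Parquet file. Changing one record changes the entire digest, and the digest alone says nothing about which record changed.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Hex MD5 digest of a file, read in chunks (hypothetical helper, not DVC's API)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo():
    """Hash a file, change a single record, and hash it again."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.txt")  # stand-in for a Parquet file
        with open(path, "w") as f:
            f.write("id,value\n1,a\n2,b\n")
        before = file_md5(path)
        with open(path, "w") as f:
            f.write("id,value\n1,a\n2,c\n")  # only one field differs
        after = file_md5(path)
        return before, after


before, after = demo()
print("file changed:", before != after)
```

All a tool working this way can report is "this file is different now", which is why row or field diffs need something else.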

Maybe Pachyderm or Dolt would be better tools here.

Why would you use MD5 in anything written in the last 5 years? The SHA family is faster on modern hardware and there aren't trivial collisions floating around out there.
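
Which hash is faster depends on the CPU and on how the local Python/OpenSSL build was compiled, so it is worth measuring rather than assuming. A rough sketch using only the standard library; `throughput_mb_s` is a hypothetical helper and the 16 MiB payload is an arbitrary choice:

```python
import hashlib
import time


def throughput_mb_s(algo, payload, repeats=5):
    """Best-of-N hashing throughput in MB/s for one algorithm (hypothetical helper)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hashlib.new(algo, payload).hexdigest()
        best = min(best, time.perf_counter() - start)
    return len(payload) / best / 1e6


payload = b"\x00" * (16 * 1024 * 1024)  # 16 MiB of zeros, arbitrary test input
results = {algo: throughput_mb_s(algo, payload) for algo in ("md5", "sha1", "sha256")}
for algo, rate in results.items():
    print(f"{algo}: {rate:.0f} MB/s")
```

The numbers will differ from machine to machine, so treat the output as a local measurement only.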
It was definitely a bad choice. I wasn't there so I can only speculate. My guess is that it is ubiquitous and thus low-hanging fruit, and the devs didn't know better; or, the related corollary, it's what S3 uses for ETags, so it probably seemed logical. Either way, it seems someone chose it without knowing better, no one agrees on a fix or whether a change is even necessary, and so it's stuck for now.

There's an ongoing discussion about replacing/configuring the hash function, and it looks like there might be some movement toward replacing the hash and other speedups in 3.0.

https://github.com/iterative/dvc/issues/3069

> We not only want to switch to a different algorithm in 3.0, but to also provide better performance/ui/architecture/ecosystem for data management, and all of that while not seizing releases with new features (experiments, dvc machine, plots, etc) and bug fixes for 2.0, so we've been gradually rebuilding that and will likely be ready for 3.0 in the upcoming months. - https://github.com/iterative/dvc/issues/3069#issuecomment-93...

Don't quote me on the specific hash algorithm, maybe it's SHA. Point is that it's just comparing modification times and hashes.
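
A rough sketch of that kind of check, assuming a cheap mtime/size comparison first and a content hash only when those differ. The names `record` and `has_changed` are made up for illustration, the algorithm is a parameter because I'm not sure which one is used, and none of this is DVC's real code:

```python
import hashlib
import os
import tempfile


def content_hash(path, algo="md5"):
    """Hex digest of the file contents; the algorithm is an assumption and swappable."""
    with open(path, "rb") as f:
        return hashlib.new(algo, f.read()).hexdigest()


def record(path):
    """Snapshot of what we remember about a file (hypothetical structure)."""
    st = os.stat(path)
    return {"mtime_ns": st.st_mtime_ns, "size": st.st_size, "hash": content_hash(path)}


def has_changed(path, recorded):
    """Cheap check first: same mtime and size means we assume it is unchanged."""
    st = os.stat(path)
    if st.st_mtime_ns == recorded["mtime_ns"] and st.st_size == recorded["size"]:
        return False
    # Metadata differs, so fall back to comparing content hashes.
    return content_hash(path) != recorded["hash"]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "part-0.bin")
    with open(path, "wb") as f:
        f.write(b"abc")
    rec = record(path)
    unchanged = has_changed(path, rec)
    # Bump the mtime by one second without touching the contents.
    os.utime(path, ns=(rec["mtime_ns"], rec["mtime_ns"] + 10**9))
    touched = has_changed(path, rec)
    with open(path, "wb") as f:
        f.write(b"abcd")
    rewritten = has_changed(path, rec)

print(unchanged, touched, rewritten)
```

A file that was only touched still hashes the same, so it counts as unchanged; a rewritten file counts as changed in full, however small the edit.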
What about Apache Iceberg for those?