by deckar01 800 days ago
I don’t understand the use case. You go through the trouble of generating checksums when copying videos, but don’t want to regenerate the checksums when modifying the metadata? If you are this concerned about data corruption why not check the metadata also?

Author here. As surprised as anyone that this is on the front page of HN.

> I don’t understand the use case. You go through the trouble of generating checksums when copying videos, but don’t want to regenerate the checksums when modifying the metadata?

Appreciate the Q, but I suppose I really don't understand it. Could be the hour?

I don't want to regenerate checksums once I know the underlying bitstream checksums are correct. I want to know the audio/video/whatever is the same as the day I received it, and I want to perform the exact same check to confirm. If I change the metadata and then have to regenerate a checksum, I no longer know that.
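To make the distinction concrete, here is a minimal sketch using a made-up toy container (a length-prefixed JSON header followed by the payload). The format and the names `pack`, `unpack`, `file_checksum`, and `stream_checksum` are all hypothetical and only for illustration; real containers such as MKV or MP4 are far more involved, and this is not how any particular tool does it.

```python
import hashlib
import json
import struct


def pack(tags: dict, payload: bytes) -> bytes:
    """Toy container: 4-byte big-endian header length, JSON tags, then the payload."""
    header = json.dumps(tags, sort_keys=True).encode("utf-8")
    return struct.pack(">I", len(header)) + header + payload


def unpack(blob: bytes) -> tuple:
    """Split a toy container back into (tags, payload)."""
    (header_len,) = struct.unpack(">I", blob[:4])
    tags = json.loads(blob[4:4 + header_len].decode("utf-8"))
    return tags, blob[4 + header_len:]


def file_checksum(blob: bytes) -> str:
    """Checksum of the whole file: changes whenever the tags change."""
    return hashlib.sha256(blob).hexdigest()


def stream_checksum(blob: bytes) -> str:
    """Checksum of the payload only: unaffected by retagging."""
    _, payload = unpack(blob)
    return hashlib.sha256(payload).hexdigest()


# Stand-in bytes for an encoded audio stream (not real audio).
payload = b"\x00\x01\x02 pretend these are audio frames"

original = pack({"title": "Track 1"}, payload)
retagged = pack({"title": "Track 1 (cleaned up)", "artist": "Someone"}, payload)

print(file_checksum(original) == file_checksum(retagged))      # whole-file hashes differ
print(stream_checksum(original) == stream_checksum(retagged))  # payload hashes match
```

After a retag, the whole-file hash has to be regenerated, so it can no longer vouch for the data I originally received. The payload hash can still be checked against the value recorded on day one.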

> If you are this concerned about data corruption why not check the metadata also?

One should of course. Please use ZFS, etc. There is perhaps no greater ZFS fan than me. See: https://github.com/kimono-koans/httm

But now imagine rewriting a stream to a different container. For instance, MP4 to MKV, or ALAC to FLAC. Wouldn't it be nice to know the bitstreams are the same?
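One rough way to try this by hand, sketched under assumptions: it relies on the local ffmpeg build including the `streamhash` muxer and on that muxer printing lines of the form `index,type,ALGO=hex`. The helper names (`streamhash_command`, `hash_streams`, `parse_streamhash`, `same_streams`) are hypothetical, and this is not a description of how the tool under discussion works.

```python
import subprocess


def streamhash_command(path: str, algo: str = "sha256") -> list:
    """Build an ffmpeg command that copies packets (no re-encode) into the
    streamhash muxer and prints one hash per selected stream to stdout.
    Assumes an ffmpeg build with the streamhash muxer; ffmpeg's default
    stream selection applies because no -map option is given."""
    return ["ffmpeg", "-v", "error", "-i", path,
            "-c", "copy", "-f", "streamhash", "-hash", algo, "-"]


def hash_streams(path: str, algo: str = "sha256") -> dict:
    """Run ffmpeg on a file and return its parsed per-stream hashes."""
    result = subprocess.run(streamhash_command(path, algo),
                            capture_output=True, text=True, check=True)
    return parse_streamhash(result.stdout)


def parse_streamhash(text: str) -> dict:
    """Parse lines assumed to look like '0,v,SHA256=<hex>' into
    {(stream_index, stream_type): (algorithm, digest)}."""
    hashes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        index, kind, digest = line.split(",", 2)
        algo, _, value = digest.partition("=")
        hashes[(int(index), kind)] = (algo, value)
    return hashes


def same_streams(a: dict, b: dict) -> bool:
    """Compare two hash sets while ignoring stream order, since a remux
    may renumber the streams."""
    def flatten(hashes):
        return sorted((kind, algo, value)
                      for (_, kind), (algo, value) in hashes.items())
    return flatten(a) == flatten(b)
```

Usage would be along the lines of `same_streams(hash_streams("album.m4a"), hash_streams("album.mka"))`, with both file names made up here. One caveat: some container changes may repackage the packets themselves, so a mismatch from this kind of check is not necessarily corruption.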

I hope it was a pleasant surprise. I found this from a data archivist's perspective. I can't believe that only FLAC had the foresight to checksum large binary data in the media codec space.

I notice that LLM releases will include md5/sha256 for the binary data, while excluding the json metadata. I really wanted MKV to have this functionality.

> I hope it was a pleasant surprise

Of course, very pleasant!

Is the idea that there's some inherent mistrust of `-c copy` or that sometimes downstream options affect it basically invalidating it?

Edit: I see the metadata benefit in the README, just curious if there's some additionally pessimistic perspective.

> Is the idea that there's some inherent mistrust of `-c copy` or that sometimes downstream options affect it basically invalidating it?

Yes, that's one reason.

I suppose the main mistrust along those lines is -- I have all these programs which manipulate my media metadata and sometimes change the names or locations of my media files. And I'm basically fine with lots of small automated changes to my metadata from programs like `beets`. I'd just like some assurance that whatever they spit out is what I started with.

With respect to metadata more specifically, if someone cleans up the metadata on an album or adds additional information, or album art, this shouldn't invalidate any checksum.

Network transfers of media could certainly benefit from this. If I send an ALAC album to someone, and they open it 3 months later, they should be able to know that what I sent is what they are listening to, even after they have retagged it.

My perspective: checksumming is more useful on large binary data, whereas tools to check metadata/container corruption already exist[1].

This allows you to change metadata or the container entirely, while still being able to check if e.g. the H.264 video stream is okay.

[1]: https://www.matroska.org/downloads/mkvalidator.html

You want to be able to change the container while making sure you do not alter the contained stream.

I've always thought it would be simpler if we used different files for the stream and the metadata, but that's probably just because I never looked more closely into it.

From this perspective it may be, but now you have multiple files you need to keep track of, and it’s not clear whether one is missing, depending on the underlying stream structure (i.e. multiple audio or video streams instead of just one of each).