by carreau 991 days ago
As a curiosity, what would it entail to make the two tgz files byte-for-byte identical? There was/is some discussion in setuptools about how to normalize the tarball (https://github.com/pypa/setuptools/issues/2133#issuecomment-...); could something similar be applied to building Python itself?
4 comments

The suggestion there (uid = gid = 1000; uname = user; gname = users) isn't great.

Just use uid = gid = 0, and omit uname/gname.

If you're distributing software via a tarball, the uid/gid bits are meaningless. They only make sense when you archive / backup a directory and plan to extract on the same system.

If you set them to anything other than 0, then when the tarball is extracted as root, ownership may be changed to the uid/gid recorded in the tarinfo, provided those exist on the system. That's a lot of fun!

Python itself in fact tries to chown files when extracting a tarfile (under sudo).

If you set uid = gid = 0, then at least when extracting as root, the files remain owned by root.
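
To make that concrete, here is a minimal sketch using Python's standard `tarfile` module. It copies an existing uncompressed tar and resets the owner fields as suggested above. The names `strip_ownership` and `rewrite_tar` are made up for illustration, and this only touches ownership; timestamps, member order and the gzip layer would need their own handling.

```python
import io
import tarfile


def strip_ownership(info: tarfile.TarInfo) -> tarfile.TarInfo:
    """Reset owner fields so they no longer depend on who built the archive."""
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    return info


def rewrite_tar(data: bytes) -> bytes:
    """Copy every member of an uncompressed tar, stripping ownership."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as src:
        # GNU format is an arbitrary choice here; it keeps pax extended
        # headers (which could carry owner names) out of the output.
        with tarfile.open(fileobj=out, mode="w", format=tarfile.GNU_FORMAT) as dst:
            for member in src:
                # Only regular files carry a payload; directories,
                # symlinks and the like are header-only.
                payload = src.extractfile(member) if member.isreg() else None
                dst.addfile(strip_ownership(member), payload)
    return out.getvalue()
```

If two tarballs differ only in ownership, running both through `rewrite_tar` should give identical bytes.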

Thanks for the advice; I assume you are the one who commented on the upstream issue. This shows it is not trivial, and it would be nice if it were done automatically by default.
I believe the only differences were uid/gid and username/groupname values between the two tarballs. One had the information of Thomas Wouters, the release manager of 3.12, and the other had generic GitHub Action usernames/groups.

Normalizing these values to something known like 0/0 would have done the trick.
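
For anyone who wants to check this kind of thing themselves, a rough sketch of a header comparison with Python's `tarfile` follows. `header_diff` and the `FIELDS` tuple are hypothetical names, not from any existing tool, and the sketch only compares members present in both archives.

```python
import io
import tarfile

# Header fields worth comparing; the selection is an assumption.
FIELDS = ("mode", "uid", "gid", "uname", "gname", "size", "mtime", "type", "linkname")


def header_diff(a: bytes, b: bytes) -> list:
    """Return (member name, field, value in a, value in b) for each mismatch."""
    # The default mode detects gzip/bz2/xz compression transparently.
    with tarfile.open(fileobj=io.BytesIO(a)) as ta:
        members_a = {m.name: m for m in ta.getmembers()}
    with tarfile.open(fileobj=io.BytesIO(b)) as tb:
        members_b = {m.name: m for m in tb.getmembers()}

    diffs = []
    for name in sorted(members_a.keys() & members_b.keys()):
        for field in FIELDS:
            va = getattr(members_a[name], field)
            vb = getattr(members_b[name], field)
            if va != vb:
                diffs.append((name, field, va, vb))
    return diffs
```

An empty result means the headers agree on those fields; file contents would still need a separate comparison.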

Thanks for the article and taking the time to reply here.
> As a curiosity, what would it entail to make the two tgz byte-for-byte identical ?

It can't be that complicated. The tarballs autogenerated by GitHub (using `git archive`) were byte-for-byte identical for years, until GitHub upgraded git and things broke because entire ecosystems had started to rely on that [1].

[1] https://news.ycombinator.com/item?id=34586917

I suspect you're looking for pristine-tar(1)?

https://manpages.debian.org/stretch/pristine-tar/pristine-ta...

It's intended to solve exactly this problem, but in reverse -- a tarball is extracted to source, and we want to ensure that the sources we've extracted can be traced back to the original tarball.

Hum, that is interesting. I'm thinking more that in a perfect world the pristine-tar delta file would be empty (assuming I understand correctly what pristine-tar is doing).

For example, I tend to set SOURCE_DATE_EPOCH to the timestamp of the commit, so that anything that embeds a time is reproducible without extra instructions or a manual, file-specific process.
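
As an illustration of that idea, here is a small sketch in Python. It reads SOURCE_DATE_EPOCH and pins the timestamp in the gzip header to it, so compressing the same bytes twice gives the same output. `build_timestamp` and `gzip_reproducibly` are hypothetical helper names. One way to get the commit timestamp would be `git log -1 --format=%ct`, exported as SOURCE_DATE_EPOCH before the build.

```python
import gzip
import io
import os
import time


def build_timestamp(environ=os.environ) -> int:
    """Use SOURCE_DATE_EPOCH when set, else fall back to the current time."""
    value = environ.get("SOURCE_DATE_EPOCH")
    return int(value) if value is not None else int(time.time())


def gzip_reproducibly(data: bytes, environ=os.environ) -> bytes:
    """Compress data with the gzip header timestamp pinned to the build timestamp."""
    out = io.BytesIO()
    # filename="" keeps any file name out of the gzip header;
    # mtime replaces the wall-clock time gzip would otherwise record.
    with gzip.GzipFile(filename="", fileobj=out, mode="wb",
                       mtime=build_timestamp(environ)) as gz:
        gz.write(data)
    return out.getvalue()
```

The same `build_timestamp()` value could also be assigned to each tar member's `mtime` before the archive is compressed.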