by hlandau 674 days ago

Personally I've always considered it bad hygiene to commit generated outputs, but this article notes that this takes on a new significance in the light of supply chain security concerns. Good changes from PostgreSQL here.

Generated output, vendored source trees, etc. aren't, or can't be, meaningfully audited as part of a code review process, so they're basically merged without real audit or verification.

My personal preference is never to include generated output in a repository or tarball, including e.g. autoconf/automake scripts. This is directly contrary to the advice of the autotools documentation, which wants people to ship these unauditably gargantuan and obtuse generated scripts as part of tarballs... an approach which created an ideal space for things like the XZ backdoor.

8 comments

nrabulinski 674 days ago

My take is that they should always be committed, but never generated by the dev, instead generated and pushed when necessary by CI. The problem with generating those files yourself is that, in many cases, it makes the output nondeterministic and nonreproducible. In an ideal world those tools would just generate those files deterministically, but until then, for me, committing them from CI is an acceptable stopgap.

koolba 673 days ago

My preference is to do both. Have them generated by a dev, committed, and also generated in CI. The latter gets compared with the checked in contents to ensure the results match the expected value.

This speeds up CI (the generation path can be done in parallel) and most local development.

The one catch is that it relies on mostly trusting whoever has a commit bit. But if you don’t have that and any part of the build involves scripts that are part of the repo itself, then you’ve already lost.
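A minimal sketch of the "generate in CI and compare against what is checked in" idea, assuming the generator can write into an arbitrary directory. The `generate` callback and the function names are hypothetical and stand in for whatever project-specific generation step exists; nothing here is taken from PostgreSQL's actual tooling.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def find_mismatches(committed_dir: Path, generate: Callable[[Path], None]) -> list[str]:
    """Regenerate into a scratch directory and list the paths that differ.

    `generate` is a hypothetical project-specific callback that writes the
    generated files into the directory it is given.
    """
    with tempfile.TemporaryDirectory() as scratch:
        fresh_dir = Path(scratch)
        generate(fresh_dir)
        committed = file_digests(committed_dir)
        fresh = file_digests(fresh_dir)
    # A path is a mismatch if it is missing on either side or its digest differs.
    return sorted(
        name
        for name in committed.keys() | fresh.keys()
        if committed.get(name) != fresh.get(name)
    )
```

A CI job could fail whenever `find_mismatches` returns a non-empty list. The comparison is byte-for-byte, so it only works if the generation step is deterministic.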

giancarlostoro 673 days ago

> The one catch is that it relies on mostly trusting whoever has a commit bit.

Would the comparison not show that the person you're trusting goofed or is being malicious?

rangerelf 673 days ago

In either case it would prompt closer examination.

If the dev goofed, then good thing it got caught.

If the dev is not trustworthy, then you have evidence of such untrustworthiness.

gjvc 673 days ago

> My preference is to do both. Have them generated by a dev, committed, and also generated in CI. The latter gets compared with the checked in contents to ensure the results match the expected value.

Bingo. This is what I am working towards convincing people to adopt at my current job. It's a long road.

anbotero 673 days ago

Would you happen to know of a documented workflow, or blog posts that present solutions like this?

I would be very interested in seeing how other people are doing it.

Thanks!

koolba 673 days ago

The generation routine bits would be highly specific to the project, but the final check in CI is as simple as checking the git diff/status of the generated targets to see if they match the ref. Any deviation indicates that it’s been missed by the patch submitter (likely inadvertently in the case of honest actors).

The real work is being able to transform the generation task into a reproducible step that can be run consistently anywhere. Containerizing those steps can help but it’s not strictly required, nor is it enough if the “inputs” include a non-seeded random number generator or the current time.
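The final CI check described above could look roughly like the following sketch. It assumes the generation step has already run in the checkout and that the generated files are tracked (not ignored) by git. The function names are hypothetical; the only git invocation used is `git status --porcelain -- <paths>`.

```python
import subprocess
import sys


def changed_paths(porcelain: str) -> list[str]:
    """Extract paths from `git status --porcelain` (v1) output.

    Each line is two status characters, a space, then the path. Renames are
    reported as "old -> new"; the new path is kept. Paths that git quotes
    because of special characters are not unquoted in this sketch.
    """
    paths = []
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue
        path = line[3:]
        if " -> " in path:
            path = path.split(" -> ", 1)[1]
        paths.append(path)
    return paths


def check_generated(targets: list[str]) -> int:
    """Return 0 if the generated targets match what is committed, else 1."""
    result = subprocess.run(
        ["git", "status", "--porcelain", "--", *targets],
        capture_output=True,
        text=True,
        check=True,
    )
    stale = changed_paths(result.stdout)
    for path in stale:
        print(f"generated file out of date: {path}", file=sys.stderr)
    return 1 if stale else 0
```

A CI step might run the generator and then call `sys.exit(check_generated(["src/generated"]))`, where `src/generated` is a placeholder for the project's real generated paths.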

mnahkies 673 days ago

I have a simple script that asserts a clean working directory here https://github.com/mnahkies/openapi-code-generator/blob/main... which I use to check generated output hasn't changed after running the generation step in CI.

It relies on your generated artifacts being deterministic, which is a design goal of that particular project so works fine there.

KptMarchewa 673 days ago

No, they should be generated by either dev or something like pre-commit and then checked if they match what's generated by CI.

And yes, those have to be deterministic with regards to inputs, it does not make sense otherwise.
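To illustrate what "deterministic with regards to inputs" can mean in practice, here is a small sketch of a generator that avoids two common sources of drift: iteration order and the wall-clock time. The `render` function and its header format are made up for illustration; `SOURCE_DATE_EPOCH` is the environment variable convention from the Reproducible Builds project.

```python
import json
import os
import time


def build_timestamp() -> int:
    """Use SOURCE_DATE_EPOCH when set, so the embedded time becomes an input."""
    return int(os.environ.get("SOURCE_DATE_EPOCH", time.time()))


def render(schema: dict, timestamp: int) -> str:
    """Render a hypothetical generated header from a name-to-value mapping."""
    lines = [f"/* generated at {timestamp} */"]
    # Sort the keys so the output does not depend on insertion order.
    for name in sorted(schema):
        lines.append(f"#define {name} {json.dumps(schema[name])}")
    return "\n".join(lines) + "\n"
```

With the timestamp passed in explicitly, two runs over the same schema and the same `SOURCE_DATE_EPOCH` produce identical bytes, which is what a CI comparison needs.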

EuAndreh 673 days ago

No unauditable generated code for me, either manually or automatically, thanks.

tjoff 673 days ago

Why would generated code be unauditable?

The inputs and the generation will obviously be defined.

EuAndreh 673 days ago

That's not the case for autotools output, or flex and bison output.

If the generated files are what you say, then just embed the generation step into the build system. A simple approach like that is easily made reproducible, and we avoid introducing noise into the repository.

tjoff 673 days ago

The blog post does explain why some of the generation is done separately. But yes, that is also a viable approach.

bonzini 673 days ago

> an approach which created an ideal space for things like the XZ backdoor.

That's not entirely correct. Indeed there was a part of the xz backdoor that lived in the configure script. However, that part was also included in the sources of the configure script as found in the tarball (and not in the git archive).

Thus regenerating the configure script didn't help, but regenerating the tarball did.

EuAndreh 673 days ago

In this case, I can say autotools's advice is outdated at best, and one shouldn't follow it.

It adds unneeded complexity.

bluGill 673 days ago

They do not commit generated files and never did (as far as I can tell). Their release process used to generate some files and place them into a distribution file, but that file was never committed anywhere.

miki123211 673 days ago

The same applies to refactorings unfortunately.

If you make a large but simple refactoring, like renaming a frequently-used function across a large repo, nobody is going to audit that diff and check for extra changes.

Things don't have to be this way. Google's source control system apparently has tools that can do such refactorings for you in a centralized fashion, and one could make something like that for git.

prpl 673 days ago

Going to the extreme of this though, I really really hate getting an autoconf project with no generated configure file. I don’t want to install the full autotools suite to do a build!

On the other hand, keeping tarballs close to the git tree makes it easy to reuse git archive and related GitHub features, provided the repo properly includes some kind of versioning information in tree.

vbezhenar 673 days ago

Linux software sources are in a weird spot between users and developers.

I, as a developer, organize sources in a way that makes it easy for another developer to work with them. My software will never be compiled by any user. All my users use build artifacts.

I might consider adding autogenerated code, but only when I'm like 99% sure that this code won't ever change. For example, that's the case for integration with many organizations where WSDLs are agreed upon once and then never touched. Having Java sources regenerated every build just adds a few seconds to every build without noticeable advantages.

The fact that some Linux users prefer to build software from source and at the same time do not want to install the necessary build tools is a bit of a strange situation.

Maybe containers should be better utilized for this workflow. For example, the developer supplies a Dockerfile which builds the software and then copies it to some directory. You run `docker build .` and then copy the binary files from the container to the host.

jeltz 673 days ago

PostgreSQL also supports Meson, which requires no generated files to be convenient.

cryptonector 673 days ago

Including autoconf outputs serves to avoid having to have autoconf installed. Because autoconf installs historically lagged behind what autoconf-using projects wanted, this used to be a problem. Nowadays it's not that big a deal.

As u/nrabulinski says, you can have the CI system generate and commit (with signed commits) autoconf artifacts.

EuAndreh 673 days ago

> Nowadays it's not that big a deal.

The same can be said about autotools itself :/

Historical and current use indeed vary, and in many cases using autotools itself isn't appropriate anymore.

malkia 673 days ago

Generated outputs, especially when they are source code (headers, etc.), are important to keep for debugging later.