by hlandau 674 days ago
Personally I've always considered it bad hygiene to commit generated outputs, but this article notes that this takes on a new significance in the light of supply chain security concerns. Good changes from PostgreSQL here.

Generated output, vendored source trees, etc. aren't, or can't be, meaningfully audited as part of a code review process, so they're basically merged without real audit or verification.

My personal preference is never to include generated output in a repository or tarball, including e.g. autoconf/automake scripts. This is directly contrary to the advice of the autotools documentation, which wants people to ship these unauditably gargantuan and obtuse generated scripts as part of tarballs... an approach which created an ideal space for things like the XZ backdoor.

8 comments

My take is that they should always be committed, but never generated by the dev; instead they should be generated and pushed by CI when necessary. The problem with generating those files yourself is that, in many cases, it makes the output nondeterministic and nonreproducible. In an ideal world those tools would just generate those files deterministically, but until then, committing them from CI is an acceptable stopgap for me.

My preference is to do both. Have them generated by a dev, committed, and also generated in CI. The latter gets compared with the checked-in contents to ensure the results match the expected value.

This speeds up CI (the generation path can be done in parallel) and most local development.

The one catch is that it relies on mostly trusting whoever has a commit bit. But if you don’t have that and any part of the build involves scripts that are part of the repo itself, then you’ve already lost.

> The one catch is that it relies on mostly trusting whoever has a commit bit.

Would the comparison not show that the person you're trusting goofed or is being malicious?

In either case it would prompt closer examination.

If the dev goofed, then good thing it got caught.

If the dev is not trustworthy, then you have evidence of such untrustworthiness.

> My preference is to do both. Have them generated by a dev, committed, and also generated in CI. The latter gets compared with the checked in contents to ensure the results match the expected value.

Bingo. This is what I am working towards convincing people to adopt at my current job. It's a long road.

Would you happen to know of a documented workflow, or blog posts that present solutions like this?

I would be very interested in seeing how other people are doing it.

Thanks!

The generation routine would be highly specific to the project, but the final check in CI is as simple as checking the git diff/status of the generated targets to see if they match the ref. Any deviation indicates that the patch submitter missed regenerating them (likely inadvertently, in the case of honest actors).
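A minimal sketch of that final check, assuming CI has already re-run the generation step in a git checkout. The function names and the idea of passing the generated targets as a list are made up for illustration; the only git call used is `git status --porcelain -- <paths>`.

```python
import subprocess


def changed_paths(porcelain: str) -> list[str]:
    """Extract paths from `git status --porcelain` (v1 format) output."""
    paths = []
    for line in porcelain.splitlines():
        # Each line is a two-character status, one space, then the path.
        if len(line) > 3:
            paths.append(line[3:])
    return paths


def stale_generated_files(targets: list[str]) -> list[str]:
    """Return generated paths that no longer match the committed ref."""
    result = subprocess.run(
        ["git", "status", "--porcelain", "--", *targets],
        check=True,
        capture_output=True,
        text=True,
    )
    return changed_paths(result.stdout)


def main(targets: list[str]) -> int:
    """Exit status for CI: 0 if generated targets are clean, 1 otherwise."""
    stale = stale_generated_files(targets)
    for path in stale:
        print(f"generated file out of date: {path}")
    return 1 if stale else 0
```

In a CI job this would run right after the generator, for example `sys.exit(main(["src/generated"]))`, where the path is a placeholder. Paths with unusual characters are quoted by git in this output format, so a more careful version would use the `-z` variant.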

The real work is being able to transform the generation task into a reproducible step that can be run consistently anywhere. Containerizing those steps can help, but it's not strictly required, nor is it enough if the "inputs" include an unseeded random number generator or the current time.
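To illustrate the "inputs" point, here is a small sketch of a generator whose output depends only on what it is given. It assumes the timestamp comes from the `SOURCE_DATE_EPOCH` environment variable (a convention from the reproducible-builds effort) rather than the wall clock, and falls back to epoch 0 when unset, which is my own choice for the example. The function names and the `#define` output are hypothetical.

```python
from collections.abc import Mapping
from datetime import datetime, timezone


def build_timestamp(env: Mapping[str, str]) -> str:
    """Take the timestamp from SOURCE_DATE_EPOCH, never from the wall clock."""
    # Assumption for this sketch: default to epoch 0 when the variable is unset.
    epoch = int(env.get("SOURCE_DATE_EPOCH", "0"))
    stamp = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%dT%H:%M:%SZ")


def render_constants(values: Mapping[str, int], env: Mapping[str, str]) -> str:
    """Render a header-like file; the output depends only on the arguments."""
    lines = [f"/* generated {build_timestamp(env)} */"]
    # Sort keys so the caller's insertion order cannot change the output.
    for name in sorted(values):
        lines.append(f"#define {name} {values[name]}")
    return "\n".join(lines) + "\n"
```

Calling `render_constants(values, os.environ)` on a dev machine and in CI should then give byte-identical files as long as both see the same `SOURCE_DATE_EPOCH`, which is what makes the diff check meaningful.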

I have a simple script that asserts a clean working directory here https://github.com/mnahkies/openapi-code-generator/blob/main... which I use to check generated output hasn't changed after running the generation step in CI.

It relies on your generated artifacts being deterministic, which is a design goal of that particular project so works fine there.

No, they should be generated by either dev or something like pre-commit and then checked if they match what's generated by CI.

And yes, those have to be deterministic with regards to inputs, it does not make sense otherwise.

No unauditable generated code for me, either manually or automatically, thanks.

Why would generated code be unauditable?

The inputs and the generation will obviously be defined.

That's not the case for autotools output, or flex and bison output.

If the generated files are what you say? Well, just embed the generation step into the build system. A simple approach like that is easily made reproducible, and we avoid introducing noise into the repository.

The blog post does explain why some of the generation is done separately. But yes, that is also a viable approach.

> an approach which created an ideal space for things like the XZ backdoor.

That's not entirely correct. Indeed there was a part of the xz backdoor that lived in the configure script. However, that part was also included in the sources of the configure script as found in the tarball (and not in the git archive).

Thus regenerating the configure script didn't help, but regenerating the tarball did.
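One way to check a release tarball against the git tree is to hash every file in the tarball and compare it with the same path in a checkout. This is only a sketch: the function names, the `strip_prefix` parameter (for the usual `pkg-1.0/` top-level directory), and the idea of reporting "extra or different" files are my own, not something any project is known to ship.

```python
import hashlib
import tarfile
from pathlib import Path


def tarball_digests(tar_path: str, strip_prefix: str = "") -> dict[str, str]:
    """Map each regular file in the tarball to its SHA-256 digest."""
    digests = {}
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            name = member.name
            if strip_prefix and name.startswith(strip_prefix):
                name = name[len(strip_prefix):]
            data = tar.extractfile(member).read()
            digests[name] = hashlib.sha256(data).hexdigest()
    return digests


def tree_digests(root: str) -> dict[str, str]:
    """Map each file under a checkout to its SHA-256 digest."""
    base = Path(root)
    return {
        path.relative_to(base).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in base.rglob("*")
        if path.is_file()
    }


def not_in_tree(tar_path: str, root: str, strip_prefix: str = "") -> list[str]:
    """List tarball files that are missing from, or differ from, the checkout."""
    tree = tree_digests(root)
    tar = tarball_digests(tar_path, strip_prefix)
    return sorted(name for name, digest in tar.items() if tree.get(name) != digest)
```

Anything this reports is either a generated file or something that only exists in the tarball, and both are exactly the files nobody reviewed in git.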

In this case, I can say autotools's advice is outdated at best, and one shouldn't follow it.

It adds unneeded complexity.

They do not, and never did, commit generated files (as far as I can tell). Their release process used to generate some files and place them into a distribution file, but that file was never committed anywhere.

The same applies to refactorings unfortunately.

If you make a large but simple refactoring, like renaming a frequently-used function across a large repo, nobody is going to audit that diff and check for extra changes.

Things don't have to be this way. Google's source control system apparently has tools that can do such refactorings for you in a centralized fashion, and one could make something like that for git.

Going to the extreme of this though, I really really hate getting an autoconf project with no generated configure file. I don't want to install the full autotools suite just to do a build!

On the other hand, keeping tarballs close to the git tree makes it easy to reuse git archive and related GitHub features, provided the repo properly includes some kind of versioning information in tree.

Linux software sources are in a weird spot between users and developers.

I, as a developer, organize sources in a way that makes it easy for another developer to work with them. My software will never be compiled by any user. All my users use build artifacts.

I might consider adding autogenerated code, but only when I'm like 99% sure that this code won't ever change. For example, that's the case for integrations with many organizations where WSDLs are agreed upon once and then never touched. Regenerating the Java sources on every build just adds a few seconds to the build time without any noticeable advantage.

The fact that some Linux users prefer to build software from source and at the same time do not want to install the necessary build tools is a bit of a strange situation.

Maybe containers should be better utilized for this workflow. For example, the developer supplies a Dockerfile which builds the software and then copies it to some directory. You run `docker build .` and then copy the binary files from the container to the host.

PostgreSQL also supports Meson, which requires no generated files to be convenient.

Including autoconf outputs serves to avoid having to have autoconf installed. Because autoconf installs historically lagged behind what autoconf-using projects wanted, this used to be a problem. Nowadays it's not that big a deal.

As u/nrabulinski says, you can have the CI system generate and commit (with signed commits) autoconf artifacts.

> Nowadays it's not that big a deal.

The same can be said about autotools itself :/

Historical and current use indeed vary, and in many cases using autotools at all isn't as appropriate as it once was.

Generated outputs, especially when they are source code (headers, etc.), are important to keep for debugging later.