by simonw 703 days ago
A subset of this idea is a hill I am willing to die on: the documentation for a codebase should live in the same repository as the codebase itself.

I'm talking about API documentation here - for both code-level APIs (how to use these functions and classes) as well as HTTP/JSON/GRPC/etc APIs that the codebase exposes to others.

If you keep the documentation in the same repo as the code you get so many benefits for free:

1. Automatic revision control. If you need to see documentation for a previous version it's right there in the repo history, visible under the release tag.

2. Documentation as part of code review: if a PR updates code but forgets to update the accompanying documentation you can catch that at review time.

3. You can run documentation unit tests - automated tests that check that the documentation at least mentions specific pieces of the code (discovered via introspection). I wrote about that a few years ago and it's been working great for me: https://simonwillison.net/2018/Jul/28/documentation-unit-tes...

4. Most important: your documentation can earn trust. Most documentation is out of date and everyone knows that, which means people default to not trusting documentation. If anyone who looks at the commit log can see that the documentation is being actively maintained alongside the code it documents they are far more likely to learn to trust it.
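
As a rough illustration of the documentation unit tests in point 3: the sketch below uses introspection to list the public functions of a module and reports any that the documentation text never mentions. The `api` module, its function names and the `DOCS` string are all invented for the example, and this is not the code from the linked post; a real project would import its own package and read its own docs files.

```python
import inspect
import types

# Hypothetical stand-in for a real package, built in memory so the
# example is self-contained.
api = types.ModuleType("api")

def fetch_rows(table):
    """Return all rows from a table."""
    return []

def delete_rows(table):
    """Delete all rows from a table."""
    return 0

def _internal_helper():
    """Private helper; excluded from the check by its leading underscore."""

api.fetch_rows = fetch_rows
api.delete_rows = delete_rows
api._internal_helper = _internal_helper

# Hypothetical documentation text; in practice this would be read from
# the docs files that live in the same repository.
DOCS = """
fetch_rows(table)
    Returns every row in the given table.
"""

def public_functions(module):
    """Discover public function names via introspection."""
    return [
        name
        for name, _ in inspect.getmembers(module, inspect.isfunction)
        if not name.startswith("_")
    ]

def undocumented(module, docs_text):
    """Return the public functions that the docs never mention."""
    return [name for name in public_functions(module) if name not in docs_text]

print("missing from docs:", undocumented(api, DOCS))
```

In a test suite the last line would become an assertion that `undocumented(api, DOCS)` is empty, so a PR that adds a function without documenting it fails CI.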

The exception to this rule for me is user-facing documentation describing how end users should use the features provided by the software. I'd ideally love to keep this in the repo too, but there are rational reasons not to - it might be maintained by the customer support team who may want to work in more of a CMS environment, for example.

9 comments

Love your blog, but in this case I want to take a more nuanced, if not opposite, stance:

There are many things closely related to code, that shouldn't necessarily live in the same repository. First, we need a common understanding of what should live together in a repository. This is much like the discussion about mono vs. multi-repo. A good rule of thumb is that if it is branched together, it lives together.

Effective documentation is not only a strict API reference, and not something that can be generated from docstrings alone. It offers a high level overview to understand the problem being solved, the architecture of the software, and a general roadmap of how it is developed. Effective documentation should cover both backwards and forwards revisions and how those migrations should be handled.

But this is also true on a reference level. Reading the documentation of a specific function, I want to know if something relevant happens to this function in the next revision. There is nothing worse than checking out documentation for the current production revision 34.5 and following best practice there, only to discover I should have checked out revision 34.6 instead because best practice changes there. Specific revisions should be documented, but documentation should not be limited to a specific revision.

There is a scale of how closely other artifacts follow code revisions: Tests are mostly branched with code, and should probably live together. Documentation can sometimes be branched with code; some should and some shouldn't live together with code. Deployment code and configuration management must be able to deploy old and new code from the same code base, and are even less likely to benefit from living with it. Then there's application state and test data, which is something else entirely.

If the deployment code needs to be able to ship different versions, I would keep that deployment code in a separate repository - with its documentation bundled there.

The other form of documentation that I am passionate about is documentation that lives in issues, and then linked to from commit messages.

The great thing about issues and issue comments is that they have a clear timestamp attached to them, and there is no expectation that they will be kept up-to-date in the future.

This makes them the ideal place to keep documentation about how the code evolved over time, and the design decisions that were made along the way.

That is also true. But I realize the above comment could be more clear, perhaps with an example.

A well-working project such as git has a Documentation directory in the same repository. That's good, but that documentation is far from enough. The most canonical documentation is the "Pro Git" book. That documentation describes not only how to use the software, but also how versions differ, how functionality has evolved, and what the internal data structures look like.

That documentation does not live in the git repository, and that's a good thing, as it is not versioned in the same way. That probably goes for a lot, if not most, of good documentation out there. Insisting on keeping documentation in the main code repository would go against that.

Sure, there's a whole world of documentation that can live outside of the repository - anything written by people outside of the core development team such as tutorials, books etc.

Of course, the problem with documentation like that is that it goes out of date almost by its very nature. The great thing about documentation in the official repo is that it can come with a guarantee to be maintained in the future - if that documentation gets out-of-date it's a bug, and should be fixed.

External tutorials and books carry no such expectation.

Yes, but that Pro Git is developed by people outside the core development team (whatever that might mean) is beside the point. The point is that it is documentation that does not move in lockstep with the software. And most good documentation doesn't!

Had Hamano or Torvalds written Pro Git, it would still have been worse off had it been forced into the release schedule of git itself. The most useful documentation describes all versions of the software, and should be only loosely coupled with it. The same can be said for web sites for software which is also a type of documentation.

(This is, incidentally, also why over-reliance on docstrings and documentation testing makes good documentation hard. Certain examples need to be produced by older revisions of the software, especially when incompatibilities are what needs to be documented.)

Not all documentation is like that, of course, but when someone successfully insists on hard coupling documentation to code, that puts a hard limit on the type of documentation that will be written.

Despite how much having a common release process for code, documentation, and deployment code tickles our nerd fancy, we should consider the opposite, as there can be benefits from a looser coupling. Never let smart stand in the way of good.

As perhaps is obvious, I too have fought on the same hill many times, but from another perspective. Docstrings are good. Documentation in the code repository is good. But that is only a small subset of all documentation. Blessing that subset as canonical, or insisting that it should be all there is, is a much too common mistake.

"Not all documentation is like that, of course, but when someone successfully insists on hard coupling documentation to code, that puts a hard limit on the type of documentation that will be written."

I don't think I've ever seen a project argue so passionately for "all documentation lives in the same repo as the code" that people were put off writing books or tutorials that didn't go in that repo.

I'm pretty sure we aren't actually disagreeing here. I'm fine with "unofficial" documentation - books, tutorials etc - that lives outside the repo. The official reference documentation that's updated to reflect changes made to the project should live alongside the project itself.

"Specific revisions should be documented, but documentation should not be limited to a specific revision."

It's unclear to me what this is trying to argue. So apologies if the below entirely misses your point.

Technical documentation that refers to a codebase should live and be maintained with it. Otherwise there will certainly be drift. It can obviously still happen but at least it is provable it shouldn't have.

Not maintaining accurate documents is like disabling tests because they don't pass. It's easy to do but not right.

A checked in codebase to me should be as current and correct as possible. That includes accurate documentation.

I've rarely seen documentation that isn't tied to the codebase being maintained/valued.

> "Specific revisions should be documented, but documentation should not be limited to a specific revision."

I think this boils down to “what do you do if you realize the documentation for v1.1 says that some feature does X when it actually does Y, but you’re already on version 2.2?”

If v1.1 docs are tied to the version tag in VCS, that incorrect statement cannot be fixed.

And it seems that fixing that forces you into backporting documentation even if you don’t release and maintain parallel versions of your software, which… kinda sucks.

To be honest, I much prefer docs in the repo, because it facilitates code review — a good patch touches some implementation, some tests, and some docs.

The downside when only the latest few versions of the software are supported and only the very latest docs are maintained is that historical docs will probably not be fixed.

> If v1.1 docs are tied to the version tag in VCS, that incorrect statement cannot be fixed.

It can be fixed: just check out the branch, git cherry-pick the updated change set, or even write it by hand, then do a re-release. You should have a process for this anyway, as there might be a critical bug in that code that needs to be fixed and a re-release must happen.
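
A sketch of that backport flow, written as a small Python wrapper around the git command line. The branch name `release-1.1`, the commit id `abc1234` and the tag `v1.1.1` are placeholders, and it assumes a maintenance branch already exists for the old release. By default the function only plans the commands; it runs them when `dry_run=False`.

```python
import shlex
import subprocess

def backport_doc_fix(branch, commit, new_tag, dry_run=True):
    """Plan (and optionally run) the git commands to backport one commit.

    All three arguments are placeholders supplied by the caller.
    """
    commands = [
        # Switch to the maintenance branch of the old release.
        ["git", "checkout", branch],
        # Apply the docs fix; -x records the original commit id in the message.
        ["git", "cherry-pick", "-x", commit],
        # Tag the result so the corrected docs can ship as a re-release.
        ["git", "tag", "-a", new_tag, "-m", f"Re-release with docs fix {commit}"],
    ]
    if not dry_run:
        for command in commands:
            # check=True stops at the first failure, e.g. a cherry-pick conflict.
            subprocess.run(command, check=True)
    return commands

# Dry run: print the commands instead of executing them.
for command in backport_doc_fix("release-1.1", "abc1234", "v1.1.1"):
    print(shlex.join(command))
```

If the branches have diverged too far for the cherry-pick to apply, the fix has to be written by hand on the old branch, as described above.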

Of course git's handling of branches leaves much to be desired (I want mercurial to come back just for its branch handling) and so developers often forget they can do this and it isn't really that hard. It is tedious though, and you will eventually have dual maintenance where you have to write the same code twice just because the two branches have diverged - this shouldn't be an excuse not to do it though.

> a hill I am willing to die on: the documentation for a codebase should live in the same repository as the codebase itself.

I'm a big fan of this and treating documentation like a first class citizen.

There's also another benefit I think should be explicitly mentioned. It makes debugging, onboarding, and solving things much faster. We all know and have experienced the joke where you question who wrote this pile of garbage to find out that it was you all along. But at the core of this joke is the fact that we can't even remember what we ourselves did. So while things make sense at the time and might even seem obvious, that does not mean it'll continue to make sense nor that it'll be obvious to others. Especially to people who are onboarding into a new codebase.

Yes, documenting while you code takes "longer." But it only takes longer in the short run. It is much faster in the long run. The question you have to ask is whether you're doing a sprint or a marathon. But then again there's very ill-advised and self-contradictory advice on well-known sites[0], and some companies perform back-to-back sprints. But I don't think people realize we're the ones creating our own messes. As anyone with anxiety will tell you, when you are rushing around it becomes easy to overlook small mistakes that compound and accumulate, making your anxiety worse than it would have been had you just slowed down in the first place. That creates a vicious cycle where you only get more stressed and end up creating more problems than you solve.

There are times to move fast and break things, but if you don't also dedicate time to clean up, your house will be filled with garbage and inhabited by Lovecraftian entities made of spaghetti and duct tape.

[0] https://www.codecademy.com/resources/blog/what-is-a-sprint/

5. The documentation won't get lost in a botched wiki migration or something like that.

The documentation in the repo should not be restricted to relatively low-level stuff about APIs, it should also include design documents and cover the higher level concepts the developers use to make sense of the app and its APIs. I can't tell you how many times I've seen these concepts lost after the original developers move on, and then get violated in ways that make the app much harder to comprehend.

The "documentation" for Lemmy consists merely of an auto-generated JavaScript library API dump with no real explanation of what most of the endpoints do (they are often named ambiguously), how the general flow of things is supposed to work, or even how to do common things like find a user's comments or posts (would you have guessed they're both under "/user"? Because they sure don't tell you that). Especially if you don't know JavaScript, you're going to have a bad time trying to use that API. And the devs defend it if you tell them this, claiming "it defines everything perfectly, it's so easy."

One time my company purchased a $5k commercial license for x264 and were met with "the code is the documentation." That set us back literal weeks.

> the documentation for a codebase should live in the same repository as the codebase itself

This! 100%. Emphasis - codebase documentation. Not user guides.

After doing this a couple of times, it's a no-brainer. The benefits are significant, the effort minimal. Just add a docs dir at the project root and go to town.

The docs dir has some very interesting stuff - how to run parts of the api locally, tricks to make auth bearable for local development, commands that get new team members going at hyperspeed, what parts talk to what parts, which files are important for what flows, why some refactoring was attempted but abandoned, high level limitations and benchmarks, history on how some monstrosity came to be with some jokes sprinkled about.

Everything is just one cmd+shift+f away.

It works for user-facing documentation too. There are actually pretty good reasons for this - e.g. you can use the tests to autogenerate up-to-date screenshots with Playwright to put in the documentation.

I'm pretty convinced that there should be a single source of truth for specifications, tests and documentation but I think the industry will take a while to catch up to this idea.

I built a testing library centered around this (same as my username) but it's hard to get people to stop writing unit tests :)

I actually built my own Playwright screenshotting software with this idea in mind too: https://shot-scraper.datasette.io/ - I wrote about using that for my project documentation here: https://simonwillison.net/2022/Oct/14/automating-screenshots...

Really it comes down to the team you are working with. If you have user-facing documentation authors who are happy with Markdown and Git you can probably get this to work.

That's very cool.

I think screenshotting needs to be integrated into the tests though - if a scenario involves a wizard or something, the latter screenshots will be dependent upon the actions in earlier steps.
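
A minimal sketch of what screenshots taken inside the test could look like, assuming Playwright's Python sync API. The wizard URL, the selectors and the `docs/screenshots` directory are invented for the example. Each step runs against the state left by the previous one, and a screenshot is taken after every step.

```python
def run_scenario(page, steps, screenshot_dir="docs/screenshots"):
    """Run each step in order and take a screenshot after it.

    `page` is expected to behave like a Playwright Page; `steps` is a list
    of (name, action) pairs where each action receives the page.
    """
    paths = []
    for index, (name, action) in enumerate(steps, start=1):
        action(page)  # later steps see the state left by earlier ones
        path = f"{screenshot_dir}/{index:02d}-{name}.png"
        page.screenshot(path=path)
        paths.append(path)
    return paths

# Hypothetical three-step wizard; the URL and selectors are made up.
WIZARD_STEPS = [
    ("start", lambda page: page.goto("http://localhost:8000/wizard")),
    ("details", lambda page: page.fill("#name", "Example project")),
    ("confirm", lambda page: page.click("text=Next")),
]

def main():
    # Imported here so run_scenario can be exercised without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        print(run_scenario(page, WIZARD_STEPS))
        browser.close()
```

Calling `main()` needs Playwright and a running app; the returned paths can then be referenced from the generated docs.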

The thing I really want is automated short video demos, but I've not found a good path to those yet.

The skeleton test-generating-docs example I built to exhibit the framework actually does this. Here's an example:

https://github.com/hitchdev/hitchstory/blob/master/examples/...

It records a video while running the test and then at the end runs it through FFMPEG to make a smaller, slowed down GIF that can be embedded in the autogenerated docs.
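
For readers who want to try the video-to-GIF step, here is a hedged sketch of how such an ffmpeg command could be assembled. It is not the command the linked example uses; the file names and the default slowdown, frame rate and width are assumptions. It relies only on the standard `setpts`, `fps` and `scale` filters.

```python
import shlex

def gif_command(video_path, gif_path, slowdown=2.0, fps=10, width=640):
    """Build an ffmpeg command that turns a test video into a small, slow GIF."""
    filters = ",".join([
        f"setpts={slowdown}*PTS",  # stretch timestamps: 2.0 means half speed
        f"fps={fps}",              # lower the frame rate to shrink the file
        f"scale={width}:-1",       # resize, keeping the aspect ratio
    ])
    # -y overwrites an existing GIF so repeated test runs refresh the docs.
    return ["ffmpeg", "-y", "-i", video_path, "-vf", filters, gif_path]

# Print the command; pass the list to subprocess.run to actually execute it.
print(shlex.join(gif_command("test-run.webm", "docs/demo.gif")))
```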

It's quite rudimentary though. I've been meaning to try making something more sophisticated and even potentially do additional automated video editing to inject text from the steps or something.

I would be very happy if as much developer documentation as possible was actually executed as part of the code.

For example, a diagram of how different services interact can go out of date. It would be better if there was a config file describing which services can be called, and this config file was used to generate firewall rules (for the case where dependencies on services are missing) and alert rules (for the case where unnecessary dependencies are never removed). Another example might be OpenAPI docs that you use to validate requests and responses.
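
To make the service-dependency idea concrete, here is a toy sketch in which one config drives both the allow rules and the audit. The service names, the `ALLOWED_CALLS` structure and the observed traffic are all hypothetical; a real setup would emit rules in whatever format its firewall and alerting tools expect.

```python
# Hypothetical source of truth: which service may call which.
ALLOWED_CALLS = {
    "web": ["auth", "orders"],
    "orders": ["payments"],
}

def firewall_rules(allowed):
    """Turn the config into sorted (source, destination) allow rules."""
    return sorted((src, dst) for src, dsts in allowed.items() for dst in dsts)

def audit(allowed, observed):
    """Compare declared dependencies with observed traffic.

    Returns (undeclared, unused): calls seen but not declared, which the
    firewall would block, and declared calls never seen, which are
    candidates for an alert and eventual removal.
    """
    declared = set(firewall_rules(allowed))
    seen = set(observed)
    return sorted(seen - declared), sorted(declared - seen)

# Hypothetical traffic sample, e.g. collected from request logs.
OBSERVED = [("web", "auth"), ("web", "payments"), ("orders", "payments")]

undeclared, unused = audit(ALLOWED_CALLS, OBSERVED)
print("blocked by firewall:", undeclared)
print("declared but unused:", unused)
```

Because the same config generates the rules and feeds the audit, a diagram rendered from it cannot silently drift from what the system enforces.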

I think that when you enforce a common source of truth behind both your docs and the functionality of your system, those docs can never become outdated. If you just shove docs into git without using them for anything they can easily rot away.

I have often wondered why Android's javadoc is so awful ... and thought, maybe it is precisely because it's embedded in such a large codebase that it doesn't get updated, for risk vs perceived benefit reasons (because of the proximity of javadoc to code). Of course, it could be cultural or other things ... Perhaps the tooling sees changed sources, giving false positives for code changes, and there is a desire to eliminate this to help downstream consumers etc?

True. Confluence or whatever corp shitware is where technical documentation goes to die.

I’ll happily die on that hill with you