Hacker News new | ask | show | jobs
by munificent 45 days ago
> We chose a Saturday to format the entire codebase to avoid merge conflicts. And while our test suite gave us high confidence we'd gotten everything right, it's always a bit daunting to have a diff so large that GitHub can't render it.

The dart formatter has an internal sanity check. It walks through the unformatted and formatted strings in parallel skipping any whitespace. If any non-whitespace characters don't match, it immediately aborts. This ensures that the only thing the formatter changes is whitespace, and makes it much less spooky to run it blind on a huge codebase.

That sanity check has saved my ass a couple of times when weird bugs crept in, usually around unusual combinations of language features around new syntax.

(Unfortunately, the formatter in the past year has gotten a little more flexible about the kinds of changes it makes, including sometimes moving comments relatively to commas and brackets, so this sanity check skips some punctuation characters too, making it a little less reliable.)

3 comments

I imagine a fancier version would be to compare the Abstract Syntax Trees.
The balancing act is that the fancier your sanity check, the greater the chance of something slipping through its cracks too. Walking too strings in parallel is very simple and hard to get wrong. Traversing an AST and skipping a branch is exactly the kind of easy-to-make bug that the sanity check is designed to catch.

What I'd like to do is something somewhere in the middle where I walk the token stream and check that every token of the input ended up in the output, but I haven't figured out a simple and fast way to do that yet. Performance is particularly tricky because I obviously don't want to burn a bunch of CPU cycles on a sanity check that exists only to catch bugs.

I've always thought it would make sense for formatters to be baked into the toolchain so that they can reuse the language's parser (presumably exposed as a library) and then be implemented via parsing to AST and then formatted back out so that they're guaranteed to be correct and normalized. This doesn't seem to be how most formatters work in practice though, although I'm not sure if it's because of performance reasons or a lack of support for the parser being exposed in language toolchains.
That is essentially what clang-format is.
Good point, I hadn't really thought about it, but the name makes it pretty clear it's using clang's tooling. I only have worked a small amount in C++ in my career years back ago, but I distinctly remember feeling like clang-format was essentially perfect from my perspective, so it's nice to know that my abstract ideals bear out in practice.
The only issue is then you're at the mercy of whatever parser your formatter uses to construct the AST
Well, if any (common, non-hobby) parser is thrown off by the reformatting, then it's probably not a safe reformatting either way.
Strictly speaking that wouldn’t work, since a1 is different from a 1, for example.
I don't think they're trying to say that this is a sufficient test for correctness but a necessary one.
Correct. It won't catch 100% of possible bugs, but it will catch most.

The kind of bugs that are easiest to write in a formatter is dropping a bit of syntax on the floor and forgetting to include it in the output, and the sanity check will catch those.

It's also definitely possible to miss some whitespace that's necessary for things like identifier separation, but... <shrug> it's a sanity check, not a proof of correctness.

In practice, that's how most software testing works anyhow!
Lots of formatters also unify things like trailing commas, so it would be slightly more involved than this.
Yes, the dart formatter does that now too. So the sanity check ignores commas and semicolons, which makes it less robust as a sanity check, unfortunately.