Hacker News new | ask | show | jobs
by QuinnWilton 2894 days ago
A lot of people are commenting that SemVer doesn't work, because it's still at the mercy of humans choosing good version numbers.

Elm's package manager, elm-package, actually tries to remove humans from the equation, by automatically choosing the next version number, based on a diff of the API and the exported types of a package: https://github.com/elm-lang/elm-package#publishing-updates

It's not perfect, but it's better than anything else I've seen.

6 comments

It litterally changes nothing. I've had to fork a lib to bump a dependency version just recently.

The only things that protect elm from knowing dependency hell are the age of the ecosystem and the fact that there are not that many common package publishers outside of the core team.

That’s a pretty neat strategy. Elm really brings a lot of cool (and fun) sugar to dev flows. I’ve been looking for a reason (project) to use it in so I can transition from an enamored fan-person to a true evangelist :)
I'm doing this kind of automation with Maven in Java. There is a plugin (build helper I believe is the name) that gives you properties like "next.release.version", "current.release.version", "next.snapshot.version", etc.

So I've setup an infrastructure where you just click a button and it performs a release with _proper_ version number in accordance with semver, simply does the right thing. Works like a charm.

I don't understand complains about human factor when you eliminate it with ease.

And who decides if a change is breaking or not?
A very interesting talk by Rich Hikey (creator of Clojure) on the topic: https://m.youtube.com/watch?v=oyLBGkS5ICk

To summarize, any change that offers more (as in, more fields in the output) and/or requires less (more of the original parameters made optional) is non-breaking.

Edit: Of course, this is all under the assumption that the semantics of methods should never change. Instead of changing semantics, he proposes to just create a new method with a new name.

I can think of two (relatively) simple tests:

A) you generate an export of the public API (entrypoints, arguments, types) in some stable format.

B) you run the (public-api) testsuite of the previous release against the current release.

- If A yields no changes wrt the previous release, then B should probably succeed as well (barring changed tests). Bump the patchlevel.

- If A has changes but B succeeds, it's a minor version bump.

- If A has changes and B reports failures, it's a major version bump.

Of course, this isn't fool-proof, so you still want some way of overriding the automated version bump. But I think it's reliable enough to at least force developers to think about the changes they're making. Of course, it does rely on having a testsuite, and explicitly marking parts of it as public-api validation.

Probably pretty simple. Destructively making a public API change is breaking. Additively changing an API isn’t. Adding an optional public param or new symbols isn’t.

Harder to detect cases in which the signature stays the same but the expected behavior changes, though (as noted in TFA).

One somewhat subtle implication is that you need to have a clearly defined expected behaviour. If people have to read your implementation to use your API, then your implementation has become the specification and basically every change will be a breaking change.
Basically, if your old unit tests break, or if the post-conditions changed (in case you use design by contract), that's a new major version, otherwise, it's not.

But yeah, there are counter-examples, and anyway a user can expect a behavior to remain even if it is not explicitely specified. For instance, I developed a SAT solver. I could make a non-breaking change that improves the efficiency of the solver in like 99.9% of the cases, because of a cool heuristic. That would be considered a "patch", since I didn't change the API at all, and all unit tests are still working perfectly. But a former user could be pissed off because he's part of the 0.1% and now the tool became just a little too slow for his use. Is that a breaking change or not?

Unless you also provide performance guarantees ( specified by benchmark testing, as per your defination), then no
It's simple, everything that has existed should behave exactly as it used to before. In order to ensure this people invented lots so technics - code reviews, exploratory testing, integration tests and so on. It's even boring to discuss it, it's just basics.
He means, how can a machine determine it?
Ah, you mean what to bump and when? That's easy, you make a decision to break an API so you need a major version bump. In this case you perform a customized build overriding variables in CI GUI. All other releases are automated minor bumps if on master and patch version bump if it happens on release branch which we keep around for bug fixes. Those bug fixes propagated through release branches up to the master automatically whenever you merge a PR with a bug fix.
I don't know whether parent is doing this, but I used to use the Clirr maven plugin to fail the build if a release had binary-compatibility changes. You could then have a distinct release profile that permits a BC break but increments the major version number.
What about basing the decision on the API unit tests ? Combined with API signatures hashes, maybe there would be enough information to deduce minor changes, bug fixes and major changes
You can still break behavior without breaking the public API.

But yes, tools like this would eliminate most of the semver issues.

I can't believe such logic isn't built-in to more package managers.

Is API compatibility computable in general? My instinct is that it is, but I’ve never seen a theorem.
No, it isn't computable (that is, correctly determining one of "these functions behave the same" or "these functions behave differently", and not "unknown") in general, as it is equivalent to the halting problem. Consider these two versions of a function, are they API compatible?

  def foo():
    return True

  def foo():
    return halts("some turing machine program")
They're only truly API compatible if the program halts, but the program can be arbitrary, so proving that any 'foo' of this style are equivalent is solving the halting problem.

Of course, one can still likely get useful answers of "definitely incompatible" etc., with much more tractable analyses. AIUI, the Elm version ends up just looking at the function signature, and catches things like removing or adding arguments: for appropriately restricted languages, it is likely to even be possible to determine if downstream code will still compile, but that's not a guarantee it will behave correctly.

It's been a while, since I studied this, but linear bounded automata are arguably a better model for the real computers we can actually build and the halting problem is computable for LBAs.
Yeah, that seems true.

That said, in practice I'm not sure it is too useful (this is a slightly different question to the one originally asked, though). My understanding is the proof of decidability is essentially a pigeonhole principle argument based on LBA having a finite number of states: run the LBA for that many steps, if it hasn't halted, then it has looped. Even if a program is running on a system that only allows programs to use 1GB of memory, there's 2^(2^30*8) (approximately 10^(2.6e9)) states.

Essentially, the problem is not with computers, but with languages.

If your language lets you build a neverending loop/datastructure/etgc, then you cannot prove halting for all the programs that can be made by the language.

Since such languages exist, then by extension, it applies to computers.

I wonder if tighter guarantees can be given if tools can work with constructs like D's contracts. These are extra sections in each function that are intended to check invariants. If these invariants change, then they were either broken and needed fixing or the function had a semantic change.
That's an interesting idea. It seems like it would be a way for tools to flag "this function is likely to have changed in an interesting way", but changing invariants doesn't necessarily mean the function breaks semver.

For one, an invariant might be changed syntactically, without actually changing what it is asserting, reducing to exactly my example above: in its simplest form, the contract could go from in(true) to in(halts("...")).

Secondly, an contract could have been made more permissive, e.g. in(x > 1) becoming in(x > 0), or out(y > 0) becoming out(y > 1). Assuming violating a contract is regarded as a programming error (as in, it isn't considered breaking to go from failing an assertion to not failing), these are also non-breaking changes.

Lastly, changing behaviour doesn't necessarily mean changing invariants/contracts.

There can still be a semantic change, even if conditions are unchanged and valid.
D contracts are arbitrary code snippets. So they're Turing-complete, therefore it's still uncomputable.
The halting problem doesn't matter if your API is required to define timeouts / time guarantees. Thus if a#1.0() returns "hello" after 100ms and a#1.1() returns "hello" after 20s, even though they return the same result, they are deemed functionally different. This follows real-world expectations where performance which suddenly slows to the point of unusability is considered breakage, even if the same result is returned in the end.
Replace it with any other nonsense.

In practice your time guarantee isn't going to let you produce a sound and complete static analysis that proves the equivalence of two arbitrary (modulo termination guarantees) functions.

In the real world all programs have a timeout set to the time to the heat death of the universe. This hasn't helped us make sound and complete static analysis.

I think that's essentially the same point as LBA in a sibling comment. The same intractability-in-practice applies to it: even if your timeout is 10 milliseconds, that's still (tens of) millions of operations on most modern CPUs and order of a gigabit written to/read from memory, which is a huge state space.
I think it's more like a lower bound on difficulty. In other words, halting is one of many guarantees that would need to be checked to ensure that an API didn't break, and it is impossible to check, so it cannot be done, irrelevant of the difficulty of the other problems (like timeouts or whatever).
I don't think you understood my comment - you don't need to check for halting if all methods in the API have timeout guarantees. Halting is only a problem if runtime is uncertain. If runtime is certain (again, because of promised performance figures) then it is entirely possible to verify whether an API has been broken or not.
Thanks for that pedantic insight, doc.

It's patently obvious that API compatibility is a computable problem. You are just checking if API1 is a subset of API2.

Even if foo() never returns, the Application Programming Interface is unchanged if the function signature is the same.

Is it just pedantry? Even in the strongliest of practical strongly typed languages, two functions with the same signatures might fail to return on some inputs in the new version that it didn't in the old.

That's a practical distinction I care about that can't be computed.

If you're narrowly interpreting an API to mean just the function signature, then sure, if your type system is appropriately restricted then it is computable (although many type systems are Turing complete, meaning this won't be computable in all cases, for the same reasons). This is, in fact, something my comment explicitly called out.

Having signature compatibility is an important prerequisite, but I think it's possible worse to have code that compiles but does the wrong thing than to have code that doesn't compile (at least it's obvious to the developer that something needs to be fixed, in the latter case).

I'd imagine it isn't, at least depending on how you define API compatibility, and whether you're only looking at the API interfaces. Imagine two versions of a library that implement the function "add".

  Version 1:
  add Int -> Int -> Int
  add x y = x + y

  Version 2:
  add Int -> Int -> Int
  add x y = x * y
Both versions expose the same API interface, but the functions that conform to that interface are semantically different. A stronger type system could probably differentiate between the two functions, but I doubt you could generally compute whether both functions implement the same behavior.

Perhaps with some sort of functional extensionality you'd be able to compute compatibility perfectly, but I can't imagine that ever being feasible in practice.

That being said, what Elm does offer is still a huge improvement over humans trying to guess whether they made any breaking changes :)

You certainly cannot determine that two programs do or do not have the same behaviour in the general case.

In specific cases, the proofs can be pretty trivial:

https://tinyurl.com/ycx2245q - proof your two programs are different

https://tinyurl.com/y7lcv3re - proof 'y + x' is the same as Version 1.

These proofs weren't automatically discovered, though for such simple programs I'd expect an SMT solver to be able to find the proof or a counterexample easily enough.

But even proving they're the same value for all inputs isn't all that helpful, because of lazy Haskell code like:

    version 1:
    fib :: Int -> Int
    fib n = if n < 3 then 1 else fib (n - 2) + fib (n - 1)

    version 2:
    fib :: Int -> Int
    fib n = let fs = 1:1:zipWith (+) fs (tail fs) in fs !! n
They're (provably) the same for all (positive) input values, but if you call version 1 with n greater than, say, 35, you'll be waiting quite a while for an answer, while version 2 will be very snappy for the first few hundred thousand values of n, at least, after which the size of the answer will be a bottleneck.

If a library switched from the latter to the former, it'd have a good chance of breaking code.

While obviously exponential code is obviously exponential, this sort of behavioural change can show up in less obvious ways - a change in memory usage might blow your heap after a supposedly minor version change, eg.

In the end I don't think the value judgement of 'breaking' or even 'significant' is computable, and you'd need to rely on a human doing something that approximates to 'right' for your world view with their version numbering.

I don't think 'does it work exactly the same' isn't necessarily the right question, given there are expected to be bug fixes which may change the behavior in some functions.
The right question is "what's the difference between a bug-fix and a breaking change?"
Irrelevant. If fixing a bug is a breaking change then it is a breaking change. This is the importance of pre-releases/"nightly" branches so that you don't have a mistake of addNumbers(5,5) returning 25 instead of 10 and not being caught and then needing to increment a major number to fix a typo of × to +.

A bug fix is a change. A breaking change is a change that breaks the API. Doesn't matter if the previous status was the intended one or not.

SemVer not working is almost entirely people simply not following it correctly. The projects that follow it as best as they can have very few mistakes were "woops that was a major change" occurs while projects that use it as a guideline may as well not be using it at all.

Irrelevant?? That is the most relevant question that a developer could ask in this situation.

Often, there are ways to work around bugs in an API. And, just as often, those workarounds will break as soon as the bug is fixed. The API developer is in a particularly poor place to judge whether a bug fix is breaking or not. Sometimes, they can talk to users to see if a change will break their code, but other times they can't. Thus, it comes down to how well a developer can predict whether a fix will break projects in the wild.

A bug fix is not a breaking change. You are speaking in tautologies. "A bug fix is a change that is a change to the API". No, bug fix does not change the interface. A bug fix does not change the API. It does not require a semantic version change unless it also changes the expected behavior of the method, in which case it is a change in the documentation, which is part of the application programming interface.
A bug fix makes the function behave as it's documented to, a breaking change involves a change to the documentation?
Only as far as the type system supports, but I feel like constraining the minimum version unit to increment is far better than doing nothing. Disallowing incrementing the patch version when the types of existing API changes is great.
Type compatibility between two APIs would usually be decidable, because otherwise the type systems would be useless.

But behavior testing is undecidable, easily follows from halting theorem. I doubt it would even be recognizable.

I think we should stop publishing releases. Writing code is subject to human error and could break somebody else's work.
Well, if you wrote stable code in the first place, there wouldn't indeed be any need for updates!
Jokes aside, a lot of developers would call a project "dead" when is working as intended and not receiving new commits.
'ls' is dead, we should stop using it.
Bad example ;) Updated on June 17th: https://github.com/coreutils/coreutils/commit/24053fbd8f9bd3...

OpenBSD true(1) however... seems pretty much dead. https://github.com/openbsd/src/tree/master/usr.bin/true

It's pretty amazing.
True. Besides, my specs are future proof. So I always offer my clients a 120 years "no update needed" guaranty, with very durable titanium punched cards.
What alternative do you propose?
I think you missed the joke.
This seems like something solved decades ago with c header files, they're easy to do a diff on and the only false negative is from adding a function. Even that would be fine if you weren't export raw structs.

It seems like we gave up simple ways to do stuff like this because we hated header files and moved to tools like java and c# that eschewed them. Then they got reinvented and renamed to interfaces and we've come up with all sorts of other complicated tools (elm-package, mocking, IoC) just to recreate the functionality we lost in header files.

Diffing C header files has way more false negatives than just "adding a function".
Such as? White space changes can be ignored, comments can be stripped pre diff, both version can be run through the same formatter before hand. What you're then left with are the actual changes.

Anything removed or modified is a breaking change. Anything added to a public struct is a breaking change, which can and arguably should be hidden behind an interface. Adding functions is about all I can think of that's left.

Even if it's not perfect, it's a simple 90% solution with 40 year old tools.

What if:

* Function order changes

* Prototype goes from int function(char * ); to int function(char * arg);

* Typedef is added so instead of int function(char* );, it's typedef char* str; int function(str);

Good points, the second one in particular I don't think could be fixed without a full parser. Function order changes could possibly be worked around by a formatter/linter that can reorder functions, at the risk of creating more issues. The last could be handled be passing the code through the preprocessor (the -E flag in gcc) first.

By this point it's probably gone beyond the "perfect is the enemy of good" threshold though.

> The last could be handled be passing the code through the preprocessor (the -E flag in gcc) first

Nope, that only handles preprocessor directives (essentially, any line starting with '#'). Typedefs are handled by the parser.

That's still going to be way "less perfect" than diffing the output of `javap` or `godoc` or probably even the `help(module)` for a Python module.

The ability to diff APIs objectively gets better when we switch away from C header files.