Hacker News
by mabbo 1254 days ago
> But think: everything in those describe blocks had to be written by hand.

It also had to be thought about by the developer. Someone had to say "I want the code to do this under these conditions".

If your tests can be autogenerated then they aren't verifying expected behaviour, they're just locking in your implementation such that it can't change later. They are saying "hey look everyone, I got my coverage metric to 100% (despite any bugs I may have)."

9 comments

One of the projects at a place where I have worked was set up so that when you ran the tests it automatically and silently updated the values that were expected. Completely bonkers because the first time I was contributing to the project I prepared the tests first and then started the implementation, and then while I was working on it I ran the tests which at this point should fail because I hadn’t finished writing the code but instead all tests passed. Because helpfully the test setup overwrote the expected values that I had prepared in my new tests, with the bad data. Yeah great, very helpful >:(

Oh yeah and the whole test setup was also way too tied to the implementation rather than verifying behaviour. Complete trash the whole thing.

I keep rereading this hoping I'm misunderstanding.

That is cargo cult level behaviour. They know that software with lots of tests tends to have few bugs, so let's automatically have lots of tests!

I just hope whatever you were building wasn't critical to human lives.

https://en.m.wikipedia.org/wiki/Cargo_cult

> That is cargo cult level behaviour.

One person's "cargo cult behavior" is another person's "best practices". :P

My favorite example is automatically generated documentation. The kind that merely repeats the name of the method, the names and types of arguments, and the type of return value. The ironic part is that this is later used as evidence that all documentation is useless. Uhm, how about documenting the methods where something is not obvious, and leaving the obvious ones (getters, setters) alone? But then the documentation coverage checker would return a number smaller than 100% and someone would freak out...

This is just one of many examples, of course.

I hate to dwell on this, but I've also seen it in real life and it boggles the mind.

Like "give review feedback that this code isn't doing the right thing" -> "change the test to make it pass, not change the code to make it work". And it wasn't really a small case where you could plausibly do that and still understand what you were trying to do.

Coincidentally that was a few weeks after I saw a comment here on HN about someone who hired someone from Facebook, and the guy would change the tests so he could push to production, rather than fixing the bug that the tests pointed out ...

So yes it happens.

> Coincidentally that was a few weeks after I saw a comment here on HN about someone who hired someone from Facebook, and the guy would change the tests so he could push to production, rather than fixing the bug that the tests pointed out ...

Can't blame him, he moved fast and broke things /s

Perhaps he's a Buddhist? "If the software is going to break, then the software will be broken." Then he adds a little wabi-sabi for good measure. https://en.wikipedia.org/wiki/Wabi-sabi

I remember once using some in-house software which, for god knows what reason, could not log its errors back to the IT department. Instead, they relied on users to call up IT or email them with the error. To make it more fun for users, each error message contained a humorous haiku.

  Chaos reigns within.
  Reflect, repent, and reboot.
  Order shall return.

Edit: Just found this from 2001: https://www.gnu.org/fun/jokes/error-haiku.en.html My experience with haiku error messages at work was in '01 or '02.

Would it do this just the first time? It’s still bad that it was doing this silently, but it’s pretty common to test web APIs in a similar way manually. Make a request, check that the response you get back looks right (important step), and then save it as the expected value.

Edit: or after reading the article, like in the article.

It did this every time, not just the first time.

Well, you know what they say: Expect the unexpected!

I can somewhat understand, because this is kind of the goal of property-based testing: the actual values themselves matter so little to the test that you’re willing to subject those inputs to randomness.

That said, this doesn’t sound like a very good way to pull that off because the developer has no control over that randomness (where it’s needed greatly).
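For comparison, a minimal sketch of a property-style test where the developer does keep control of the randomness, via an explicit seed. The names (`my_sort`, `check_sort_properties`) and the seed value are made up for illustration; a real project would more likely reach for a dedicated property-testing library.

```python
import random
from collections import Counter

def my_sort(xs):
    """Hypothetical function under test."""
    return sorted(xs)

def check_sort_properties(seed, trials=200):
    """Run randomized checks; the explicit seed makes any failure reproducible."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 20))]
        out = my_sort(xs)
        # Property 1: the output is in non-decreasing order.
        assert all(a <= b for a, b in zip(out, out[1:])), (seed, xs)
        # Property 2: the output is a permutation of the input.
        assert Counter(out) == Counter(xs), (seed, xs)
    return trials

check_sort_properties(seed=12345)
```

The concrete values are random, but the properties being asserted were still chosen by a person, which is the part the silently-overwriting setup was missing.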

So long as the diffs get reviewed and checked in, this is a great form of testing called "regression testing". It doesn't replace unit testing, but it can be super valuable.

What’s described in the OP (Jane Street) is regression testing.

What the commenter just described is tautology testing: whatever result of the computation I get is what I expected.
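A rough sketch of the difference, using the Fibonacci example mentioned elsewhere in the thread. All names here are hypothetical, and `buggy_fib` is a deliberately broken stand-in:

```python
def fib(n):
    """Iterative Fibonacci with fib(0) == 0 and fib(1) == 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def buggy_fib(n):
    """Deliberately wrong for n == 15, to show what each test style catches."""
    return 510 if n == 15 else fib(n)

# Regression test: the expected value was recorded once, reviewed, and committed.
RECORDED_FIB_15 = 610

def regression_test(impl):
    return impl(15) == RECORDED_FIB_15

# Tautology test: the "expected" value is recomputed from the code under test,
# so it passes no matter what the code returns.
def tautology_test(impl):
    expected = impl(15)
    return impl(15) == expected

assert regression_test(fib)
assert not regression_test(buggy_fib)   # the recorded value catches the bug
assert tautology_test(buggy_fib)        # the tautology does not
```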

You are missing the point entirely. It’s actually discussed at length in the article btw if you had bothered reading it.

Regression tests are extremely useful because you don’t want working code to get broken, but they are tedious to write. What the author is describing is pretty much how everyone does it when they want anything moderately complex in the test: you just run the code and then copy-paste the output. Having something do it for you in a frictionless way is a huge win.

Plus the way the framework works you can still test expected behaviours before writing the code if that’s what you actually want.

Think of it as manual testing where your work is captured so it can be run later in an automated fashion. There are many problems where verifying the answer is easier than coming up with the answer.

Asserting formatted output can also be really useful. A picture might be worth a thousand words, but when it comes to tests it can save you a thousand asserts. Writing those thousand asserts separately also would be so tedious that in practice you'd probably not write them all, leaving part of your output uncovered by tests.

When I wrote a LALR parser generator for fun, I added some code to print out a nicely formatted parsing table with debugging information. Besides being useful for debugging, it let me write simple yet powerful tests: I would feed the generator a grammar and then assert on the formatted parsing table. That made it easy to verify that I was asserting the right thing, and let me assert everything in one go.
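A toy version of that idea, assuming a made-up table of (state, symbol, action) entries rather than output from a real parser generator. The formatter and the rows are hypothetical; the point is that one string comparison covers every cell at once.

```python
def format_table(rows):
    """Render (state, symbol, action) triples as an aligned text table."""
    header = ("state", "symbol", "action")
    all_rows = [header] + [tuple(str(cell) for cell in row) for row in rows]
    widths = [max(len(row[i]) for row in all_rows) for i in range(3)]
    lines = [
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in all_rows
    ]
    return "\n".join(lines)

# Hypothetical parse-table entries, not derived from a real grammar.
rows = [(0, "id", "shift 3"), (3, "+", "reduce 2"), (1, "$", "accept")]

EXPECTED = """\
state  symbol  action
0      id      shift 3
3      +       reduce 2
1      $       accept"""

# One assertion on the formatted output instead of one assert per cell.
assert format_table(rows) == EXPECTED
```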

> locking in your implementation such that it can't change later

That's the whole point of tests. All tests do that.

This protects against later code changes that change behavior (output or side effects) unintentionally.

When you intend to change behavior then you need to change the tests too.

I disagree.

Tests should define what the expectations are. If a change does not impact those expectations, then it should be allowed and not break any tests.

Locking your code such that all future changes require updating old tests tells me that your tests are just your code written a second time, with no thought about what the code's requirements are.

In many contexts, there's just no such thing as a safe behavior change which should be allowed without a specific decision from you to allow it. As a database systems guy, I've seen countless examples of customer breakages caused by a developer's decision that some behavior or another is so trivial it doesn't need to be tested.

When you're working on developing a random utility function (real example!), it's easy to say "come on, it's no big deal to return DECIMAL(14, 4) instead of DECIMAL(12, 3)". It feels like they're basically the same, updating the test is make-work, and the guidelines saying you must document it as a breaking change are pointless annoyances. It's hard, requiring substantial amounts of knowledge and expertise, to recognize that this change will cause a production outage because the schema of a customer's view is no longer write-compatible with their existing data.

In your story though the hapless dev just changed the test. And the reviewers approved it.

This suggests that there are so many changes to tests that it's just become background noise.

It had, and that's precisely because of the lack of anything like the expect() tests described in the OP. It's laborious to reliably scan through a big test diff and identify when it's describing a user-facing change, and people are inevitably going to autopilot through it. If you have a golden file (the standard name in my area for an equivalent mechanism to expect() tests), the reviewer's work is a lot simpler: any non-append-only diff is a breaking change and must be either fixed or communicated broadly before deploying it.
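A minimal sketch of the "append-only" rule for golden files. The helper name and the golden-file contents are invented for illustration; real tooling would work on files and diffs rather than strings.

```python
def is_append_only(old_golden: str, new_golden: str) -> bool:
    """True if the new golden output keeps every existing line and only adds lines after them."""
    old_lines = old_golden.splitlines()
    new_lines = new_golden.splitlines()
    return new_lines[:len(old_lines)] == old_lines

old = "round_price(1.2345) -> DECIMAL(12, 3)\n"

# Adding a new case at the end: fine, nothing existing changed.
added = old + "round_price(0) -> DECIMAL(12, 3)\n"
assert is_append_only(old, added)

# Changing an existing line: flag it as a potential breaking change.
changed = "round_price(1.2345) -> DECIMAL(14, 4)\n"
assert not is_append_only(old, changed)
```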

Implementation !== Behavior. You want to test the behavior, not the implementation. I'd expect tests to change when behavior changes, but if you're reimplementing the same behavior, the tests should pass when you're done.
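As a small illustration (function names are made up), the same behavioral check can be run unchanged against two different implementations:

```python
def dedupe_v1(items):
    """Order-preserving dedupe, relying on dict insertion order."""
    return list(dict.fromkeys(items))

def dedupe_v2(items):
    """Same behavior, written as an explicit loop."""
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def check_dedupe_behavior(dedupe):
    """Asserts only on inputs and outputs, never on how the result is computed."""
    assert dedupe([3, 1, 3, 2, 1]) == [3, 1, 2]
    assert dedupe([]) == []
    assert dedupe(["a", "a", "a"]) == ["a"]
    return True

# Swapping the implementation does not require touching the test.
check_dedupe_behavior(dedupe_v1)
check_dedupe_behavior(dedupe_v2)
```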

Yeah in their Fibonacci example if it printed out 510 instead of 610 you'd still have a bug and think you had tested it. Especially confusing for future people who will assume it works because there are passing tests!

The title mentions writing tests as if they are repl sessions because you're supposed to iterate until you have the correct result.

How do you know if you have the right result though? You might know if you have a plausible result. Like if it output -1 then you know something is wrong I guess.

There's a much higher chance of detecting bugs that give plausible output if you aren't given the opportunity to say "eh looks plausible I won't bother double checking it".

Any programmer dumb enough to just blindly accept that their program is correct is also a dumb enough programmer not to have begun writing a test in the first place. If this gets the friction of writing a test at all so close to zero that these programmers start writing tests (albeit sometimes blindly accepting the output), then it's better than just trying their program on some inputs and calling it a day. It writes down the current output of the program. That's a big step up already. Now people evaluating the code can read some of its outputs without downloading anything.

I personally already use a similar cycle to expect-test when I write tests. A great place to start when writing test assertions is the debug output, just like this thing uses. Then you convert the output into assertions after you have thought through which parts are right or wrong. Just like you can do with expect-test, but without the automation. If you don't know whether the output is right or not, just add an assert(false, "hmm, not sure about this") aka todo!() and voilà, your test fails and future you can be prompted to check over it again.
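Roughly what that looks like in Python, with a hypothetical `parse_duration` function and a hypothetical `unreviewed` helper standing in for `todo!()`:

```python
import re

def parse_duration(text):
    """Hypothetical function under test: '1h30m' -> seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)([hms])", text))

def unreviewed(note):
    """Fail loudly until a human has decided what the right answer is."""
    raise AssertionError(f"unreviewed expectation: {note}")

def check_parse_duration():
    # Copied from debug output, then verified by hand: 1*3600 + 30*60.
    assert parse_duration("1h30m") == 5400

    # Debug output says 0 here. Plausible, but nobody has decided whether a
    # bare number should mean seconds or be an error, so keep the test red.
    observed = parse_duration("90")
    unreviewed(f"parse_duration('90') returned {observed}")
```

Calling `check_parse_duration()` fails on purpose, which is the prompt for future you to go back and decide.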

Sometimes the output is obviously wrong, but you still don't know what the right output is. (At this point you know you're doing useful work!) The remedy is the same. Just make the test fail somehow.

> Any programmer dumb enough to just blindly accept that their program is correct is also a dumb enough programmer not to have begun writing a test in the first place.

Then what's the point of this methodology? It requires you to write tests and also blindly accept that your program is correct.

Maybe they should just rename it to "plausibility tests" or similar because that's what they're really testing. And while that does have some value, I think most of the value is negated by the fact that it sounds like they are properly vetted tests which they are not.

So a more appropriate name would help a lot. I still think it's a bad idea though.

> It requires you to write tests and also blindly accept that your program is correct.

No. You can say no. Just don’t accept it. You’re a human and it asks. Even if you do accept it you can modify it because you have eyes and a keyboard and it’s written right there where you wrote your test.
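To make the "it asks" part concrete, here is a toy sketch of the accept step. This is not the API of expect-test or of the framework in the article; `check_snapshot` and its `update` flag are invented for illustration.

```python
def check_snapshot(actual, snapshot, update=False):
    """Compare output against a stored snapshot.

    Returns (passed, snapshot_to_store). The stored value is only replaced
    when the caller explicitly opts in with update=True.
    """
    if actual == snapshot:
        return True, snapshot
    if update:
        return True, actual      # a human chose to accept the new output
    return False, snapshot       # default: fail and leave the snapshot alone

# Mismatch without opting in: the test fails and the expectation survives.
assert check_snapshot("510", "610") == (False, "610")

# Mismatch with an explicit accept: the snapshot is rewritten.
assert check_snapshot("610", "", update=True) == (True, "610")
```

The setup described at the top of the thread effectively ran with the accept step always on, which is where it went wrong.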

See https://github.com/rust-analyzer/expect-test for a demo gif of the rust version.

It's a repl, so you build the final output incrementally. Testing becomes part of the development workflow like you would do in languages that rely on the repl like lisps.

For example, you start with the inputs and apply the first layer of transformations, then check that what it does makes sense. Then maybe you refactor it out into its own function and add the generated test for it. Then you move on to the next step, and so on, until you have the final result.

For Fibonacci (or indeed the result of most mathematical calculations) it makes no sense but I use this kind of thing all the time where the expected output is, for example, a templated string like an error message.

There are plenty of kinds of test outputs where rewriting the test and eyeballing the result is quicker, easier and ultimately better.

It makes sense in scenarios where it's easier to verify a provided solution than it is to create one.
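A standard illustration of that asymmetry (my example, not one from the article): finding the prime factors of a number takes a search, but checking a proposed factorization is a few multiplications and primality checks.

```python
def is_prime(n):
    """Trial-division primality check; fine for small illustrative numbers."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def is_valid_factorization(n, factors):
    """Check a proposed prime factorization without having to find it."""
    product = 1
    for f in factors:
        if not is_prime(f):
            return False
        product *= f
    return product == n

# Eyeballing generated output plays the role of this check:
assert is_valid_factorization(8051, [83, 97])
assert not is_valid_factorization(8051, [83, 96])
```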

If you’re autogenerating your tests from a specification and not an implementation then it can potentially be useful.

In many contexts there's value in ensuring the behavior doesn't change without being noticed. You're just moving the developer thinking about the expected behavior from when the test is written to when the test fails.

See the related memes "code never lies", "the code is the contract" and “when I use a word, it means just what I choose it to mean — neither more nor less."