by shaftway 749 days ago
I don't buy this argument. Most game developers I know have said that unit tests are a waste of time so they never use them, but they're struggling with making changes to utility code and making sure that it doesn't do the wrong thing. Y'know, what unit tests are for.

I think the key here is that the perceived cost / benefit ratio is too high. It's the perception that drives their behavior though. I'm in a company now that has zero unit tests, because they just don't see the value in it (and in their case they may be right for a whole slew of reasons).

Also, remember that games are not very long-lived pieces of software. You build it, release it, maybe patch it, and move on. If the game moves to version 2 then you're probably going to re-write most of the game from scratch. When you support software for a decade then the code is what's valuable, and unit tests keep institutional knowledge about the code. But with disposable software like games, the mechanics of the game and IP are what's valuable.

Why would you write a unit test for something you know you're going to throw away in 6 months?

I’ve seen people slog through untested code where they fear to make a change but I’ve also seen people slog through code with too much test coverage where the tests go through constant churn.

I don’t understand why people don’t just add one test even if the codebase otherwise has zero tests if they’re so scared of one area and I don’t get why people keep adding excessive coverage if it’s wasting their time.

It’s like people pick a stance and then stick with it forever when I couldn’t care less how I’ve been doing something for 10 years if today you showed me a better way.

>too much test coverage where the tests go through constant churn

This doesn't sound so much as too much coverage but rather like having your automated tests be coupled to implementation details. This has a multitude of possible causes, for example the tests being too granular (prefer testing at the boundary of your system). I've worked in codebases where test-implementation detail coupling was taken seriously, and in those I've rarely had to write a commit message like "fix tests", and all that without losing coverage.

Even if the tests aren’t coupled to implementation details, in most projects the specification itself goes through many changes. Furthermore, as the implementation is being changed, it stops depending on some lower-level helper code and requires new code with a different purpose; the tests in the old code turn out to be largely (albeit not entirely) a waste of effort.

Changing specifications and code which turns out to be unnecessary aren’t ideal, but I believe they’re inevitable to some extent (unless the project is a narrow re-implementation of something that already exists). There are questions like “how will people use this product?” and “what will they like/dislike about it?” that are crucial to the specification yet can’t be answered or even predicted very well until there’s already an MVP. And you can’t know exactly what helper classes and functions you will use to implement something until you have the working implementation.

Of course, that doesn’t mean all tests are wasted effort; development will be slower if the developers have to spend more time debugging because, without tests, they don’t know where bugs originate. There’s a middle ground, where you have tests to catch probable and/or tricky bugs, and tests for code unlikely to be made redundant, but don’t spend too long on unnecessary tests for unnecessary code.

It feels like there are two levels of test writing proficiency. The first is writing the tests that have high benefit and low cost: e.g. pure functions with comprehensive tabular tests, simple method chains that have well defined sequential behavior and few dependencies, high value regression tests against detailed bug reports, etc. IMO it's harder to argue against writing these tests than to argue for writing them.
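To make "pure functions with comprehensive tabular tests" concrete, here is a minimal sketch. The `clamp` function and the rows in `CASES` are made up for illustration; nothing here comes from a codebase mentioned in the thread.

```python
def clamp(value, low, high):
    """Hypothetical pure function: limit value to the inclusive range [low, high]."""
    return max(low, min(value, high))


# Each row is (value, low, high, expected). Adding a case means adding a row.
CASES = [
    (5, 0, 10, 5),     # inside the range
    (-3, 0, 10, 0),    # below the range
    (42, 0, 10, 10),   # above the range
    (0, 0, 10, 0),     # on the lower edge
    (10, 0, 10, 10),   # on the upper edge
    (7, 7, 7, 7),      # degenerate range
]


def run_table(cases):
    """Return the rows whose actual result differs from the expected one."""
    failures = []
    for value, low, high, expected in cases:
        actual = clamp(value, low, high)
        if actual != expected:
            failures.append((value, low, high, expected, actual))
    return failures


# An empty failure list means every row of the table passed.
assert run_table(CASES) == []
```

The cost is a handful of lines per case and the test never needs to know how `clamp` is implemented, which is roughly why this kind of test is hard to argue against.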

Then there's the second level of proficiency, related to what you're discussing with "test-implementation detail coupling". This is the domain of high test coverage, repeatable end-to-end tests, automated QA, etc. I've always struggled with this next level and I've yet to work in any environment where it was done effectively (if at all). It's also harder to argue for this kind of testing because the tests often end up brittle and false alarms drown out the benefits.

Moreover, most of the discourse centers around the first level of proficiency only and it's much harder to find digestible advice for achieving the second.

> This doesn't sound so much as too much coverage but rather like having your automated tests be coupled to implementation details

Depending on how high coverage you are aiming for, I find it hard to imagine a way to achieve it without inevitably tying the tests to implementation details

Decoupling tests from what they test does require a concerted effort, and is a skill that requires practice, but in general not as much as you'd think. Most devs are quite comfortable getting rid of coupling between two non-test components (functions, classes, services, whatever) of their system. The main mental hurdle seems to be treating your automated test code as any other code, capable and worthy of being decoupled.

There will always be some coupling between test and testee (for example, if I change my `double` function from doing `x -> 2 * x` to `x -> (x, x)`, my tests for `double` had better fail), but there is a lot of coupling which is unnecessary, and this can often be removed. Like decoupling any two pieces of code, there is no one size fits all solution. There are some common sources of unnecessary coupling, and some rules of thumb to avoid them.

For example, let's say we have a (public) sorting routine `sort` which consists of two (private) phases: `prep` and `finish`. The implementation of `sort` would look like

  def sort(xs):
      prepped_xs = prep(xs)      # private phase 1
      return finish(prepped_xs)  # private phase 2

Let's look at two ways to write a test suite for `sort`.

The first is quite simple, just chuck in a bunch of unsorted arrays of varying sizes and degrees of unsortedness, and assert that what comes out is sorted:

  assert sort([]) == []
  assert sort([3, 2, 1]) == [1, 2, 3]
  # etc.

The other way is to make the tests more specific by testing `sort`, `prep`, and `finish` separately. For example, we might mock `prep` and do

  x = [1, 2]
  sort(x)
  mocked_prep.assert_called_with(x)
  # and something similar for finish

and have individual tests for `prep` and `finish`:

  assert is_prepped(prep([1, 2]))
  # etc.
  assert is_sorted(finish(prepped_xs))
  # etc.

Now suppose you decide that the code would be a lot more readable if some functionality of `prep` moved to `finish`. Nothing changes in the functionality of `sort` as this is purely a refactoring. When you make this change, the first test suite will pass, but the second will fail and require changes (i.e., it is coupled to implementation details of `sort`), because you've changed what `prep` and `finish` do.

So here, by increasing the scope of your test suite from `prep` and `finish` to `sort`, you've achieved looser coupling between the test suite and what is being tested. This can be applied much more generally: by testing at the boundary of your system (at a higher scope), you achieve looser coupling between your tests and your code. That is, you'll have fewer false failures.
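A runnable sketch of the `sort` example, under some assumptions: the `Sorter` class, the bodies of `prep` and `finish`, and the `RefactoredSorter` variant are invented here so the two suites can be run against both layouts. Only `unittest.mock` from the standard library is used.

```python
from unittest import mock


class Sorter:
    """Original layout: sort = finish(prep(xs)). Phase bodies are hypothetical."""

    def prep(self, xs):
        return list(xs)            # phase 1: defensive copy

    def finish(self, prepped_xs):
        prepped_xs.sort()          # phase 2: order in place
        return prepped_xs

    def sort(self, xs):
        return self.finish(self.prep(xs))


class RefactoredSorter(Sorter):
    """Same observable behaviour, but the phases are folded into sort."""

    def sort(self, xs):
        return sorted(xs)


def boundary_suite(sorter):
    # Checks only the public contract of sort.
    assert sorter.sort([]) == []
    assert sorter.sort([3, 2, 1]) == [1, 2, 3]
    assert sorter.sort([2, 1, 2]) == [1, 2, 2]


def mock_coupled_suite(sorter):
    # Also checks *how* sort works: it must delegate to prep.
    with mock.patch.object(sorter, "prep", wraps=sorter.prep) as mocked_prep:
        x = [1, 2]
        sorter.sort(x)
        mocked_prep.assert_called_with(x)


def passes(suite, sorter):
    """Run a suite and report whether every assertion held."""
    try:
        suite(sorter)
        return True
    except AssertionError:
        return False


results = {
    (suite.__name__, type(sorter).__name__): passes(suite, sorter)
    for suite in (boundary_suite, mock_coupled_suite)
    for sorter in (Sorter(), RefactoredSorter())
}
```

With this setup, `boundary_suite` passes for both classes, while `mock_coupled_suite` passes for `Sorter` and fails for `RefactoredSorter` even though callers of `sort` cannot tell the two apart.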

> So here, by increasing the scope of your test suite from `prep` and `finish` to `sort`, you've achieved looser coupling between the test suite and what is being tested

But you won't have very good coverage of "prep" or "finish"

You can of course test those functions by carefully constructing test cases for "sort", but that essentially just re-introduces the coupling to implementation details at a higher level

That is indeed a drawback of this approach, and is problematic if and only if `prep` and `finish` are part of the public interface. If they're not, it's a worthwhile exercise to ask yourself what you want from your test suite, and whether or not individual tests for prep and finish support that goal. For me personally, my ultimate goal for a test suite is to get as close to the equivalence "the tests do not pass <-> pushing this to production would break the product" as humanly possible, as I really don't like spending time "fixing" tests. This leads me to test at the system boundaries, i.e., to leave the lower level tests out when I can. What value does testing stuff that will not affect production in any way bring me?

> but that essentially just re-introduces the coupling to implementation details at a higher level

The definition of coupling that I find useful is the following. A thing f is coupled to a thing g w.r.t. a change d in g if changing g by d requires you to change f. I find that it captures exactly the phenomenon which makes maintenance and extension of brownfield projects so costly.

So, for example, both test suites I gave as an example are coupled to sort with respect to the change "sort sorts stuff in descending order instead of ascending". Making this change requires you change pretty much every assert but the trivial ones, in both test suites.

With respect to the change "move some code from prep to finish" or "get rid of prep and finish entirely and move everything to the body of sort", only the second suite is coupled to sort.

This may not be the definition of coupling that you like to use. If it is, I don't see how the test suite of the first kind including edge cases for prep and finish at the level of sort is still coupled to implementation details.

This is the way. My work codebase has probably 5% unit test coverage -- it's frontend and a lot of it isn't sensible to unit test -- but I'm quite happy to have the tests we do. If it's nontrivial logic, just test it. If it isn't (it's trivial, it's aesthetic, whatever your reason)... just don't.

All the places I've worked had some balance here, but it was definitely toward the very-few-tests end.

We would write tests to catch a bug in a low level system, and keep the test after. We had lots of Design by Contract, including Invariants that were enabled in debug mode.
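For readers who haven't seen Design by Contract with debug-only invariants, here is a rough sketch of the idea in Python. The `Inventory` class is hypothetical, and Python's `assert` plus `__debug__` stand in for whatever debug-build macros a game studio would actually use; this is an analogy, not a description of the parent's codebase.

```python
class Inventory:
    """Hypothetical low-level container with a fixed capacity."""

    def __init__(self, capacity):
        assert capacity > 0, "precondition: capacity must be positive"
        self.capacity = capacity
        self.items = []
        self._check_invariant()

    def _check_invariant(self):
        # __debug__ is False when Python runs with -O, so this check
        # disappears in "release" runs, much like a debug-only macro.
        if __debug__:
            assert 0 <= len(self.items) <= self.capacity, "invariant: size within capacity"

    def add(self, item):
        assert len(self.items) < self.capacity, "precondition: inventory is not full"
        self.items.append(item)
        self._check_invariant()
        assert self.items[-1] is item, "postcondition: item was stored last"


inventory = Inventory(capacity=2)
inventory.add("sword")
inventory.add("shield")
# A third add would trip the precondition in a debug run.
```

The checks run on every call during development and QA play sessions, so they catch broken assumptions without anyone having to write a separate test per scenario.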

But the reality was that we couldn't test gameplay code very well. That changed so dramatically over the course of a project that if we did test we would just end up commenting out tests by the end of a project.

And as an optimisation guy, I would often have to change the "feel" of gameplay code to get performance out of code, which is checked by a Quality Assurance team, because it's subjective. That kind of stuff would make gameplay tests very brittle.

The pace of game dev was incredibly fast. We were struggling to get all our stuff in, never mind adding any scaffolding that would slow us down.

Overtesting usually comes from TDD cargo culting where literally every function in the code, no matter how small, is a "unit" that needs to be tested.

Valve became serious about software quality in Dota 2 around 2017 - about 7 years after launch. Before that, game updates were accompanied by lots of bugs that would take weeks to fix. These days, there are still tons of bugs, but it's much better than before. They just released one of the biggest updates in the game's history this week, and there are hardly any bugs being reported.

I am pretty sure there is some sort of automated testing happening that is catching these bugs before release.

Reminds me of an article about the testing infrastructure of League of Legends [1] back in 2016. 5500 tests per build in 1 to 2 hours.

Games are extremely hard to test. For me it falls into the same category as GUI testing frameworks, which imho are extremely annoying and brittle. Except that games are comparable to a user interface consisting of many buttons which you can short and long press and drag around while at the same time other bots are pressing the same buttons, sharing the same state influenced by a physics engine.

How do you test such a ball of mud, which also constantly changes as devs try to follow the fun? Yes, you can unit test individual, reusable parts. But integration tests, which require large, time-sensitive modules, all strapped together and running at the same time? It's mind-bogglingly hard.

Moreover, if you're in a conceptual phase of development and prototyping an idea, tests make no sense. The requirements change all the time and complex tests hold you back. But the funny thing is that game development stays in that phase most of the time. And when the game is done, you start a new one with a completely different set of requirements.

There are exceptions, like League of Legends. The game left the conceptual phase many years ago and its rules are set in stone. And a game which runs successfully for that long is super rare.

[1] https://technology.riotgames.com/news/automated-testing-leag...

I recall some Minecraft tests being saved worlds with redstone logic that will light a beacon green if it is working or red if not. That's useful for games like that.

For games like Starcraft 2 with replay functionality, you could probably record/use several matches and test that the behaviour matches the recorded behaviour. If you can make your game have a replay feature you can make use of this, even if you don't ship that replay code.

For things like CYOA type games or decision trees, you could have a logging mechanism that prints out the choices, player stats, hidden stats, etc. and then have a way to run through the decisions, then check the actual log output against the expected output. -- I've done something similar when writing parsers by printing out the parse tree (for AST parser APIs) or the parse events (for reader/SAX parser APIs).

I'm sure there are other techniques for testing other parts of the system. For example, you could test the rendering by saving the render to an image and comparing it against an expected image. IIRC, Firefox does something similar for some systems like the SVG renderer and the HTML paint code.

Various of these features (replay, screenshots) are useful to have in the main game.
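The replay idea above can be sketched in a few lines, assuming the simulation is deterministic given a seed and a list of inputs. Everything here (`simulate`, the commands, the state tuple) is a toy stand-in, not how Starcraft 2 or any real engine does it.

```python
import random


def simulate(seed, inputs):
    """Hypothetical deterministic game loop: one input per tick, returns the state per tick."""
    rng = random.Random(seed)          # all randomness comes from the recorded seed
    x, hp = 0, 100
    states = []
    for command in inputs:
        if command == "left":
            x -= 1
        elif command == "right":
            x += 1
        elif command == "attack":
            hp -= rng.randint(1, 6)    # damage taken from the counter-attack
        states.append((x, hp))
    return states


def record(seed, inputs):
    """Capture a match: the seed, the inputs, and the states they produced."""
    return {"seed": seed, "inputs": list(inputs), "states": simulate(seed, inputs)}


def first_divergence(replay):
    """Re-run the recorded inputs; return the first tick whose state differs, or None."""
    actual = simulate(replay["seed"], replay["inputs"])
    for tick, (expected, got) in enumerate(zip(replay["states"], actual)):
        if expected != got:
            return tick
    return None


replay = record(seed=1234, inputs=["right", "right", "attack", "left", "attack"])
assert first_divergence(replay) is None
```

Reporting the first diverging tick, rather than only pass/fail, is what makes a failing replay debuggable: you know roughly where in the match behaviour changed.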

You're right about the parts which are mostly state machines. They have a defined input and output. Tests are straightforward to implement and adjust.

But recording and replaying matches? Taking screenshots and comparing the output? Just think about it: If you have recorded a match and change the hitpoints of a single creature, the test could possibly fail. And then? Re-record the match?

The same applies to screenshots: What happens if models, sprites or colors change?

In my experience, tests like this are annoying, because:

1) They take a long time to create and adjust/recreate.

2) They fail for minor reasons.

3) It takes time to understand what such tests even measure if someone else made them.

4) You need a large, self made framework to support such tests.

5) It takes a long time to run them, because they are time dependent.

6) They hinder you from making large changes.

7) It's cheaper to make some low wage game testers play your game. Or better, make the game early access and let 1000s of players test your game for free, while even making money off them.

Yes, when you are trying to intentionally change the output, you simply regenerate the gold file to be used as reference (and yes, it should be easy). It’s brittle for sure but it does catch unintentional changes and should be used where relevant (if sparingly). There are definitely existing frameworks that do this (eg Jest calls this snapshot testing and has tooling to make it easy).

I’m sorry your experiences with this kind of stuff have been bad. I’ve generally had good experiences in the machine learning space where we used it judiciously where appropriate but didn’t overdo it.

I don’t see how it can ever hinder you though - you can always choose to go “I don’t care that the output has changed dramatically - it’s the new ground truth” as long as you communicate that’s what’s happening in your commit. What it doesn’t let you do is have output that differs every time you run it, but that’s generally a positive (randomness should be intentionally injected deterministically).
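A minimal sketch of the gold-file workflow described above, with randomness injected through a seed. `generate_loot`, `check_golden`, and the file name are all hypothetical; this is hand-rolled for illustration and is not Jest's snapshot API.

```python
import json
import random
import tempfile
from pathlib import Path


def generate_loot(seed, count=3):
    """Hypothetical system under test; its randomness is injected through the seed."""
    rng = random.Random(seed)
    return [
        {"item": rng.choice(["sword", "shield", "potion"]), "gold": rng.randint(1, 100)}
        for _ in range(count)
    ]


def check_golden(golden_path, actual, regenerate=False):
    """Compare actual output with the stored reference; rewrite the reference when asked."""
    text = json.dumps(actual, indent=2, sort_keys=True)
    if regenerate or not golden_path.exists():
        golden_path.write_text(text)   # accept the current output as the new ground truth
        return True
    return golden_path.read_text() == text


with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "loot.golden.json"
    first_run = check_golden(golden, generate_loot(seed=7))           # creates the gold file
    same_seed = check_golden(golden, generate_loot(seed=7))           # deterministic, so it matches
    changed = check_golden(golden, generate_loot(seed=7, count=4))    # unintended change is caught
    accepted = check_golden(golden, generate_loot(seed=7, count=4), regenerate=True)
    after_accept = check_golden(golden, generate_loot(seed=7, count=4))
```

In a real project the regenerated file would be committed, so the diff of the gold file is the record of "this output changed on purpose".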

I doubt Dota 2 devs are writing code like this to test. The game is far too complicated, even more so than league, and changes a lot over the years, for this to be viable.

Dota 2 and OpenAI had a collaboration in 2018ish, and during this time the Dota 2 bots system was reworked completely. They already can generate videos of every spell in action [1], and I would assume this is done by asking AI bots to demonstrate the spell. My guess is that before pushing out an update, a human looks at these videos and other more complex interaction videos for every major change, along with relevant numbers (damage, healing, movement speed), and sees if everything makes sense.

I think this, because a lot of times recently, changes in one hero often cause an un-updated hero to break, because they had some backend similarity. And the patch is released with the bug.

Then again, there is no public info, so all the above are wild speculations.

[1] example https://www.dota2.com/hero/treantprotector

> They already can generate videos of every spell in action [1]

I'm fairly certain those videos are all handmade. (Yes, all 500+ of them.) Notice that the videos for each hero are recorded in different locations on the map, and the "victim" hero isn't always the same.

In my experience (full stack web development), unit tests are mostly useless and it is the high-level system tests which add real value. Unfortunately it can take a fair amount of work or skill to architect the test suite in the first place, but once it’s working you can write elegant tests that verify large swathes of code with fairly few lines of test code.

I think UI testing in general is hard though, and given how large a part of games involves UI, that’d be the real reason games don’t have many tests.

Agreed. We don't have a lot of UI unit tests in our software in our day job (almost none), but we have extensive tests for utility and data processing functions.

And that's pretty much the same for me in the game I'm making in my spare time. I have no unit tests for UI (it's not worth it, I can easily see when something in the UI isn't working, it's more important for me to just log the bug so I don't forget about it).

But for game logic, like verifying calculations for A.I. are happening as expected, or functions that manipulate numbers on the screen in different ways (scores, power adjustment, etc), yeah I write unit tests for those. And to the article's point, sometimes I have to significantly adjust or redo or even scrap them because I happened to think of a different mechanism and it seems to play better.

There was a long time (over about a dozen games I made) where I never bothered to write a unit test, and I still might not for a tiny game. But for my most recent bigger game (which I started a few years ago), I finally decided to write a few for some tricky numeric logic in the game, and it immediately helped me resolve a logic bug I was seeing periodically but was having a hard time pinning down the cause of it with breakpoints and logs. So I do try to do it more often for checking those things.

Part of it is terminology. You get "unit testing" and "functional testing" and "integration testing" and "system testing" thrown around, often with people meaning different things by these, and vague definitions that partially or wholly overlap.

My rule of thumb is really simple: a test should always be defined in terms of what the user expects. Thus for most apps you should, at the minimum, have tests corresponding to their functional specification. In addition, if the app contains functionality that is consistently reused inside (i.e. embedded libraries), then users of those libraries are the code that calls into them, and so there should also be tests at that boundary (but only after you wrote the high-level tests). Repeat recursively until you get to the bottom.

Testing is a continuum. I don't write a test for every change. Sometimes I spend a week writing tests for a simple change.

I will say that I've never said "I wish I didn't write a test for that". I have also never said, "your PR is fine, but please delete that test, it's useless".

I throw away a lot of code. I still test stuff I expect to throw away. That's because it probably needs to run once before I throw it away, and I can't start throwing it away until it works :/

What it comes down to is what else you have to spend your time on. Sometimes you need to experiment with a feature; get it out to customers, and if it's buggy and rough around the edges, it's OK, because you were just trying out the idea. But sometimes that's not what you want; whatever time you spend on support back and forth finding a bug would have been better spent not doing that. The customer needed something rock solid, not an experiment. Test that so they don't have to.

There are no rules. "Write a test for every change" is just as invalid and unworkable as "Never write any tests". It's a spectrum, and each change is going to land somewhere different. If you're unsure, ask a coworker. I have been testing stuff for 20+ years, and I usually guess OK (that is when I take a shortcut and don't test as much as I should, it's rarely the thing that caused the production outage), but a guess is just that, a guess. Solicit opinions.

> Also, remember that games are not very long-lived pieces of software. You build it, release it, maybe patch it, and move on.

This was true a couple decades ago. Nowadays many games are cash cows for decades. Path of Exile was released in 2013, Minecraft in 2011, and World of Warcraft in 2004, and all of those continue to receive regular updates (and have over the course of their lives) and still make plenty of money today. Dwarf Fortress has been in continual development since 2002! (Although probably not your ideal cash-flow model.)

Or you have the EA Sports model where you use the same "engine" and just re-skin some things and re-release the same game over and over. There has been a new "Football Manager" game every year since 2005 -- do you really think they throw out all their code and start over every year?

I maintain that the majority of games are still disposable, despite the occasional subscription model or long-lived hit that pops up. Remember that most games aren't made by AAA studios.

Wasn't Minecraft completely rewritten from scratch in Java after a few years?

And the EA one, like you said, it's just model updates. Very few gameplay mechanics get more than a simple tweak. Just recompile with the new models. You don't need unit tests if the code never changes.

The original Minecraft is in Java; it's probably gone through a lot of code transformation. The version you're thinking of is the Microsoft version, rewritten in C++.

I thought the MS version was C#?

I think Minecraft was originally written in Java and rewritten in a good programming language (i.e. not Java).

Whether or not one thinks C++ is a "good" language, I always thought that (original) Minecraft busted the myth that blockbuster games had to be written in C++.

Being written in Java was probably instrumental in enabling the huge modding community around Minecraft. Which in turn was probably in large part responsible for its success.

And more to the point for this thread, writing it in Java let Notch build and iterate extremely quickly. Minecraft originally came out of a 24 hour game writing competition in which most competitors were using C++, but Notch always used Java because coding speed was the most critical thing in that context.

It should have been written in C# instead; the developers had to resort to silly optimization tricks that often never transpired.

You can add rigor to your decade-plus cash cow later, once it’s clear that you’ve hit the jackpot.

I wonder how many games have been released that could have been jackpots but weren't due to bugs and lack of rigor.

Tests also aren't just about rigor. They don't take years to pay dividends. The time you save just being able to develop and iterate without having to spin up the whole app and manually click through things to test them is huge. Not to mention the time saved hunting down regressions.

I still play games that came out a couple decades ago…

Let me guess... Super Metroid? Chrono Trigger? Final Fantasy VI? Ultima Underworld? Symphony of the Night?

There were a few decent games released in the '80s and '90s.

The fact that you put Ultima Underworld in the company of those masterpieces makes me think I should probably give that game a shot.

I only tried playing one Ultima game a long, long time ago, and I couldn't get into it. I'm guessing that one is a particularly good one, though.

Ultima Underworld is a bit different from the mainline Ultima games. Those are fairly regular RPGs (though I'd say that IV through VII are really good RPGs), but Ultima Underworld is more like an immersive sim. In fact it pretty much created the immersive sim genre. I personally kinda prefer UU2 over UU1, but they're both excellent.

I've written unit tests for functions that are mostly math (base cases of collision, parts of bot predictive logic, and similar).

Though I'm just a hobbyist with an electrical and commissioning background.

I am curious as to why your current company does not have unit tests. Do you mind sharing?

We produce a library that gets included in software made by our clients, and we have several thousand clients. The uptake on new releases is low (most of the clients believe in "if it ain't broke, don't fix it"). So every release has the potential to live in the wild and need support for a long time.

We're also in an industry with a ton of competitors.

On top of that, the company was founded by some very junior engineers. For most of them this was their first or second job out of college. Literally every anti-pattern is in our codebase, and a lot of them are considered best practices by them. Unit tests were perceived as a cost with little benefit, so none were written. New engineers were almost always new grads to save on money.

These facts combined make for an interesting environment.

For starters, leadership is afraid to ship new code, or even refactor existing code. Partially because nobody knows how it works, partially because they don't have unit tests to verify that things are going well. All new code has to be gated by feature flags (there's an experiment right now to switch from try-finally to try-with-resources). If there isn't a business reason to add code, it gets rejected (I had a rejected PR that removed a "synchronized" block from around "return boolValue;"). And it's hard to say they're wrong. If we push out a bad release, there's a very real chance that our customers will pack up and migrate to one of our competitors. Why risk it?

And the team's experience level plays a role too. With so many junior engineers and so much coding skill in-breeding, "best practices" have become pretty painful. Code is written without an eye towards future maintainability, and the classes are a gas factory mixed with a god object. It's not uncommon to trace a series of calls through a dozen classes, looping back to classes that you've already looked at. And trying to isolate chunks of the code is difficult. I recently tried to isolate 6 classes and I ended up with an interface that used 67 methods from the god object, ranging from logging, to thread management, to http calls, to state manipulation.

And because nobody else on the team has significant experience elsewhere, nobody else really sees the value of unit tests. They've all been brought up in this environment where unit tests are not mentioned, and so it has ingrained this idea that they're useless.

So the question is how do you fix this and move forward?

Ideally we'd start by refactoring a couple of these classes so that they could be isolated and tested. While management doesn't see significant value in unit tests, they're not strictly against them, but they are against refactoring code. So we can't really add unit tests on the risky code. The only places that you can really add them without pushback would be in the simplest utility classes, which would benefit from them the least, and in doing so prove to management that unit tests aren't really valuable. And I mean the SIMPLEST utility classes. Most of our utility classes require the god object so that we can log and get feature flags.
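As a rough illustration of what "isolating" a utility from a god object could look like, here is a sketch in Python (the codebase described is clearly Java, so this is only the shape of the idea). `Logger`, `FeatureFlags`, `RetryPolicy`, and the `aggressive_retry` flag are all invented names.

```python
from typing import Protocol


class Logger(Protocol):
    def log(self, message: str) -> None: ...


class FeatureFlags(Protocol):
    def is_enabled(self, name: str) -> bool: ...


class RetryPolicy:
    """Hypothetical utility that used to take the whole god object."""

    def __init__(self, logger: Logger, flags: FeatureFlags):
        # Depend only on the two narrow capabilities actually used.
        self._logger = logger
        self._flags = flags

    def max_attempts(self) -> int:
        attempts = 5 if self._flags.is_enabled("aggressive_retry") else 2
        self._logger.log(f"max_attempts={attempts}")
        return attempts


class FakeDeps:
    """Test double that satisfies both protocols; no god object required."""

    def __init__(self, enabled):
        self.enabled = set(enabled)
        self.messages = []

    def log(self, message: str) -> None:
        self.messages.append(message)

    def is_enabled(self, name: str) -> bool:
        return name in self.enabled


deps = FakeDeps(enabled={"aggressive_retry"})
assert RetryPolicy(deps, deps).max_attempts() == 5
assert deps.messages == ["max_attempts=5"]
```

If the god object already exposes methods with these shapes, production call sites could pass it for both arguments, so the change stays small. That is an assumption about the codebase, and it still counts as the kind of refactoring management is said to reject.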

I say we take off and nuke the entire site from orbit (start over from scratch with stronger principles). It's the only way to be sure. But there's no way I'm convincing management to let the entire dev team have the year they'd need to do that with feature parity, and leadership would only see it as a massive number of bugs to fix.

In the meantime developer velocity is slowing, but management seems to see that as a good thing. Slower development translates into more stable code in their minds. And the company makes enough that it pays well and can't figure out what to do with the excess money. So nobody really sees a problem. Our recruiters actually make this a selling point, making fun of other companies that say their code is "well organized".

Thank you for the write up.

That seems like a bad scenario with bad technical management. I am wondering if you have considered setting unit tests aside and looking at end-to-end tests instead. This might be easier for anti-testing people to buy into because it’s directly ensuring your end users get the desired outcomes.

It doesn’t matter what terrible practices you have inside your library if the output is correct…

If you input 1+1, and it outputs 5, it will be obvious how this can be an issue.

What this will enable you to do is get some quick wins and make refactoring safer.

If management still says no, I see 3 major choices.

1. Quit

2. Write your tests and keep them to yourself

3. Mind control

We do have an integration test that runs just before releases. I've never seen it fail, even when something was obviously broken, so I question the utility of it. There's a specific person in charge of maintaining it.

I've opted for option 4: continue to write code the way they want it written and keep cashing my paychecks. In the meantime there are tons of other improvements that I'm working on, some of which have a more direct impact on business revenue (which has a direct impact on my personal revenue).

As a corollary to 2, management tends to love graphs… whatever you’re using to build should have a plugin that could show unit test success counts and generate even a simple line graph… that alone might be enough incentive to add more testing.

I wouldn’t use the term “unit test” if they are negative on the concept.

Edit: in fact, don’t say test at all. Talk about verification of the output.

How do you stay sane working with clowns?

Many of the things he just described are a rational response to historical circumstance. It's fine to say "we're in a bad place" but that's not the same as saying "we're currently making bad decisions".

The inevitable end result of this approach though is that at some point they will be unable to ship new releases that meet the quality bar of their clients.

I get paid good money to work with clowns.

Also, non-testable code is often faster (as in cpu time).