Hacker News
by laurencei 743 days ago
I'm not a machine learning person - so I'm confused about this.

As someone who doesn't understand ML - I have always assumed the whole point of ML is to try different things in the game, almost randomly, and over (long) periods of time the AI gets better and better at the game.

If having a single unexpected event causes such a large swing in outcome, and the AI can't "explain" what is different to cause the swing, then what exactly is the ML doing for it to fail on such a seemingly simple change? Doesn't that defeat the whole purpose of this?

I'm obviously missing something obvious - because I would assume the real goal of ML is that it can teach itself the game, even if that involves unexpected situations, as a human does?

3 comments

This article doesn't describe it in detail. One imaginable scenario is that they took a model trained on non-full-moon data and evaluated it on a full moon day. That would mean the model simply applied its learned "optimal" action policy in a different environment, where the previously learned policy no longer leads to good scores.
So does this mean if they allowed the game to run on "full moon days", it would be expected to eventually get a higher score (if the full moon day allowed that through the actual game mechanism)?
Yes, the full moon day can help you get a higher score due to the bonus luck you get that day. On the other hand, the full moon day makes werewolves (and wererats and werejackals) a lot more dangerous because they'll always be in animal form.

When you try to fight a werecreature in animal form it can summon large numbers of animals of its kind to attack you. This can be extremely deadly for a player who is unaware of this ability. An experienced player knows to attack werecreatures only at range or avoid fighting them altogether. However, encountering the werecreature in its human form is much less dangerous unless it's carrying a powerful weapon.

It's not a single event, it's more like a new general game state that was never seen during training. Imagine learning to play the violin really well and then someone changes the way acoustics permeate. It doesn't matter if you're a human or an ML algorithm, you're going to have a hard time playing like before.
But something is wrong in the learning, because as a human NetHack player who has ascended, I can say that we don't play radically differently on full moons. Yes, the random numbers go your way slightly more, but that's about it.

This tells me the algo is trying too hard to predict the game or learn a decent static strategy, rather than make situational decisions.

The issue is with werecreatures on a full moon. Most humans exposed to Western culture (likely all NetHack players) have heard of werewolves. I think it’s safe to say that everyone who has heard of werewolves knows they are most dangerous on a full moon. Even if you are a total NetHack beginner you know to avoid these monsters on the full moon. The game even helpfully reminds you of this information both by telling you about the full moon and by having a werewolf howl incessantly when it’s on the same level. However the game does not explicitly fill in the gap for you. It expects culture to do that.

The advantage of human common sense over machine learning models — at least when it comes to role-playing games — is that we carry around a ton of this cultural information. A model trained only on NetHack — not on broader culture or folklore/fairytales/mythology/fantasy — is simply not going to be aware of this link between full moons and specific monsters becoming more dangerous. So if it’s developed a fairly naïve strategy of just fighting or avoiding everything in its path based on a model of relative strength then it’s going to be tripped up when an outside event (the phase of the moon) upends that model.

I disagree. There are a lot of idiosyncrasies about monsters in NetHack such that getting good at NetHack is 99.9% about learning the NetHack world, not the real world. The werecreature game mechanic is no different than the POI or HALLU effect, so I don't think the AI needs any special knowledge.

I bet it comes down to how much memory the algorithm has, since the transformation might occur way later than being bitten, while most poison kills are fairly quick. The problem is NetHack requires you to have at minimum 1000 turns of memory to know when to pray. Even more if you want to keep track of where stuff was.

TFA states that the agent was trained for points, and another user states that some critters are a lot more dangerous during full moons.

Wouldn’t be very surprising if the agent hyper-optimised farming those critters for points. It would not be able to change strategy if the cost/benefit of that farming changed massively, so would now be performing significantly worse.

Humans train simultaneously as they operate, and humans can see the message about the full moon.

If nobody includes the full moon message as input to the ML model, and tries to operate the ML model with the training it has achieved running in non-full-moon mode, its operating score in full-moon-mode may be lower.

Even if it had proportional training time against full-moon-mode to incorporate that into the model, if you don't tell it when full-moon-mode is active wouldn't the optimal behavior be to optimize the score for 27/28 days vs 1/28 days of the month?

If full-moon-mode is an input to the model, then it can be trained to optimize for both scenarios.
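To make the 27/28 vs 1/28 point concrete, here is a minimal sketch in Python. The reward numbers, the two actions, and every name in it are made up for illustration; nothing here comes from NetHack or from the article. It only shows that a policy that cannot see the mode has to pick one action for the whole month, while a policy that gets the mode as input can pick per mode.

```python
# Hypothetical toy numbers, not taken from NetHack or the article.
REWARD = {
    ("normal", "fight"): 10.0,
    ("normal", "avoid"): 2.0,
    ("full_moon", "fight"): -50.0,
    ("full_moon", "avoid"): 2.0,
}
MODE_PROB = {"normal": 27 / 28, "full_moon": 1 / 28}
ACTIONS = ("fight", "avoid")


def best_blind_action():
    """Best single action when the mode is NOT part of the input."""
    def expected(action):
        return sum(p * REWARD[(mode, action)] for mode, p in MODE_PROB.items())
    return max(ACTIONS, key=expected)


def best_informed_policy():
    """Best action per mode when the mode IS part of the input."""
    return {
        mode: max(ACTIONS, key=lambda a: REWARD[(mode, a)])
        for mode in MODE_PROB
    }


def expected_score(policy):
    """Average reward of a mode -> action mapping over the whole month."""
    return sum(p * REWARD[(mode, policy[mode])] for mode, p in MODE_PROB.items())


blind_action = best_blind_action()
blind = {mode: blind_action for mode in MODE_PROB}  # same action in every mode
informed = best_informed_policy()

print("blind policy:", blind, "expected score:", expected_score(blind))
print("informed policy:", informed, "expected score:", expected_score(informed))
```

With these made-up numbers the blind policy settles on fighting everywhere, because that is what pays off on 27 of 28 days, and it takes the full penalty on the one day the rules differ.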

https://nethackwiki.com/wiki/Time

I predict the next "annoying non-bug" will be Friday, June 13th of 2025.
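For anyone who wants to check a date like that themselves, below is a rough Python sketch of the kind of calendar approximation NetHack uses for the moon phase. It is reconstructed from memory of the game's C source, so treat the constants and the function names as an assumption and verify them against the actual code before relying on it. The Friday the 13th check is plain calendar arithmetic.

```python
from datetime import date


def phase_of_the_moon(d):
    """Return 0..7, where 0 is new moon and 4 is full moon (approximation).

    Assumption: constants recalled from NetHack's source, not verified here.
    """
    day_of_year = d.timetuple().tm_yday - 1  # 0-based, like C's tm_yday
    golden = ((d.year - 1900) % 19) + 1      # C's tm_year counts from 1900
    epact = (11 * golden + 18) % 30
    if (epact == 25 and golden > 11) or epact == 24:
        epact += 1
    return (((((day_of_year + epact) * 6) + 11) % 177) // 22) & 7


def is_friday_13th(d):
    """True when the date is the 13th and falls on a Friday."""
    return d.day == 13 and d.weekday() == 4  # Monday is 0, so Friday is 4


d = date(2025, 6, 13)
print(d, "phase:", phase_of_the_moon(d), "friday 13th:", is_friday_13th(d))
```

Under this approximation the date above comes out as phase 4 (full moon) and also as a Friday the 13th, which is why both effects would be active at once.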

So for ML to work, it has to know all permutations of a game? Does that mean ML is useless for non-deterministic games with random outcomes or procedural generation?
The thing it has learned previously needs to apply to the next run.

If you train an ML model on thousands of attempts at going around some racetracks where touching the walls slows you down, and the score is achieved by executing a fast lap, and the inputs to the model include where the car is and where the walls are, it should optimize towards avoiding touching the wall.

This behavior would likely still work even on new procedurally generated tracks that the model had not previously seen, as long as the relationship of inputs (car, walls) to desired behavior (fast lap) still applied.

If every N number of runs for a large value of N the game changes so that the walls are actually speed boosts and the center of the track slows you down, and there is no input to the ML model to tell it that the situation is different, it will initially try the previous strategy and perform worse, and it will be difficult to train it to handle both versions of the game without some discriminating input value to train on.
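Here is a toy version of that racetrack in Python, as a sketch only. The simulator, the speeds, and both policies are invented for illustration. The seed stands in for a procedurally generated track, and `walls_boost` is the hidden rule change that the policy gets no input about.

```python
import random


def lap_time(policy, seed, walls_boost=False, steps=200):
    """Simulate one lap on a generated track; lower is better."""
    rng = random.Random(seed)        # the seed stands in for the track layout
    x = 0.0                          # lateral position, walls at -1 and +1
    total = 0.0
    for _ in range(steps):
        x += rng.uniform(-0.3, 0.3)  # curves push the car sideways
        x += policy(x)               # steering correction from the policy
        x = max(-1.0, min(1.0, x))   # the car cannot leave the track
        touching = abs(x) >= 1.0
        if walls_boost:
            speed = 2.0 if touching else 0.5  # inverted rules
        else:
            speed = 0.5 if touching else 1.0  # normal rules
        total += 1.0 / speed
    return total


def center_seeking(x):
    """Policy 'learned' under normal rules: steer back toward the center."""
    return -0.5 * x


def wall_hugging(x):
    """Policy suited to the inverted rules: push toward the nearest wall."""
    return 0.5 if x >= 0 else -0.5


for seed in range(3):  # three tracks the policy has never 'seen'
    print(
        seed,
        lap_time(center_seeking, seed),
        lap_time(center_seeking, seed, walls_boost=True),
        lap_time(wall_hugging, seed, walls_boost=True),
    )
```

The center-seeking policy does equally well on every new seed, so new tracks are not the problem. It only falls behind when the rules flip, and since its single input (the car's position) looks the same in both versions of the game, nothing tells it to switch to the wall-hugging behavior.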

Well that is the whole trick. ML models ideally generalize from the training inputs to whatever new inputs show up during inference. For example, a vision model should recognize an image of a dog as a dog even if that exact image was not trained on. But that generalization always has limits. Usually score will decrease substantially the further "out-of-domain" the inputs are. So this model works fine when running a randomly generated dungeon it has never seen, but not when running a set of game rules it has never seen.
It still worked, just not as well. The program was trained and built its intuition for the game. It had a set script and assumptions.

Some of those assumptions were different, and since it's not learning/training, it couldn't adjust for those new assumptions, so it didn't do as well.

If you, a human, were forced to follow a set script/assumptions, the same would happen to you.