It was "too good" in the sense that it was realistic enough that they forgot they were in a simulation and thought the terminal window was a real native macOS window.
It was realistic enough to make them expect that their keyboard shortcut would close the window. It wasn't realistic enough to actually do that. So the visual UI was realistic enough to create that expectation, but the behaviour was not realistic enough to fulfil it.
That's what seemed confusing to me, since "it was so realistic that it didn't do what I expected when I pressed a certain key combination" seemed like a weird juxtaposition. Maybe it was the dash...