by jbosh 762 days ago
So scary. I was working on a garbage collection bug for a AAA video game one time for months. It'd crash once a day on a tester machine and we'd add more and more prints to try and narrow it down.

Finally got enough information and realized that the padding of a specific object was wrong (GC expected 16 bytes, object was 12 bytes). Fixing that made dozens of other corruption bugs disappear that we didn't even think were GC related.
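To illustrate how a 16-vs-12-byte mismatch could produce corruption that looks unrelated, here is a minimal sketch. It is not the engine's actual code: the names `GC_SLOT_SIZE`, `ACTUAL_OBJ_SIZE`, `align_up` and `object_starts` are hypothetical, and the assumption that the collector walks objects at a fixed stride is a guess about one way such a bug could behave.

```python
# Hypothetical model: a collector walks a block of objects assuming each
# slot is 16 bytes, while the objects were actually laid out at 12 bytes.
GC_SLOT_SIZE = 16      # what the collector expects (figure from the comment)
ACTUAL_OBJ_SIZE = 12   # what the object really occupied (figure from the comment)


def align_up(size, alignment):
    """Round size up to the next multiple of alignment (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)


def object_starts(count, stride):
    """Byte offsets at which `count` consecutive objects begin."""
    return [i * stride for i in range(count)]


real = object_starts(4, ACTUAL_OBJ_SIZE)   # where the objects actually live
assumed = object_starts(4, GC_SLOT_SIZE)   # where the collector looks

# Only the first object lines up. Every later access drifts further into
# (or past) some other object, so the damage shows up far from the cause.
mismatched = [(a, r) for a, r in zip(assumed, real) if a != r]

# Padding the object to the collector's slot size removes the drift.
padded = object_starts(4, align_up(ACTUAL_OBJ_SIZE, GC_SLOT_SIZE))
```

In this model the drift grows by 4 bytes per object, which would fit the pattern of rare crashes that are hard to connect to the GC.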


How is that possible? Which language was this in?
I don't know, but we had a similar bug in OCaml, although in reverse.

Linux/x86-64 expects the stack to always be 16 byte aligned (although the ABI documentation at the time didn't make this assumption very clear). OCaml called into C with a non-aligned stack. GCC-generated code, assuming the stack was 16 byte aligned, used some strange Intel AVX instruction that only works on aligned data, unlike every other Intel instruction ever that can work on any alignment (albeit maybe more slowly).

This manifested itself as rare and totally unreproducible crashes (because stack alignment differed between runs). It was a bit of a nightmare to solve.

The fact that MSVC generates unaligned loads for every AVX instruction but GCC doesn't gave me so many headaches. Most people worked on PC or Xbox and I was on the PlayStation team. "oh boy, another one of these..."
Yes! It's one of those cases where when you've seen it before and know the catch with the instruction (probably vmovdqa) then you'll immediately recognise it. If you don't know it, it's very very mysterious. Why on earth Intel decided to make a handful of instructions require alignment is also a mystery to me.
Sweet mama speed. Although from what I understand it is more legacy speed cause you're losing all your time to fetching the memory anyway. But when processors were slower it was a meaningful amount.
The instruction wants to access one cache line, not two.
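The cache-line point can be checked with a little arithmetic. This is a sketch that assumes 64-byte cache lines (typical for current x86-64 parts, but an assumption here); `lines_touched` is a hypothetical helper.

```python
CACHE_LINE = 64  # bytes; assumed line size, typical for x86-64


def lines_touched(addr, size):
    """Number of cache lines covered by an access of `size` bytes at addr."""
    first = addr // CACHE_LINE
    last = (addr + size - 1) // CACHE_LINE
    return last - first + 1


# A 16-byte load at a 16-byte-aligned address always stays within one line.
aligned_worst = max(lines_touched(a, 16) for a in range(0, 256, 16))

# The same load at an arbitrary address can straddle two lines,
# for example at address 56, which covers bytes 56..71.
unaligned_worst = max(lines_touched(a, 16) for a in range(256))
```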
It makes more sense now that Intel and AMD retconned naturally-aligned 128-bit atomic loads into the ISA: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688 (AMD's confirmation is in comment 10.)
I hear these stories of black magic bugs and I look back at my 8 year career wondering if I'm even progressing as an engineer. Thrown from one studio to the next and never truly staying long enough to be trusted to investigate such issues.

I have no idea how engineers that started in the 10's or 20's are going to rise up to take over those from the 90's/00's. So much is abstracted, but games specifically need to understand what's under the hood. Because they can and will hit some of the nastiest edge cases.

AFAIK you should only need 16-byte stack alignment if you use vector instructions, so Linux/x86-64 doesn't mandate it in all cases
The problem is that if you call another function, you won't know whether that function uses any instructions that require alignment. So in practice, only leaf functions can skip stack alignment. The ABI states that the stack pointer must be a multiple of 16 immediately before any `call` instruction, which makes it a multiple of 16, plus 8, on entry to the callee (after the return address has been pushed).
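The arithmetic behind that rule can be sketched as follows. The addresses are made up for illustration and the helper names are hypothetical. The model only tracks the stack pointer modulo 16, assuming `call` pushes an 8-byte return address and the callee then pushes one 8-byte register.

```python
RETURN_ADDRESS_SIZE = 8  # bytes pushed by `call` on x86-64
STACK_ALIGNMENT = 16


def rsp_on_entry(rsp_before_call):
    """Stack pointer seen by the callee: `call` pushes the return address."""
    return rsp_before_call - RETURN_ADDRESS_SIZE


def is_aligned(rsp):
    """True if rsp is a multiple of the 16-byte stack alignment."""
    return rsp % STACK_ALIGNMENT == 0


# Conforming caller: rsp is a multiple of 16 immediately before `call`.
good_caller_rsp = 0x7FFF_FFFF_E000           # made-up address
good_entry = rsp_on_entry(good_caller_rsp)   # 16n + 8 on entry
good_after_push = good_entry - 8             # e.g. after `push rbp`: aligned again

# Non-conforming caller: rsp is off by 8 before `call`.
bad_caller_rsp = good_caller_rsp - 8
bad_entry = rsp_on_entry(bad_caller_rsp)
bad_after_push = bad_entry - 8               # the callee's "aligned" slots are not
```

A compiler that trusts the rule places 16-byte locals relative to `good_after_push`. With the non-conforming caller those same offsets land 8 bytes off, which is where an alignment-requiring instruction would fault.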
C++, it was an in house engine to make everything fit in memory on PS3. Removing fragmentation gave ~10% of usable memory back.
That makes a lot of sense, I figured this would be impossible with a higher level language.
AAA game is likely C++
Probably a custom one built for an in-house engine in C++ if it's a AAA studio. Alternatively one of the various off-the-shelf ones you can #include, though my money is on the former. Note: I am not GP, I'm making educated guesses about what may have happened :)
GC and AAA probably C#.
If C# they'd have been using the MS authored GC. It was probably a custom GC for C++ heaps. Quite a few games do this, it's a smart productivity hack. Unreal uses a simple form of mark/sweep GC for its C++ game state heap.

https://unrealcommunity.wiki/garbage-collection-36d1da

It works fine because most RAM in games is consumed by assets that don't need to be scanned.
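For anyone unfamiliar with the idea, a toy mark/sweep pass looks roughly like this. It is a generic sketch, not Unreal's implementation: the object names, the `edges` table and the `mark`/`sweep` helpers are all hypothetical.

```python
def mark(roots, edges):
    """Return the set of objects reachable from the roots."""
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue  # already visited; also makes cycles safe
        marked.add(obj)
        stack.extend(edges.get(obj, ()))
    return marked


def sweep(heap, marked):
    """Keep only marked objects; everything else is considered freed."""
    return {obj for obj in heap if obj in marked}


# Toy game-state heap: only `player` is a root.
heap = {"player", "weapon", "old_enemy", "old_projectile"}
edges = {
    "player": ["weapon"],
    "old_enemy": ["old_projectile"],  # garbage that still references garbage
}

live = sweep(heap, mark(["player"], edges))
```

Only the objects in `heap` are scanned here. Large asset buffers would sit outside it, which is the reason given above for why the approach stays cheap.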

Interesting fact: Unity doesn't use a Microsoft GC, or the Microsoft implementation of CLR.

They had integrated the competing Mono implementation early on, and it came with its own "stop the world" Boehm GC.

They've been trying to move to Microsoft runtime for years but it's slow going.

Unlikely. Though in the case of Unity, this can happen if you fail to uphold memory layout expectations when writing C/C++ code (or if you do something really bad in C#).

There would also be a debugger that would let you reproduce and catch this more easily.