Hacker News
by zinxq 1647 days ago
Came here hoping someone had written this. Wasted many hours of my young life trying to figure out why my self-modifying assembler program worked perfectly in the debugger but not without it.

Could you elaborate? I feel like the OP is saying that it works just painfully slowly but your comment/problem indicates it didn't work at all without the debugger. Can you say how these relate? Am I missing something obvious?
Self-modifying code on x86 has always been somewhat microarchitecture dependent, since in the past there was a minimum distance required for the processor to notice the change. This was one of the tricks anti-piracy code used to keep people from reverse engineering it. Changes made close enough to the IP wouldn't be hazarded properly, so the stale instruction would be executed anyway. If the code is run under a debugger, the extra breaks/traps change the behavior and the newer instruction gets executed rather than the stale one.

Someone who plays with this on more recent x86s could talk about how this works on modern parts, but I would guess that if the CPU detects a hazard and has to roll back to a previous state, it probably goes into some kind of strongly in-order mode around the code in question. This might mean modern processors behave better than some older models, but there is likely an absolutely massive perf hit when this happens (think > 10x).

On something like ARM or RISC-V(?) without coherent I-D caches, this "window" could basically be forever. That raises an interesting security question, because in theory it's possible to have code being executed for extended periods of time that isn't actually visible anywhere, due to page/cache invalidation not clearing stale cache lines.
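
For what it's worth, the usual userspace answer on architectures without coherent I-D caches is to synchronize the caches explicitly after writing code. Below is a minimal sketch, assuming Linux on x86-64 or AArch64 with GCC or Clang. `emit_return` and `patch_and_rerun` are names made up for illustration, and the only non-standard call is the compilers' `__builtin___clear_cache`. As far as I know the builtin costs nothing on x86, where the hardware keeps things coherent; on AArch64, skipping it could let the second call run the stale instruction.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS under strict -std= modes on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

typedef int (*ret_fn)(void);

/* Hypothetical helper: write machine code for "return value" into buf.
 * Returns the number of bytes written. */
static size_t emit_return(uint8_t *buf, uint16_t value)
{
#if defined(__x86_64__)
    /* mov eax, imm32 ; ret */
    uint8_t code[] = { 0xB8, (uint8_t)(value & 0xFF), (uint8_t)(value >> 8), 0, 0, 0xC3 };
#elif defined(__aarch64__)
    /* movz w0, #imm16 ; ret */
    uint32_t code[] = { 0x52800000u | ((uint32_t)value << 5), 0xD65F03C0u };
#else
#error "this sketch only covers x86-64 and AArch64"
#endif
    memcpy(buf, code, sizeof code);
    return sizeof code;
}

/* Generate "return first", run it, patch it in place to "return second",
 * and run it again. Results go to *r1 and *r2. Returns 0 on success. */
int patch_and_rerun(uint16_t first, uint16_t second, int *r1, int *r2)
{
    const size_t len = 4096;
    uint8_t *page = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;

    int ok = -1;
    ret_fn fn = (ret_fn)(uintptr_t)page;

    size_t n = emit_return(page, first);
    assert(n <= len);
    /* Make the freshly written bytes visible to instruction fetch. */
    __builtin___clear_cache((char *)page, (char *)page + n);
    if (mprotect(page, len, PROT_READ | PROT_EXEC) != 0)
        goto done;
    *r1 = fn();

    /* Patch the same bytes, then synchronize again before re-executing. */
    if (mprotect(page, len, PROT_READ | PROT_WRITE) != 0)
        goto done;
    n = emit_return(page, second);
    __builtin___clear_cache((char *)page, (char *)page + n);
    if (mprotect(page, len, PROT_READ | PROT_EXEC) != 0)
        goto done;
    *r2 = fn();
    ok = 0;

done:
    munmap(page, len);
    return ok;
}
```

Calling `patch_and_rerun(1, 2, &a, &b)` should leave `a == 1` and `b == 2`. The page is flipped between writable and executable rather than mapped both ways at once, since some systems refuse writable-and-executable mappings.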
This is a really fascinating subject and seems like a rich area for research.

I was curious about your comment regarding ARM and RISC-V not having coherent instruction and data caches. Is there a toggle on these chips, then, for turning it on and off? I think I remember reading about some SoCs that have this configurable.

On older x86 chips - anything before the 486 - there was no cache, but opcode bytes were prefetched into a queue, ranging from 4 bytes (on the 8088) to 16 bytes on the 386. The 286 and 386 had an additional queue holding up to 3 decoded instructions (regardless of length).

These queues were "visible" in their effect on self-modifying code. After modifying one of the instructions that could already be in the queue, you had to do a jump to flush it.

If you know about this, it is obvious how certain code only works when single-stepped through, or perhaps when run on an 8088 with its shorter queue. But few people did, even among experienced programmers.
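
To make the queue effect concrete, here is a toy model in C. It is a sketch of the idea only, not a simulation of any real chip: the five two-byte opcodes, the 32-byte memory, and the function name `run_toy` are all invented for illustration, and a debugger's single-stepping is modeled simply as a queue flush after every instruction. The program patches the operand of a `LOAD` a few bytes ahead of itself; whether the patch is seen depends on whether that byte was already sitting in the prefetch queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy opcodes; every instruction is opcode + one operand byte. */
enum { OP_HALT, OP_NOP, OP_LOAD, OP_STORE, OP_JMP };
enum { MEM_SIZE = 32, MAX_QUEUE = 16 };

/* Run the toy program and return the accumulator at HALT.
 * queue_cap:   prefetch queue size in bytes (4 like an 8088, 16 like a 386)
 * single_step: flush the queue after every instruction, as a debugger trap would
 * jump_flush:  place a JMP right after the patching store (the classic workaround) */
int run_toy(int queue_cap, int single_step, int jump_flush)
{
    assert(queue_cap >= 2 && queue_cap <= MAX_QUEUE);

    uint8_t mem[MEM_SIZE] = {
        OP_LOAD, 42,   /*  0: acc = 42                                  */
        OP_STORE, 9,   /*  2: mem[9] = acc, patching the operand at 8   */
        OP_NOP, 6,     /*  4: becomes "JMP 6" when jump_flush is set    */
        OP_NOP, 0,     /*  6                                            */
        OP_LOAD, 1,    /*  8: acc = 1 unless the patched operand is seen */
        OP_HALT, 0     /* 10                                            */
    };
    if (jump_flush)
        mem[4] = OP_JMP;

    uint8_t queue[MAX_QUEUE];
    int qlen = 0;      /* bytes currently prefetched            */
    int ip = 0;        /* address of the next instruction to run */
    int acc = 0;

    for (;;) {
        /* Prefetch until the queue is full. Bytes already queued
         * are never refreshed, even if memory changes later. */
        while (qlen < queue_cap && ip + qlen < MEM_SIZE) {
            queue[qlen] = mem[ip + qlen];
            qlen++;
        }
        if (qlen < 2)
            return -1;                       /* ran off the end of memory */

        /* Execute from the queue, not from memory. */
        uint8_t op = queue[0], arg = queue[1];
        memmove(queue, queue + 2, (size_t)(qlen - 2));
        qlen -= 2;
        ip += 2;

        switch (op) {
        case OP_LOAD:  acc = arg; break;
        case OP_STORE: mem[arg % MEM_SIZE] = (uint8_t)acc; break; /* queue untouched */
        case OP_JMP:   ip = arg % MEM_SIZE; qlen = 0; break;      /* jump discards queue */
        case OP_NOP:   break;
        case OP_HALT:
        default:       return acc;
        }
        if (single_step)
            qlen = 0;
    }
}
```

With these definitions, `run_toy(4, 0, 0)` returns 42 (the short queue has not reached the patched byte yet), `run_toy(16, 0, 0)` returns 1 (the stale operand was already queued), and both `run_toy(16, 1, 0)` and `run_toy(16, 0, 1)` return 42, matching the "works in the debugger" and "jump to flush" behavior described above.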

IIRC the 486 and everything newer can detect when a cache line containing code is changed, so this is no longer necessary (but bad for performance as other commenters said).

Oh wow, this is a fascinating bit of history. I wonder how the i-cache drift detection is implemented. Cheers.