by RetroSpark 2344 days ago
"The DMA registers [...] run the full CPU 3.6Mhz speed. [...] I put a 32 byte function that would draw a scanline of polygon data in there."

The SNES supports SlowROM and FastROM but, interestingly, its internal memory is "SlowRAM". This seems like it would significantly bottleneck the system's performance. I guess "FastRAM" was just too expensive.

Another World uses SlowROM, so almost everything in the system is slow. It's a clever trick to use the DMA registers as a tiny amount of "FastRAM" and run code from there.

Something similar has been done on the Game Boy Advance - it's faster to run code out of Video RAM than directly from ROM or (the majority of) normal RAM.

5 comments

There’s a FastROM conversion of Super Ghouls 'n Ghosts. The original was plagued with slowdown, while the conversion runs very well. This leads me to think that the game was originally designed with FastROM in mind, but then they took manufacturing cost into account.

How much the cost difference actually was, I don’t know. Burger Becky mentioned begging her boss to use FastROM when Interplay made RPM Racing, but she was also turned down.

The trade offs of 1991!

Edit: of course, Nintendo wasn’t as cost-constrained. It was never an even playing field for third parties.

Moving to the much later GameCube: it supported a locked cache mechanism that interoperated with DMA. That way you can keep your code in half the cache and stream the data through the other half via DMA. Code is guaranteed to run at max CPU speed and is never evicted. Data access is also fast when you orchestrate your DMA to work nicely with the cache.

Instead of a fancy locked cache or other cache tricks, the PS1 and PS2 had scratchpads that ran at full speed: 1 KB for the PS1 and 16 KB for the PS2. Very useful.

What's annoying about that memory space is that it's fragmented: $4300-$437f is the 128-byte region for DMA registers, but $43xc-$43xf aren't usable. (Well, $43xf mirrors $43xb for whatever reason; $43xc-$43xe are open bus.)

So basically, after every 12 usable bytes you get 4 bytes that aren't usable before the next 12 begin.
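
The layout described above can be enumerated with a short sketch. It takes the comment at face value and assumes only offsets $0-$b of each 16-byte channel block are usable; the names `usable_runs` and `is_usable` are made up for illustration.

```python
# Sketch of the usable bytes in the SNES DMA register window, following the
# description above: 8 channels, 16 bytes each, offsets $0-$b usable.
DMA_BASE = 0x4300
CHANNELS = 8
CHANNEL_STRIDE = 0x10
USABLE_PER_CHANNEL = 12  # $43x0-$43xb; $43xc-$43xf are treated as unusable


def usable_runs():
    """Return (first, last) address pairs for each contiguous usable run."""
    return [
        (DMA_BASE + ch * CHANNEL_STRIDE,
         DMA_BASE + ch * CHANNEL_STRIDE + USABLE_PER_CHANNEL - 1)
        for ch in range(CHANNELS)
    ]


def is_usable(addr):
    """True if addr falls inside one of the 12-byte runs."""
    in_window = DMA_BASE <= addr < DMA_BASE + CHANNELS * CHANNEL_STRIDE
    return in_window and (addr & 0x0F) < USABLE_PER_CHANNEL


total_usable = sum(last - first + 1 for first, last in usable_runs())
```

Under these assumptions the window holds 96 usable bytes, split into eight 12-byte runs.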

I haven't actually looked at this game's code, but it's certainly clever if the author found a way to avoid having to perform unconditional jumps in there that would sacrifice most of the gains in performance.

I'm not sure the Ricoh 5A22 has large enough immediates for it, but there is an easy way to avoid a jump over a small amount of dead memory (example in x86):

  430B  3D -- -- -- --  cmp eax, dead32  # only affects flags
  4310  xx              dowhatever
See, e.g., http://www.muppetlabs.com/~breadbox/software/tiny/revisit.ht... , where it's used to skip over a mandatory header field.
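
To make the trick above concrete, here is a toy length decoder in Python. It is only a sketch: it knows just two x86 opcodes (`0x3D`, `cmp eax, imm32`, 5 bytes in 32-bit mode, and `0x90`, `nop`, 1 byte), and `instruction_starts` is a hypothetical helper, not part of any real disassembler. The `0xAA` bytes stand in for whatever the dead bytes happen to contain.

```python
# Toy length decoder for just two x86 opcodes (32-bit mode):
#   0x3D  cmp eax, imm32  -> 5 bytes (opcode + 4-byte immediate)
#   0x90  nop             -> 1 byte
INSTRUCTION_LENGTHS = {0x3D: 5, 0x90: 1}


def instruction_starts(code, base):
    """Return the addresses at which decoding of each instruction begins."""
    starts = []
    offset = 0
    while offset < len(code):
        starts.append(base + offset)
        offset += INSTRUCTION_LENGTHS[code[offset]]
    return starts


# Opcode at $430B; the four dead bytes at $430C-$430F become its immediate.
code = bytes([0x3D, 0xAA, 0xAA, 0xAA, 0xAA, 0x90])
starts = instruction_starts(code, 0x430B)
```

The decoder goes straight from $430B to $4310, so the dead bytes are consumed as an operand and never executed as instructions.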
An unconditional branch should be one of the fastest instructions. Why would that sacrifice most of the gains in performance?

It should cost about the performance gain from two of the effective instructions.
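
A rough way to put numbers on this exchange, as a sketch only. It assumes 6 master clocks per access in the DMA register region, 8 in SlowROM or WRAM, and 6 for a CPU-internal cycle, and it models `BRA` as two fetched bytes plus one internal cycle. Only instruction fetches are counted, since data accesses cost the same wherever the code lives. The function names are made up.

```python
FAST_ACCESS = 6   # master clocks per access in the DMA register region
SLOW_ACCESS = 8   # master clocks per access in SlowROM or WRAM
INTERNAL = 6      # master clocks for a CPU-internal cycle

RUN_BYTES = 12    # contiguous usable bytes per DMA channel block
BRA_BYTES = 2     # BRA opcode + 8-bit offset


def bra_cost(access):
    """Master clocks for one BRA fetched from memory with this access time."""
    return BRA_BYTES * access + INTERNAL


def net_saving_per_run():
    """Fetch clocks saved by one 12-byte run that ends in a BRA to the next."""
    useful_bytes = RUN_BYTES - BRA_BYTES
    # Same code laid out contiguously in slow memory, so no bridging BRA.
    slow_cost = useful_bytes * SLOW_ACCESS
    fast_cost = useful_bytes * FAST_ACCESS + bra_cost(FAST_ACCESS)
    return slow_cost - fast_cost
```

In this model a run gains 20 master clocks on fetches and spends 18 on the bridging branch. Code that fits inside a single run, such as a short loop, avoids the branch entirely.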

I used a debugger. It looks like the function in the DMA registers is actually just 32 bits (4 bytes):

  $4317 mvn src,dest
  $431a rts
Each of `src` and `dest` is either $7e or $7f, so this code performs a RAM-to-RAM memcpy.
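
For anyone unfamiliar with `mvn`, here is a minimal Python model of what it does, assuming 16-bit index registers. The dict-based `memory` and the function signature are illustrative only; the real instruction takes its count and addresses from the C, X and Y registers.

```python
def mvn(memory, src_bank, dest_bank, x, y, c):
    """Model MVN: copy c + 1 bytes from src_bank:x to dest_bank:y, ascending."""
    while True:
        memory[(dest_bank << 16) | y] = memory.get((src_bank << 16) | x, 0)
        x = (x + 1) & 0xFFFF
        y = (y + 1) & 0xFFFF
        c = (c - 1) & 0xFFFF
        if c == 0xFFFF:  # the counter wraps after the last byte is copied
            return x, y, c


# Copy 4 bytes from $7E:1000 to $7F:2000 (c holds the byte count minus one).
memory = {0x7E1000 + i: value for i, value in enumerate(b"SNES")}
x, y, c = mvn(memory, src_bank=0x7E, dest_bank=0x7F, x=0x1000, y=0x2000, c=3)
copied = bytes(memory[0x7F2000 + i] for i in range(4))
```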
Okay, a) that's very clever, but b) mvn is really quite slow. DMA would be faster (presuming the data is on two separate buses; you can't perform RAM -> RAM DMAs). Barring that, a manually unrolled loop in a slow memory area would definitely beat an mvn in a fast memory area.

I guess it's easy to judge this 25 years later with all we know now. That was a very cool idea to have implemented back then! Putting the mvn there would definitely be a boost compared to having the mvn in a slow ROM area (a saving of 6 master clock cycles per byte transferred).
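
The 6-cycle figure can be reproduced with a little arithmetic. This sketch assumes `mvn` spends 7 CPU cycles per byte (3 instruction-byte fetches, 1 source read, 1 destination write, 2 internal cycles), that accesses cost 6 master clocks in the DMA register region and 8 in SlowROM or WRAM, and that internal cycles cost 6. `mvn_clocks_per_byte` is a made-up name.

```python
FAST_ACCESS = 6   # master clocks per access in the DMA register region
SLOW_ACCESS = 8   # master clocks per access in SlowROM or WRAM
INTERNAL = 6      # master clocks for a CPU-internal cycle


def mvn_clocks_per_byte(code_access):
    """Master clocks MVN needs per byte copied, WRAM to WRAM."""
    fetches = 3 * code_access   # opcode + two bank bytes, re-fetched per byte
    data = 2 * SLOW_ACCESS      # one WRAM read + one WRAM write
    internal = 2 * INTERNAL
    return fetches + data + internal


slow = mvn_clocks_per_byte(SLOW_ACCESS)
fast = mvn_clocks_per_byte(FAST_ACCESS)
saving = slow - fast
```

With these assumptions the copy drops from 52 to 46 master clocks per byte, a saving of about 11.5%.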

Source for the GBA version? I was considering looking into that one too.
Re-reading the parent, I think he meant that in general, not OotW specifically.
That's right. Sorry if it wasn't clear.
Interesting to note that it was one 32-byte function in those registers netting a 10% speed increase. Reminds me of demoscene tricks. That function was for sure handwritten assembly/machine code. It is something of a lost art these days (and one most people no longer need).
I feel like the nineties demoscene equivalent on x86 is taking over the machine and putting some code and important variables into cache lines that never get evicted, while all of your data runs through the other cache lines by mapping that data sparsely in memory.