by kragen
879 days ago
a thing that puzzles me is how https://bbcmic.ro/?t=9ctpk is only 4× faster in emulation (about 30 seconds per scan line, so on the order of 2 hours for the whole image). i'm running this on a ryzen 5 3500u at 2400 megahertz.

the acorn electron, which supposedly takes 8 hours and 40 minutes, is a 1 megahertz 6502 when running from ram: roughly 262144 instructions per second. at 2 ipc one core of the ryzen should be about 4800 mips. if the emulation slowdown is 10× (10 host instructions to emulate one guest instruction), which is typical for naïve interpretation, it should be about 1000× faster on this laptop than on the original hardware (possibly the basic is in rom and so it's closer to 524288 instructions per second).

emulation through qemu-like dynamic code generation should cut the 10× slowdown to 3×, so it should be 3000× faster than the original hardware.

where did that factor of 1000 go? not the javascript engine, surely?

incidentally there is a rocket-ship button underneath the output screen image labeled 'send to beebjit' which runs the program in about a second.
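The back-of-envelope estimate above can be checked with a few lines of Python. This is only a sketch using the figures quoted in the comment; the constant and function names are made up for illustration, and the 2 IPC and 10×/3× slowdown values are the comment's own assumptions, not measurements.

```python
# Figures quoted in the comment above (all rough estimates).
HOST_CLOCK_HZ = 2_400_000_000   # ryzen 5 3500u at 2400 MHz
HOST_IPC = 2                    # assumed host instructions per clock
INTERP_SLOWDOWN = 10            # host instructions per guest instruction, naive interpreter
JIT_SLOWDOWN = 3                # assumed slowdown with dynamic code generation

GUEST_IPS_RAM = 262_144         # guest instructions per second when running from RAM
GUEST_IPS_ROM = 524_288         # guest instructions per second if BASIC runs from ROM


def expected_speedup(guest_ips, slowdown):
    """Speedup over the original hardware if only CPU emulation cost mattered."""
    host_ips = HOST_CLOCK_HZ * HOST_IPC
    return host_ips / slowdown / guest_ips


def observed_speedup(original_seconds, emulated_seconds):
    """Ratio of the reported run time on real hardware to the emulated run time."""
    return original_seconds / emulated_seconds


original = 8 * 3600 + 40 * 60   # 8 hours 40 minutes on the Electron
emulated = 2 * 3600             # roughly 2 hours in the browser emulator

print("interpreter, ROM rate:", round(expected_speedup(GUEST_IPS_ROM, INTERP_SLOWDOWN)))
print("interpreter, RAM rate:", round(expected_speedup(GUEST_IPS_RAM, INTERP_SLOWDOWN)))
print("codegen, ROM rate:    ", round(expected_speedup(GUEST_IPS_ROM, JIT_SLOWDOWN)))
print("observed:             ", round(observed_speedup(original, emulated), 1))
```

With the ROM instruction rate the interpreter estimate lands a little under 1000× and the code-generation estimate a little over 3000×, which matches the round numbers in the comment; the RAM rate would roughly double both.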
Consider that each chip of a home computer system (CPU, 1..2 IO/timer chips, audio, video, ...) needs to do 'emulation work' for each 1 or 2 MHz clock cycle, which can add up to quite a number of host computer instructions (dozens to hundreds).
If each chip emulator takes just 10..20 host system clock cycles to emulate one emulated clock cycle, then you are already looking at around 100 host system clock cycles per emulated clock cycle for the entire home computer (in reality it's probably worse).
Such 'vertically sliced' emulation code is also full of conditional branches which put a heavy burden on the host CPU branch predictor.
...and that's how a theoretical 1000x speedup suddenly becomes a 10x speedup: it's not the CPU emulation (this is usually cheap) but the rest of the emulated system which can be expensive.
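As a rough illustration of that budget, here is a small Python sketch. The function name is hypothetical, the chip count of 5 is just one reading of the list above (CPU, two IO/timer chips, audio, video), and the 2.4 GHz host clock is taken from the parent comment.

```python
def realtime_multiple(host_hz, guest_hz, chips, host_cycles_per_chip_tick):
    """How many times faster than realtime a per-cycle emulator could run,
    assuming every chip is ticked once per emulated clock cycle."""
    host_cycles_per_guest_cycle = chips * host_cycles_per_chip_tick
    emulated_hz = host_hz / host_cycles_per_guest_cycle
    return emulated_hz / guest_hz


# 5 chips at 20 host cycles each is 100 host cycles per emulated cycle.
print(realtime_multiple(2_400_000_000, 1_000_000, chips=5, host_cycles_per_chip_tick=20))
# Doubling the per-chip cost halves the achievable speedup.
print(realtime_multiple(2_400_000_000, 1_000_000, chips=5, host_cycles_per_chip_tick=40))
```

Under these assumptions a 1 MHz machine tops out at a few tens of times realtime, before counting branch mispredictions or any other overhead.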
Different emulators use all sorts of tricks and shortcuts, but usually with tradeoffs like less precise emulation, or less 'compartmentalized' code.
PS: case in point, this is just the top-level per-cycle function in my C64 emulator, which in turn calls per-cycle functions in each of the chip emulators (which may each be just as much code):
https://github.com/floooh/chips/blob/9a7f6d659b5d1bbf72bc8d0...
I'm trying to strike a balance between 'purity' (e.g. an entire emulated system can be 'plugged together' by wiring together chip emulators, just like one would build a real 8-bit computer on a breadboard) and being fast enough to comfortably run in realtime even in browsers (https://floooh.github.io/tiny8bit/c64.html).
It's definitely possible to implement a faster cycle-correct C64 emulator with a less 'pure' architecture, but it's quite hard to do meaningful optimizations while keeping that same 'virtual breadboard' architecture.
...considering all the code that runs per cycle it's actually amazing how fast modern CPUs are :)
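To make the 'virtual breadboard' idea concrete, here is a toy Python sketch of a system where every chip has a per-cycle tick function and the chips only talk through a shared set of pins. Everything in it (the pin layout, ToyCpu, ToyTimer, ToySystem) is invented for illustration and is not the API or structure of the linked C code; the 'CPU' just writes a counter to successive addresses.

```python
from dataclasses import dataclass, field

# Hypothetical pin layout: bits 0..15 address bus, bits 16..23 data bus, bit 24 read/write.
ADDR_MASK = 0xFFFF
DATA_SHIFT = 16
DATA_MASK = 0xFF << DATA_SHIFT
RW = 1 << 24  # set = CPU reads, clear = CPU writes


def get_addr(pins):
    return pins & ADDR_MASK


def get_data(pins):
    return (pins >> DATA_SHIFT) & 0xFF


def set_data(pins, value):
    return (pins & ~DATA_MASK) | ((value & 0xFF) << DATA_SHIFT)


@dataclass
class ToyCpu:
    """Stand-in for a CPU: each cycle it writes a counter to the next address."""
    addr: int = 0
    value: int = 0

    def tick(self, pins):
        pins = (pins & ~ADDR_MASK) | (self.addr & ADDR_MASK)
        pins = set_data(pins, self.value)
        pins &= ~RW  # signal a write cycle
        self.addr += 1
        self.value += 1
        return pins


@dataclass
class ToyTimer:
    """Stand-in for an IO/timer chip: it does some work on every cycle."""
    count: int = 0

    def tick(self, pins):
        self.count += 1
        return pins


@dataclass
class ToySystem:
    cpu: ToyCpu = field(default_factory=ToyCpu)
    timer: ToyTimer = field(default_factory=ToyTimer)
    ram: bytearray = field(default_factory=lambda: bytearray(65536))
    pins: int = 0

    def tick(self):
        # Every chip is ticked on every emulated clock cycle, whether or not
        # it has anything interesting to do; this is where the cost adds up.
        self.pins = self.cpu.tick(self.pins)
        self.pins = self.timer.tick(self.pins)
        # System 'glue logic': route the memory access requested on the pins.
        addr = get_addr(self.pins)
        if self.pins & RW:
            self.pins = set_data(self.pins, self.ram[addr])
        else:
            self.ram[addr] = get_data(self.pins)


system = ToySystem()
for _ in range(4):
    system.tick()
print(list(system.ram[:4]), system.timer.count)
```

The point of the sketch is the shape of ToySystem.tick: one call per chip per cycle plus address decoding, which is easy to wire together but leaves little room for skipping work when a chip is idle.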