by dr_zoidberg 2255 days ago
I understand that some people look suspiciously at the 15GHz mark, especially considering this was run on a 4.5GHz processor. As I understand it, these benchmarks compare the measured run time against how long the same work would have taken on a stock 1MHz 6502, and report the "clock speed" obtained as that ratio. So if I'm getting my result 10,000 times faster than a standard 6502, it means I'm at 10GHz.
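
To make that arithmetic concrete, here is a minimal sketch of the ratio as I read it. The function names are made up for illustration, and this is not necessarily how the emulator's own benchmark reports its figure; the only inputs taken from the thread are the 1MHz baseline, the 15GHz result and the 4.5GHz host.

```python
BASELINE_HZ = 1_000_000  # stock 6502 clock, 1 MHz


def effective_clock_hz(speedup, baseline_hz=BASELINE_HZ):
    """Equivalent 6502 clock for a run that is `speedup` times faster than stock."""
    return speedup * baseline_hz


def emulated_cycles_per_host_cycle(effective_hz, host_hz):
    """How many 6502 cycles' worth of work each host clock cycle covers."""
    return effective_hz / host_hz


if __name__ == "__main__":
    # 10,000x faster than a 1 MHz part is reported as 10 GHz.
    print(effective_clock_hz(10_000) / 1e9, "GHz")
    # A 15 GHz figure on a 4.5 GHz host is about 3.3 emulated cycles per host cycle.
    print(emulated_cycles_per_host_cycle(15e9, 4.5e9))
```

Read this way, a reported figure above the host's own clock only means that each host cycle covers more than one emulated 6502 cycle.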

I also understand that this is possible because the emulator is running on a superscalar processor. I'm not sure multicore has anything to do with it (the post specifically mentions the high single-core performance of the processor used). Still, considering that processors back in the 6502 era had just one execution port, and superscalar processors these days have a lot (I think 8? I've really lost track of what's usual these days), the figure makes sense all right, without involving any kind of multithreading.

Kudos to the authors of the emulator for having a super-optimized system that can effectively and efficiently emulate its target!

2 comments

I like the framing here, that of seeing this as a showcase of modern superscalar improvements. And yes, it's about single core performance only.

What is particularly interesting to me is how thoroughly superscalar "wins". Because of complexities with 6502 -> x64 mapping, and handling self-modifying code in particular, some of the most common 6502 instructions explode to multiple x64 instructions. Despite that huge extra instruction load, the translation still manages to run at much greater speed than a 1:1 instruction ratio.

Modern processors do not run on electrons. They run on unicorn tears and magic.

Note that there is also a speedup from dynamic optimization.

> I also understand that this is possible because the emulator is running on a superscalar processor.

It's also possible because the minimal architecture of the 6502 makes it inherently inefficient. With only three 8-bit registers -- which can't even be used interchangeably! -- and a non-addressable stack, a lot of CPU time on the 6502 is spent shuffling data around. Consider adding two 32-bit numbers, for example. On a 6502, this is a minimum of 38 cycles with zero-page operands (clc at 2 cycles + (lda, adc, sta) x4 at 3 cycles each); an x86 can complete the same operation in one cycle, potentially in parallel with other operations.
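
For anyone who wants to check the count, here is a rough Python sketch of that byte-at-a-time add. It assumes zero-page addressing for all operands (clc is 2 cycles; lda, adc and sta are 3 cycles each in that mode); the function name and the tuple it returns are just for illustration.

```python
# Documented 6502 cycle counts, assuming zero-page addressing for the operands.
CYCLES = {"clc": 2, "lda_zp": 3, "adc_zp": 3, "sta_zp": 3}


def add32_bytewise(a, b):
    """Add two 32-bit values one byte at a time, low byte first.

    Returns (32-bit result, final carry, 6502 cycles spent).
    """
    carry = 0                  # clc
    cycles = CYCLES["clc"]
    result = 0
    for i in range(4):
        byte_a = (a >> (8 * i)) & 0xFF       # lda: load byte i of a
        byte_b = (b >> (8 * i)) & 0xFF
        total = byte_a + byte_b + carry      # adc: add byte i of b plus carry
        carry = total >> 8                   # carry out feeds the next byte
        result |= (total & 0xFF) << (8 * i)  # sta: store byte i of the result
        cycles += CYCLES["lda_zp"] + CYCLES["adc_zp"] + CYCLES["sta_zp"]
    return result, carry, cycles


if __name__ == "__main__":
    print(add32_bytewise(0x0000FFFF, 1))
```

The cycle total is 2 + 4 x 9 = 38 regardless of the inputs; with absolute rather than zero-page addressing each load, add and store costs one more cycle.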