| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by jart 766 days ago
	I would have assumed it just decodes the x86 into a 32-bit ARM-like internal ISA, similar to how a JIT works in software. x86 decoding is extremely costly in software if you build an interpreter. Probably like 30% maybe and that's assuming you have a cache. But with JIT code morphing in Blink, decoding cost drops to essentially nothing. As best as I understand it, all x86 microprocessors since the NexGen i586 have worked this way too. Once you're code morphing the frontend user-facing ISA, a much bigger problem rears its ugly head, which is the 4096-byte page size. That's something Apple really harped on with their M1 design which increased it to 16kb. It matters since morphed code can't be connected across page boundaries.

2 comments

kijiki 766 days ago

It decodes to uOPs optimized for the exact microarchitecture of that particular CPU. High performance ARM64 designs do the same.

But in the specific case of tracking variable length instruction boundaries, that happens in the L1i cache. uOP caches make decode bandwidth less critical, but it is still important enough to optimize.

link

innocenat 766 days ago

That's called uOP cache, which Intel has been using since Sandy Bridge (and AMD but I don't remember on top of my head since when). But that's more transistors for the cache and its control mechanism.

link

jart 766 days ago

It's definitely better than what NVIDIA does, inventing an entirely new ISA each year. If the hardware isn't paying the cost for a frontend, then it shovels the burden onto software. There's a reason every AI app has to bundle a 500mb matrix multiplication library in each download, and it's because GPUs force you to compile your code ten times for the last ten years of ISAs.

link

saagarjha 766 days ago

Part of it is that, but part of it is that people pay for getting from 95% optimal to 99% optimal, and doing that is actually a lot of work. If you peek inside the matrix multiplication library you'll note that it's not just "we have the best algorithm for the last 7 GPU microarchitectures" but also 7 implementations for the latest architecture because that's just how you need to be to go fast. Kind of like how if you take an uninformed look at glibc memcpy you'll see there is an AVX2 path and a ERMS path but also it will switch between algorithms based on the size of the input. You can easily go "yeah my SSE2 code is tiny and gets decent performance" but if you stop there you're leaving something on the table, and with GPUs it's this but even more extreme.

link

camel-cdr 766 days ago

Using the uops directy as the isa would be a bad idea for code density. In RISC-V land, vendors tend to target standard extensions/profiles, but when they hardware is capable of other operations they often expose those through custom extensions.

link

kiratp 766 days ago

IMO if the trade off is cheaper, faster hardware iteration then Nvidia’s strategy makes a lot of sense.

link