|
|
|
|
|
by tgtweak
65 days ago
|
|
There is a huge market for "its faster" at the cost of efficiency, but I don't think your claim that an EML hardware block would be inherently less inefficient than the same workload running on a GPU. If you think it would be, back it up with some numbers. A 10-stage EML pipeline would be about the size of an avx-512 instruction block on a modern CPU, in the realm of ~0.1mm2 on a 5nm process node (collectively including the FMA units behind it), at it's entirety about 1% of the CPU die. None of this suggests that even a ~500 wide 10-stage EML pipeline would be consuming anywhere near the power of a modern datacenter GPU (which wastes a lot of it's energy moving things from memory to ALU to shader core...). Not sure if you're arguing from a hypothetical position or practical one but you seem to be narrowing your argument to "well for simple math it's less efficient" but that's not the argument being made at all. |
|
What? Unless the thing you want to compute happens to be exactly that eml() function (no multiplication, no addition, no subtraction unless it's an exponential minus a log, etc.) or almost so, it is unquestionably less efficient. If you believe otherwise, then please provide the eml() implementation of a practically useful function of your choice (e.g. that Arrhenius rate). Then we can count the superfluous transcendental function evaluations vs. a conventional implementation, and try to understand what benefit could outweigh them.
> A 10-stage EML pipeline would be about the size of an avx-512 instruction block on a modern CPU
Can you explain where you got that conclusion? And what do you think a "10-stage EML pipeline" would be useful for? Remember that the multiply embedded in your Arrhenius rate is already 8 layers and 12 operations.
Also, can you confirm whether you're working with an LLM here? You're making a lot of unsupported and oddly specific claims that don't make sense to me, and I'm trying to understand where they're coming from.