|
> He does have a narrow view, but it does not make his claims invalid. I would tend to disagree on this, especially when the claims come from the gamedev world. Games are presented as finished pieces (even when they aren't), not as just another release milestone. Ideally, a game is a one-off effort: you write a piece of code and, if you're lucky, you never have to touch it again. So doing one-off optimizations instead of focusing on milestones and long-term maintainability is not only possible but actively encouraged. That's why, until rather recently (20 or so years ago), assembly optimization was common for critical execution paths, if not for most of the product. Most other software doesn't work like that. You often implement something that will be maintained, modified, extended and reiterated on for several years, or even decades, not by you but by several other teams with totally different experience and backgrounds. Using some fancy trick to skip a cleaner, extensible, maintainable design because it shaves off a couple of cycles is literally burning your employer's money and potentially causing huge maintainability issues, as many programs don't actually rely on a happy path like games do. The main reason modern systems are slow isn't (just) that programmers are lazy. It's that most software, unlike games, has compatibility and maintainability requirements and, more often than not, a huge legacy to support. Also, in these systems, most development time is actually spent maintaining and extending existing code, not writing new code. The author's assertion is fundamentally wrong, because software engineering is about much more than performance, even when performance matters. Flash back to the beginning of the '90s, when "every game" used Bresenham's algorithm to avoid the (slow or non-existent) div instruction. In some cases, a couple of bitwise shifts would also eliminate mul operations.
These implementations were in some cases 2-4x faster than their classical counterparts on a 12-40 MHz machine. Two CPU generations later, the Pentium comes out, and both mul and div take 1 clock cycle. The fancy-pants implementation is now 3-5x slower at the same clock speed. Except now the CPU clock is 4x faster, and shuffling registers around may actually impede parallel execution of code. All of this in a 5-year window. I envy the relatively stable instruction set of the last decade, where everything is sort of predictable and claims about the speed of code can be made with a relatively high degree of confidence. But the reality is that silicon is cheap, and for most applications performance is gained not by throwing away what keeps some huge applications barely maintainable, but by deploying hardware: new, fancy, faster, cheaper and more economical hardware. Picking a single metric (performance) and a single point in time to bitch about something is actually a disservice to the community at large. |
Where are you getting this information? Agner[0] lists DIV as taking 17 cycles at best (8-bit operand already in a register) on the P5, and MUL as taking 11 cycles. Even Tiger Lake takes 6 cycles for DIV.
There are ways [1] to beat that, but I don't think you can get it down to a single cycle.
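For a concrete feel of what "beating" DIV can look like when the divisor is a known constant, here is a minimal Python sketch of the multiply-and-shift idea, plus the shift-based multiply the parent comment alludes to. It is not necessarily the exact method described in [1]. The names `div10`, `mul10`, `MAGIC` and `SHIFT` are mine, chosen for illustration. Python only checks the arithmetic here and says nothing about cycle counts.

```python
# Sketch: unsigned 32-bit division by the constant 10 without a divide.
# MAGIC is ceil(2**35 / 10); the pair (MAGIC, SHIFT) is exact for 0 <= x < 2**32.
MAGIC = 0xCCCCCCCD
SHIFT = 35


def div10(x: int) -> int:
    """Return x // 10 for a 32-bit unsigned x using one multiply and one shift."""
    if not 0 <= x < 2**32:
        raise ValueError("x must fit in an unsigned 32-bit integer")
    return (x * MAGIC) >> SHIFT


def mul10(x: int) -> int:
    """Return x * 10 using only shifts and an add: 10x = 8x + 2x."""
    return (x << 3) + (x << 1)


if __name__ == "__main__":
    # Spot-check both helpers against the plain operators.
    for value in (0, 9, 10, 123456789, 2**32 - 1):
        assert div10(value) == value // 10
        assert mul10(value) == value * 10
```

Even in the best case this replaces the divide with a multiply and a shift, so it still costs more than one cycle, which is consistent with the point above.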
[0]: https://www.agner.org/optimize/instruction_tables.pdf p.162
[1]: https://lemire.me/blog/2019/02/08/faster-remainders-when-the...