| > Caching? That's the first step. If you have a 64kB cache, then you want to fill that cache ONCE, calculate everything you can with that 64kB chunk of data. Save off the result, and then load a new 64kB chunk. This is called "tiling". Its actually kinda tricky to do just right, but once you know the concept, you basically spend effort ensuring that main-memory hits are minimized. GPUs have many different memory regions: global Memory, L2, L1, "Shared" memory, and finally registers. Maximizing your math and minimizing your memory-moves is one big part of optimization. ------- There are a ton of other optimization tricks: GPUs are more efficient if they access memory in certain ways. "Shared Memory" in GPUs are typically banked (on modern NVidia and AMD GPUs). If you have 64-threads, it is more efficient if thread#0 accesses X+0. Thread#1 accesses X+1. Thread#2 accesses X+2. (etc. etc.) Thread#63 accesses X+63. In fact, a GPU can perform all 64-memory loads simultaneously. However, if the 64-threads all access memory location X+4, you have a "bank conflict". The 64-threads can only read this memory location one-at-a-time, resulting in a massive slowdown. There are a huge number of tricks in "tricking" the for loops to access memory optimally across your many, many threads that are calculating the multiplication. -------- > Some kind of clever mathematical tricks? The clever mathematical tricks have been solved and are known. LU decomposition, etc. etc. Its important to know them, but that's not generally what people mean by "optimization". |