An L1-resident algorithm can outperform one that needs to talk to DRAM each iteration by 100x or more. In terms of wall clock time, this can mean the difference between minutes and days.
Would you be willing to try a fleeting idea if it took 2 days to test? How about if we could bring that down to 15 minutes?
Would you be willing to try a fleeting idea if it took 2 days to test? How about if we could bring that down to 15 minutes?