Hacker News
by 37ef_ced3 2003 days ago
In your [1], are your input/output arrays aligned the same for the fastest SIMD and ISPC runs?

Not in that codebase. It was a tutorial, and I wanted to ensure it's callable from safe Rust code, so I stuck with `_mm256_loadu_ps`. That code was just playing with dot-product-like lookups over vectors on the CPU. The code I'm more interested in tries to cram models into ~L2/L3 cache, so that a CPU-optimized model can be trained on a GPU and then deployed on a CPU.
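For anyone wondering what "callable from safe Rust" looks like in practice, here is a minimal sketch, not the code from the tutorial. The names `dot`, `dot_avx`, and `dot_scalar` are made up for illustration. The idea is that `_mm256_loadu_ps` accepts any address, so a plain `&[f32]` works without the caller guaranteeing 32-byte alignment, and the safe wrapper only has to check lengths and CPU support.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar fallback, also used as a reference result.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// AVX dot product (hypothetical helper, not from the original codebase).
/// Caller must ensure AVX is available and that `a` and `b` have equal length.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn dot_avx(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len();
    let chunks = n / 8;
    let mut acc = _mm256_setzero_ps();
    for i in 0..chunks {
        // Unaligned loads: valid for any f32 pointer, no 32-byte alignment needed.
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    // Horizontal sum by spilling the 8 lanes to a small array.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    // Handle the tail that does not fill a full 8-wide vector.
    for i in chunks * 8..n {
        sum += a[i] * b[i];
    }
    sum
}

/// Safe entry point: checks lengths and CPU features before using AVX.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "slices must have equal length");
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // Sound because AVX was just detected and the lengths match.
            return unsafe { dot_avx(a, b) };
        }
    }
    dot_scalar(a, b)
}
```

Switching to the aligned `_mm256_load_ps` would push a 32-byte alignment requirement onto every caller, which a bare `&[f32]` cannot express, so the safe wrapper would need an aligned buffer type or a runtime alignment check.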