Not in that codebase: it was a tutorial, and I wanted to make sure it was callable from safe Rust code, so I stuck with `_mm256_loadu_ps`. That code was just playing with dot-product-like lookups over vectors on the CPU. The code I'm more interested in tries to cram models into roughly L2/L3 cache, so that a CPU-optimized model can be trained on a GPU and then deployed on a CPU.
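
For anyone curious what "callable from safe Rust" with `_mm256_loadu_ps` looks like, here is a minimal sketch. It is not the code from that tutorial; the names `dot`, `dot_avx` and `dot_scalar` are made up for illustration. The idea is that the unaligned load has no alignment requirement, so a plain `&[f32]` slice is enough, and the safe wrapper only has to check lengths and detect AVX at runtime.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar fallback, used when AVX is unavailable (hypothetical helper).
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// AVX kernel. Unsafe because the caller must guarantee that the CPU
/// supports AVX and that both slices have the same length.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn dot_avx(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len();
    let chunks = n / 8;
    let mut acc = _mm256_setzero_ps();
    for i in 0..chunks {
        // Unaligned loads: no alignment requirement on the slice data.
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    // Spill the 8 lanes to the stack and sum them horizontally.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    // Handle the tail that does not fill a full 8-wide vector.
    for i in chunks * 8..n {
        sum += a[i] * b[i];
    }
    sum
}

/// Safe entry point: checks lengths, detects AVX at runtime,
/// and falls back to the scalar loop otherwise.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "slices must have equal length");
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // Sound: AVX was just detected and lengths were checked.
            return unsafe { dot_avx(a, b) };
        }
    }
    dot_scalar(a, b)
}
```

Callers just write `dot(&a, &b)` on ordinary slices or `Vec<f32>` data. An aligned load (`_mm256_load_ps`) would push a 32-byte alignment requirement onto every caller, which is harder to express in a safe signature.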