Hacker News
by Smerity 2003 days ago
I work in the space and was impressed with NN-512, as there's a painful gap in inference cost between CPU and GPU that doesn't have to exist. Intel and AMD are really missing the boat here. Most other companies have enough cash that they just go to GPUs, academics rarely sling low-level code even in CUDA, let alone AVX-512, and other than Fabrice Bellard's work I've seen few go that low level.

My suggestion would be to focus on an initial use case where a very limited, low-cost / high-efficiency CPU model can provide a massive advantage. NN-512 should be the framework that expands from that Redis-like core. The limited use case tactic is what I'm focusing on[1], mainly as I have a particular application and have less technical brilliance than yourself, so I need to focus ;)

An aged but still relevant example is the early word2vec work, which was (and still is) frequently better to throw onto CPUs than GPUs. A well-tuned implementation is not only advantageous on CPU but can win out in many scenarios where cost / latency / ... are important.

Congrats on the project though! I'd be curious for your thoughts for the future if you ever want to chat =]

[1]: Initial experiments written up as a tutorial with Rust and ISPC for a specific CPU-based NN task - https://state.smerity.com/smerity/state/01E8RNH7HRRJT2A63NSX...


In your [1], are your input/output arrays aligned the same for the fastest SIMD and ISPC runs?
Not in that codebase. It was a tutorial and I wanted to ensure it's callable from safe Rust code, so I stuck with `_mm256_loadu_ps`. That code was just playing with dot-product-like lookups over vectors on CPU. The code I'm more interested in is trying to cram models into ~L2/L3 cache, such that a CPU-optimized model can be trained on GPU and then deployed on CPU.
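
For anyone wondering what "callable from safe Rust with `_mm256_loadu_ps`" can look like, here is a minimal sketch. It is not the tutorial's code: the function names `dot`, `dot_avx` and `dot_scalar` are made up for illustration, and it assumes a plain f32 dot product with runtime AVX detection and a scalar fallback. Because the loads are unaligned, the caller can pass any `&[f32]` slice without alignment guarantees.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference dot product (hypothetical helper, also the fallback path).
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// AVX dot product using unaligned loads, so the slices need no special alignment.
/// Unsafe because the caller must make sure the CPU supports AVX.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn dot_avx(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let chunks = n / 8; // eight f32 lanes per 256-bit register
    let mut acc = _mm256_setzero_ps();
    for i in 0..chunks {
        // loadu = unaligned load: any f32 pointer is acceptable here.
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    // Horizontal sum: spill the eight lanes to an array and add them up.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    // Scalar tail for the elements that do not fill a whole register.
    for i in chunks * 8..n {
        sum += a[i] * b[i];
    }
    sum
}

/// Safe wrapper: checks lengths, detects AVX at runtime, falls back to scalar.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "input slices must have equal length");
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // Sound because AVX support was just verified.
            return unsafe { dot_avx(a, b) };
        }
    }
    dot_scalar(a, b)
}
```

Calling `dot(&a, &b)` on two equal-length `Vec<f32>` values needs no `unsafe` at the call site. Switching to the aligned `_mm256_load_ps` would require the caller to guarantee 32-byte alignment, which is harder to promise behind a safe API taking arbitrary slices.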