


	by 37ef_ced3 2013 days ago
	NN-512 (https://NN-512.com): Generate fully vectorized, stand-alone, human-readable C99 code for neural net inference, and understand exactly what's happening. For example, watch the code run with Linux's perf top and see the relative costs of each layer of the computation. Total transparency, no dependencies outside the C POSIX library.

4 comments

joshuamorton 2013 days ago

In what sense is this "better"?

The generated code looks like this:

    __m512i wfs16 = _mm512_castsi256_si512(_mm512_cvtps_ph(wf25, _MM_FROUND_TO_NEAREST_INT|_MM_FROUND_NO_EXC));
    wfs16 = _mm512_inserti64x4(wfs16, _mm512_cvtps_ph(wf26, _MM_FROUND_TO_NEAREST_INT|_MM_FROUND_NO_EXC), 1);
    _mm512_mask_storeu_epi32(wfPtr1+230400+38400*i5+768*c2+128*k1+64*m2+16*f3, 3855, wfs16);
    _mm512_mask_storeu_epi32(wfPtr1+345584+38400*i5+768*c2+128*k1+64*m2+16*f3, 61680, wfs16);

(which is a set of 4 lines that appear in the middle of an ~800 line function).

That's not "human readable".

Sure, you can use ASan or gdb, but if the profile shows the code is slow, what can you do? You're still at the mercy of the code generator to be able to optimize things.


37ef_ced3 2013 days ago

Google those _mm512_... intrinsics (they are part of GCC) to see what they mean. The code you pasted is converting single-precision floats to half-precision floats, and storing the half-precision floats to memory, 32 at a time. That's filter packing, which happens during initialization (and never during inference)
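
To make the two masked stores in the pasted snippet easier to follow, here is a minimal portable sketch of what a 16-lane masked store of 32-bit values does. The helper names `masked_store_epi32` and `lanes_selected` are made up for illustration; this is plain C that mimics the lane selection, not the intrinsic itself and not NN-512's code. The mask constants are the ones from the snippet: 3855 is 0x0F0F (lanes 0-3 and 8-11) and 61680 is 0xF0F0 (lanes 4-7 and 12-15).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for a 16-lane masked store of 32-bit values:
   lane i of src is written to dst[i] only when bit i of mask is set.
   Lanes whose mask bit is clear leave dst untouched. */
static void masked_store_epi32(int32_t *dst, uint16_t mask, const int32_t src[16])
{
    assert(dst != NULL && src != NULL);
    for (int lane = 0; lane < 16; ++lane) {
        if (mask & (1u << lane)) {
            dst[lane] = src[lane];
        }
    }
}

/* Count how many of the 16 lanes a mask selects. */
static int lanes_selected(uint16_t mask)
{
    int count = 0;
    for (int lane = 0; lane < 16; ++lane) {
        count += (mask >> lane) & 1;
    }
    return count;
}
```

The two masks are complementary (each selects 8 of the 16 lanes, and together they cover all 16), so the pair of stores scatters one vector of packed values to two different destinations, half of the lanes each.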

I agree, if you don't know anything about how convolution is implemented (filter packing, data packing, matrix multiplication, sum unpacking), you could be lost. But it's very shallow compared to a JIT or CUDA library scheme, and a knowledgeable ML performance engineer would have no difficulty

The inference function (at the end of the C file) is a series of blocks, each block corresponding to a convolution or other complex operation. It's straightforward to see which, by looking at where the weights come from (a field in a struct that has the same name as the layer in your graph)

If you use perf top (for example) you can see which convolution was most expensive, and why. Does the shape of the tensor produce many small partial blocks around the edge, so the packing is inefficient (a lot of tile overhang), for example? You can see that by glancing at the code and seeing that there are many optimized blocks around the edges. As a rule, if NN-512 generates small code for a tensor (few edge cases) you have chosen an efficient tensor shape, with respect to the tile
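
As a rough illustration of tile overhang, the sketch below counts how many tiles are needed to cover a tensor dimension and how much of the tiled area falls outside the tensor. The function names and the tile size used in the note afterwards are hypothetical, chosen only to show the arithmetic; they are not NN-512's actual tile dimensions.

```c
#include <assert.h>

/* Number of tiles needed to cover `extent` elements with tiles of `tile` elements. */
static int tiles_needed(int extent, int tile)
{
    assert(extent > 0 && tile > 0);
    return (extent + tile - 1) / tile;
}

/* Elements of the last tile that fall outside the tensor (the overhang). */
static int tile_overhang(int extent, int tile)
{
    return tiles_needed(extent, tile) * tile - extent;
}

/* Fraction of the tiled area that is overhang in a 2D tiling. */
static double wasted_fraction(int height, int width, int tile_h, int tile_w)
{
    double covered = (double)(tiles_needed(height, tile_h) * tile_h)
                   * (double)(tiles_needed(width, tile_w) * tile_w);
    double useful = (double)height * (double)width;
    return (covered - useful) / covered;
}
```

With an assumed 8x8 tile, a 56x56 tensor tiles exactly (no overhang), while a 57x57 tensor needs 8 tiles per side with 7 elements of overhang on each, so roughly a fifth of the tiled area is wasted.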

Or you might find that batch normalization is being done at inference time (as in DenseNet), instead of being integrated into the convolution weights (as in ResNet), because there's fanout from the source and a ReLU in between. You can see that easily in the generated code (the batch norm fmadd instructions will appear in the packing or unpacking code)
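
For the case where batch norm can be integrated into the convolution weights, the standard folding is sketched below for a single output channel. This is the textbook transformation, not code taken from NN-512; the struct and function names are made up, and `stddev` is assumed to be sqrt(variance + epsilon) precomputed by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Batch-norm parameters for one output channel.
   stddev is assumed to be sqrt(variance + epsilon), computed by the caller. */
struct bn_params {
    float gamma;
    float beta;
    float mean;
    float stddev;
};

/* Fold y = gamma * (conv(x) - mean) / stddev + beta into the convolution,
   so that conv'(x) with the updated weights and bias already equals y. */
static void fold_batch_norm(float *weights, size_t n_weights, float *bias,
                            const struct bn_params *bn)
{
    assert(weights != NULL && bias != NULL && bn != NULL);
    assert(bn->stddev > 0.0f);
    float scale = bn->gamma / bn->stddev;
    for (size_t i = 0; i < n_weights; ++i) {
        weights[i] *= scale;                 /* w' = w * gamma / stddev */
    }
    *bias = (*bias - bn->mean) * scale + bn->beta;  /* b' = (b - mean) * gamma / stddev + beta */
}
```

After folding, inference needs no separate batch norm step for that channel. When the folding is not possible, the scale and shift have to be applied at inference time instead.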

Is the matrix multiplication slow because there are too few channels per group (as in ResNeXt)? Easy to see in perf, make your groups bigger. Are you using an inefficient filter shape, so we have to fall back to a slower general purpose convolution? You can easily see whether Winograd or Fourier was used

And so on


akhilcacharya 2013 days ago

I’m truly baffled as to why such a sophisticated and useful package is being distributed and advertised by an anonymous individual.


signaru 2012 days ago

This can happen if you're in a toxic workplace that will be more baffled that you have done awesome stuff in your free time.


magicfractal 2012 days ago

Probably they’re afraid because it might be related to their day job :/


webmaven 2012 days ago

> Probably they’re afraid because it might be related to their day job :/

A slightly more common scenario is an employer that insists on "we own everything, related to your job or not, that you do even on your own time and equipment" clauses in employee contracts even though such clauses don't happen to be enforceable in the relevant jurisdiction.

Rather than having to "clear through your manager and legal" every little thing to get it added to your contract's personal IP whitelist, publishing anonymously makes perfect sense, where the plan is to de-anonymize after employment ends, at which point (should said now-former-employer have a hissy fit), their own counsel will eventually inform them they don't have a leg to stand on. After sending at least one threatening letter, of course.

Another solution is to spam your manager (and legal) with every trivial 'invention' that pops into your head until they relent[0][1], but that can burn through political capital you may prefer to use for other purposes, and will probably only narrow the scope rather than remove the unenforceable clause.

[0] https://cr.yp.to/patents/tarzian.html (my favorite is invention #12)

[1] As examples I was seriously tempted to use: "Python, but with 1-based indexing", "LinkExchange, but for Wingmen", and "ROT-13 Markdown".


signaru 2012 days ago

It need not be related to your job. Some employers might argue that since you're skilled enough to do such a thing, you should have been performing extraordinarily on the job, even if you are already delivering what the job asks for and are just as good as your peers. At worst, some struggling, poorly managed startup might even "turn around", and eventually you don't own your side passion project anymore.


gameswithgo 2013 days ago

i can read it! but then i spent months fiddling with intel intrinsics as a hobby


yudlejoza 2012 days ago

Great. Thanks!

1. Any particular reason you chose to avoid GPUs?

2. Did you benchmark your code's performance against GPU-centric codes (ideally for the same problem and problem-size)?


37ef_ced3 2012 days ago

The goal of NN-512 is efficient neural net inference on inexpensive, CPU-only cloud compute instances

For example, a Skylake-X cloud compute instance costs $10 per CPU-core per month at Vultr, and the NN-512 generated code does about 18 DenseNet121 inferences per CPU-core per second (in series, not batched)

In contrast, GPU cloud compute is almost unbelievably expensive. Even Linode charges $1000 per month, or $1.50 per hour (look at the GPU plans: https://www.linode.com/pricing/#row--compute)
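
The price comparison can be turned into per-inference numbers with a little arithmetic. The sketch below uses only the figures quoted in this comment ($10 per CPU-core per month, about 18 DenseNet121 inferences per core per second, $1000 per month for a GPU instance) and assumes a 30-day month and full utilization, which real workloads may not reach. The function names are made up for illustration.

```c
#include <assert.h>

/* Inferences completed in an assumed 30-day month at a steady rate. */
static double inferences_per_month(double per_second)
{
    assert(per_second > 0.0);
    return per_second * 60.0 * 60.0 * 24.0 * 30.0;
}

/* Dollars spent per million inferences at full utilization. */
static double dollars_per_million(double dollars_per_month, double per_second)
{
    return dollars_per_month / inferences_per_month(per_second) * 1e6;
}

/* Inference rate a pricier instance must sustain to match the cheaper
   instance's cost per inference. */
static double break_even_rate(double cheap_dollars, double cheap_rate,
                              double pricey_dollars)
{
    assert(cheap_dollars > 0.0);
    return cheap_rate * pricey_dollars / cheap_dollars;
}
```

Under these assumptions, one CPU core does 46,656,000 inferences per month, or about $0.21 per million inferences, and a $1000 per month GPU instance would need to sustain 1800 inferences per second to match that cost.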

As AVX-512 becomes better supported by Intel and AMD chips, it becomes more attractive as an alternative to expensive GPU instances for workloads with small amounts of inference mixed with other computation


yudlejoza 2012 days ago

I'm not disagreeing with you. I acknowledge that there may be a market for CPU-only NN tasks.

I think a thorough benchmark, either by you or by someone else, will only help your case, by giving a clear picture to those who need to make a decision.

Fun fact, GPUs are massively under-utilized during NN training. So it's quite possible NN on a good CPU might be only slightly slower.


freeone3000 2012 days ago

GPU underutilization depends on exactly what model you're training. It's not unreasonable to hit 80% or more of CUDA core usage on non-recurrent models like convnets, given sufficiently fast data pipelines and a reasonable batch size. Transformers and other recurrent functions hit 100% CUDA core utilization for large portions of each epoch, with the low-% usage on the comparatively short weight update at the end. As well, the current rule of thumb is that at the same price point (so a Xeon 4114 and an Nvidia Titan RTX) the GPU completes each epoch in 10% of the time the CPU takes, given the same compute graph... So it's highly unlikely that training will be anywhere close to as fast on a CPU as it is on a GPU.


ssivark 2012 days ago

GPUs are typically useful for training (due to massive parallelism), but not for inference.


touisteur 2012 days ago

Why not? You've got thousands of tensor cores, or TFLOPS at your fingertips, with already developed APIs, and if you're not too latency-sensitive you can batch a lot. Since you'll be doing the same inference operation millions of times, you don't have to re-prepare kernels and such; you can use CUDA graphs or whatever is the flavour of the day for low-overhead, repetitive computation. And if you want to scale a bit, you can add some GPUs before the PCIe lanes are all saturated, right? Apart from Myriad X and TPUs I'm not sure what could be more useful?


bravura 2012 days ago

I just want to say that I'm very interested in this library and have commented on it before. I'd really like to see it reach feature parity with PyTorch or Theano and emit your C code on the backend.

For example, I am not aware that one can currently use your library to implement Wavenet, other audio generative models like Wavegrad, or transformers.

Keep up the good work.


DSingularity 2013 days ago

Yummy. Thanks. Gonna bookmark that one.
