
Yeah, we definitely need to spend some more time on benchmarks after all is said and done.

That being said, while gemm is one op, there's a lot more to this than JNI round trips into other libraries. What matters here are also things like convolutions, pairwise distance calculations, element-wise ops, etc.

There's nuance there.

There are multiple layers here to consider:

1. The JNI interop managed via javacpp (relevant to this discussion)

2. Every op has allocation vs in place trade offs to consider

3. For our python interface, we have yet another layer to benchmark (we use pyjnius for jumpy, the python interface for nd4j)

4. Op implementations for the cuda kernels and the custom cpu ops we wrote. (That's where our avx512 and avx2 jars matter for example)
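To make point 2 concrete, here's a minimal sketch of the allocation vs in place trade off. This is plain Java on raw arrays, not nd4j's actual API; `AddOps`, `add`, and `addInPlace` are made-up names for illustration only.

```java
// Hypothetical illustration: plain Java arrays, NOT the nd4j API.
final class AddOps {
    // Allocating variant: inputs stay untouched, but every call pays for a new buffer.
    static double[] add(double[] a, double[] b) {
        checkSameLength(a, b);
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
        return out;
    }

    // In-place variant: no new buffer, but the original contents of `a` are overwritten.
    static double[] addInPlace(double[] a, double[] b) {
        checkSameLength(a, b);
        for (int i = 0; i < a.length; i++) {
            a[i] += b[i];
        }
        return a; // same array instance that was passed in
    }

    private static void checkSameLength(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch: " + a.length + " vs " + b.length);
        }
    }
}
```

The in-place version avoids allocation cost on hot paths, at the price of the caller losing the old value of `a`, so which one wins depends on the op and on how the result is used.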

For the subset we are comparing against, it's basically making sure we wrap the blas calls properly. That's definitely something we should be doing.

We profiled that and chose the pattern you're seeing above with f ordering.
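For anyone unfamiliar with f ordering: it means column-major layout, which is what the reference BLAS interface expects. Here's a rough sketch of how the two layouts index into a flat buffer. It's plain Java with hypothetical names (`Ordering`, `fIndex`, `cIndex`, `toFOrder`), not how nd4j lays things out internally.

```java
// Hypothetical illustration: flat-buffer indexing for a rows x cols matrix.
final class Ordering {
    // "f" (column-major) order: elements of one column sit next to each other.
    static int fIndex(int row, int col, int rows) {
        return row + col * rows;
    }

    // "c" (row-major) order: elements of one row sit next to each other.
    static int cIndex(int row, int col, int cols) {
        return row * cols + col;
    }

    // Copies a rectangular 2D array into a flat column-major buffer.
    static double[] toFOrder(double[][] m) {
        int rows = m.length;
        int cols = rows == 0 ? 0 : m[0].length;
        double[] flat = new double[rows * cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                flat[fIndex(r, c, rows)] = m[r][c];
            }
        }
        return flat;
    }
}
```

If the buffer is already column-major, it can be handed to a column-major gemm without a transpose or copy first, which is the kind of cost that shows up in these benchmarks.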

That is where we are fast and what we chose to optimize for. You are faster in those other cases and have laid that out very well.

Again, there's still a lot that was learned here and I will post the doc when we get it out there to make that less painful next time.

You made a great post here and really laid out the trade offs.

I wish we had more time to run benchmarks beyond timing our own use cases. If we had a smaller scope, we would definitely focus on every case you're mentioning here. We will likely revisit this at some point if we find it worth it.

In general, our communications and docs can always be improved (especially for our internals, like our memory allocation).

Re: your last point, we do run this kind of benchmarking against tensorflow. For example: https://www.slideshare.net/agibsonccc/deploying-signature-ve... (see slide 3 and also the broader slides for an idea of how we profile deep learning for apps using the jvm)

We need to do a better job of maintaining these things though. We don't keep them up to date and don't profile as much as we should. Past a certain point, profiling has diminishing returns compared with building other features.

I'm hoping a CI build to generate these things is something we get done this year so we can both prevent performance regressions and have consistent numbers we can publish for the docs.
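To give an idea of what such a CI check could look like, here's a minimal sketch. It isn't something we have today; `PerfGate`, the baseline value, and the tolerance are all assumptions for illustration. The idea is to compare the median of a few timing samples against a stored baseline and fail the build if it drifts past a tolerance.

```java
import java.util.Arrays;

// Hypothetical CI gate: flags a run whose median timing exceeds baseline * (1 + tolerance).
final class PerfGate {
    // Median is less sensitive than the mean to a single noisy sample on a shared CI box.
    static double median(double[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("need at least one sample");
        }
        double[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    // tolerance is a fraction, e.g. 0.05 allows a 5% slowdown before failing.
    static boolean isRegression(double[] samplesMs, double baselineMs, double tolerance) {
        return median(samplesMs) > baselineMs * (1.0 + tolerance);
    }
}
```

The same samples could also be written out as the published numbers, so the docs and the regression gate would come from a single run.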

Once the python interface is done that will be easier to do and justify since most of our "competition" is in python.