Hacker News new | ask | show | jobs
by waynecochran 1144 days ago
Lacking a Linear Algebra / Tensor library is what kills performance for DNN. For example, this kind of element by element manipulation must be avoided:

            while (i < I * O): (i += 1) {
                self.weights[i] -= 0.01 * grads[i];
            }
What are the options for vector / matrix / tensor operations in zig?
4 comments

1. The compiler will vectorize simple operations like that pretty well.

2. Zig has built-in @Vector types that are fixed-size data types designed to compile down to things like SIMD as efficiently as possible given that you might be asking it to do 16x operations on a CPU only supporting 8x width SIMD. You'd often write your high-level code as a runtime-known iteration count over those comptime-known vector widths.

2a. Inline assembly or inspecting the architecture before choosing the @Vector width are both options, so you can write your high-level code with that information in mine if necessary (e.g., to make bolt vector quantization work well in Zig I'm pretty sure you need to inline-assembly one of the swizzling operations).

3. You can always link to LAPACK and friends. Zig has a great c interface.

4. Matrix/tensor ops aren't built-in. That doesn't matter for a lot of what this demo shows since it'll be RAM/cache bandwidth bound, but you'd definitely need to link in or hand-code inner product routines to have better asymptotics and cache friendliness if you were doing too many large matrix multiplies.

5. Wrapping any of the above into a library would be pretty easy. That sort of code is easy to write, so I haven't looked to see what other people have made in that space, but I'm sure there's something.

And their async/await is being redesigned IIRC, but at least the last version would make for a low overhead (in both developer time and runtime) way to parallelize any of the above. In the old version it was totally trivial to write a high performance parallel variant (knight's move and whatnot) sudoku solver, and I doubt the new version will be any worse. Not that the interior of a linear algebra operation is often the best place to apply parallelization, but if you had a good reason to do so it wouldn't be hard.

I'm not aware of anything in particular that would make multi-machine computations even slightly less painful than other languages, but maybe someone can chime in here with ideas.

C++ / Eigen performs a lot of optimizations that you might expect from a Fortran compiler. For example, it ability to broadcast and perform reductions is pretty slick, e.g.:

        X.noalias() = (A - B).colwise().squaredNorm().mean();
Short of having tensor cores, this does a good job of static optimization for underlying vectorization support.
> 2a. Inline assembly or inspecting the architecture before choosing the @Vector width are both options, so you can write your high-level code with that information in mine if necessary (e.g., to make bolt vector quantization work well in Zig I'm pretty sure you need to inline-assembly one of the swizzling operations).

Inline assembly is great but support for intrinsics would be really valuable for Zig IMO.

You can link in intrinsics too. Again, C compatibility is high.
I could write a `pub fn main()` that consisted only of a call to a C library that implemented world_peace(). But it wouldn't change the fact that "support for intrinsics would be really valuable for Zig IMO."
The goal of the post is to de-mystify, not to make something production-ready. Doing everything in plain code without any magic libraries helps people understand what's happening
To echo the sibling, while this should be avoided in Python, languages like Zig, C++, Julia, Rust can expect the compiler to SIMD-ify these expressions.
Somewhat, but I think people vastly over-estimate their ability.

A common example is if there's any accumulation/reduction, compilers will almost entirely fail to generate SIMD unless you use -funsafe-math-optimizations type flags, because of non-associativity of floating point. Sum of squares is the classic example (not saying that specific operation is used in NN).

Explicit vectorization (e.g., using intrinsics) is almost always a relatively simple way to get orders of magnitude speedup compared to auto-vectorization, because of the above. Also because data layouts usually need to change as well (AoS vs SoA, etc.), though NN people seem to write decent data layouts.

I don't have any experience with `#pragma omp` type approaches which may be a middle ground.

You need to move these sections into the GPU. Neither zig, rust, C++ nor C can do that alone. You'll need to use cuda or futhark. That's the real speedup for these types of things.

Optimization in this case largely concerns memory management in the GPU and keeping data transfer between cpu to gpu at a minimum.

Essentially a massively parallel API combined with a massively parallel processor. I'm thinking of doing an end to end tutorial about this.