
by pedrocr 2674 days ago

For that specific example it's hard for the abstraction to be expensive, but here's a loop from one of my crates that compiles down to efficient code:

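    pub fn decode_12be(buf: &[u8], width: usize, height: usize) -> Vec<u16> {
      decode_threaded(width, height, &(|out: &mut [u16], row| {
        // Row `row` starts at byte offset row*width*12/8 (12 bits per pixel).
        let inb = &buf[(row*width*12/8)..];

        // Every 3 input bytes unpack into 2 big-endian 12-bit output values.
        for (o, i) in out.chunks_exact_mut(2).zip(inb.chunks_exact(3)) {
          let g1: u16 = i[0] as u16;
          let g2: u16 = i[1] as u16;
          let g3: u16 = i[2] as u16;

          // First pixel: all of byte 0 plus the high nibble of byte 1.
          o[0] = (g1 << 4) | (g2 >> 4);
          // Second pixel: the low nibble of byte 1 plus all of byte 2.
          o[1] = ((g2 & 0x0f) << 8) | g3;
        }
      }))
    }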

The "decode_threaded()" call passes a closure containing the inner loop to a generic function that multithreads a bunch of similar decoders, so that code isn't repeated in each one. The for loop describes what I actually want (turn every 3 bytes into 2 output values) instead of making me manage iteration variables and risk off-by-one bugs. "chunks_exact" and "chunks_exact_mut" are recent additions that let me say I only want chunks of exactly 3 input bytes and exactly 2 output values, so if the array is improperly sized the leftover at the end just gets skipped. This not only matches the intent of the code (I only ever produce 2 pixels from 3 bytes and nothing else will work) but also makes it easier for the compiler to lift the bounds checks out of the inner loop, which makes the code significantly faster (by 2x in some decoders).
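To make the "leftover gets skipped" behavior concrete, here is a minimal standalone sketch of the same inner loop using only the standard library. The function name "unpack_12be" and the sample bytes are made up for illustration and are not from the crate.

```rust
/// Unpack big-endian 12-bit samples: every 3 input bytes become 2 output values.
/// `unpack_12be` is a hypothetical name used only for this illustration.
fn unpack_12be(input: &[u8], out: &mut [u16]) {
    // chunks_exact(3) only yields complete 3-byte groups; any 1 or 2 trailing
    // bytes are left in the iterator's remainder and never reach the loop body.
    for (o, i) in out.chunks_exact_mut(2).zip(input.chunks_exact(3)) {
        let g1 = i[0] as u16;
        let g2 = i[1] as u16;
        let g3 = i[2] as u16;

        o[0] = (g1 << 4) | (g2 >> 4);
        o[1] = ((g2 & 0x0f) << 8) | g3;
    }
}

fn main() {
    // Seven bytes: two complete 3-byte groups plus one stray trailing byte.
    let input = [0xAB, 0xCD, 0xEF, 0x12, 0x34, 0x56, 0xFF];
    let mut out = [0u16; 4];
    unpack_12be(&input, &mut out);
    println!("{:x?}", out); // prints [abc, def, 123, 456]
}
```

The stray 0xFF at the end is simply ignored, and since every chunk has a known fixed length the indexing inside the loop can't go out of bounds.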

Now let's see "decode_threaded":

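    use rayon::prelude::*; // brings par_chunks_mut into scope

    pub fn decode_threaded<F>(width: usize, height: usize, closure: &F) -> Vec<u16>
      where F : Fn(&mut [u16], usize)+Sync {

      // Allocate the output image (alloc_image! is defined elsewhere).
      let mut out: Vec<u16> = alloc_image!(width, height);
      // Split the image into rows of `width` pixels and decode them in parallel.
      out.par_chunks_mut(width).enumerate().for_each(|(row, line)| {
        closure(line, row);
      });
      out
    }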

Besides allocating the image, it uses "par_chunks_mut" to make the code threaded. That function is provided by the rayon crate, which manages a thread pool and its scheduling for me. So I'm even using code written by other people to build something that's apparently deeply nested even though it needs to be fast. And yet after all this indirection and syntax goodies, the end result is efficient machine code that matches or even beats the original C++ code, not the dynamic-runtime-like behavior of Ruby/Python that you'd expect from such code.
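For anyone who wants to try the row-splitting pattern without pulling in rayon, here is a rough sketch that swaps in "std::thread::scope" from the standard library. This is not what the crate does: "decode_rows" is a hypothetical name, the image is just a zeroed Vec, and it naively spawns one thread per row where rayon would schedule the rows on a thread pool.

```rust
use std::thread;

/// Hypothetical stand-in for `decode_threaded`, using scoped threads
/// instead of rayon. Assumes `width` is non-zero (chunks_mut panics on 0).
fn decode_rows<F>(width: usize, height: usize, closure: &F) -> Vec<u16>
where
    F: Fn(&mut [u16], usize) + Sync,
{
    let mut out = vec![0u16; width * height];

    thread::scope(|s| {
        // chunks_mut hands out non-overlapping rows, so each thread gets
        // exclusive access to its own slice and no locking is needed.
        for (row, line) in out.chunks_mut(width).enumerate() {
            s.spawn(move || closure(line, row));
        }
    }); // every spawned thread has been joined once scope() returns

    out
}

fn main() {
    // Fill each row with its own index to show which row each call received.
    let image = decode_rows(3, 2, &|line: &mut [u16], row: usize| {
        for px in line.iter_mut() {
            *px = row as u16;
        }
    });
    println!("{:?}", image); // prints [0, 0, 0, 1, 1, 1]
}
```

The closure bound is the same as in the real function: "Fn" plus "Sync" is what lets a single shared reference to the closure be called from several threads at once.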

That's what zero-cost abstractions are. After memory and data race safety, I think Rust shines because it gives me a lot of the ergonomics of Ruby with the speed of C.