
by jerf 1875 days ago

"A program structured as a sequence of short, tight loops over vectors/slices as described is more or less hitting the performance sweet spot of modern microarchitectures."

Sure, but even faster is not to loop over intermediate arrays at all, by virtue of never constructing them in the first place when they aren't necessary.

"Moreover, there's usually no need to allocate or copy the array in that sort of data flow. I mean, unless you're chasing worst case performance to make a point or something. Slice the buffers out of a pool and allow ownership of the data to follow the execution context, then you're free to modify it in place."

That starts getting into "I'm sure someone can come up with some solution that meets some of these goals". Whatever map you're talking about here isn't the one defined as creating a new slice by mapping a function over the old slice. It kinda sounds to me like you're saying "well, if you just write a conventional for loop you can do all this in one pass", which is my point. I'm not the one pitching for lots of array creation; it's the people who insist on using maps and filters in a language that, of all the major languages, just isn't going to put in the optimization time to convert them back into loops under the hood.
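To make the contrast concrete, here is a minimal sketch, assuming the language under discussion is Go (the comment does not name it). `Map`, `Filter`, and the two `SumEvenSquares*` functions are hypothetical helpers written for this illustration, not part of any standard library. The pipeline version allocates two intermediate slices; the loop version produces the same result in one pass without constructing them.

```go
package main

import "fmt"

// Map allocates a new slice holding f applied to each element of xs.
// Hypothetical helper: this is the "new slice from old slice" definition of map.
func Map[T, U any](xs []T, f func(T) U) []U {
	out := make([]U, 0, len(xs))
	for _, x := range xs {
		out = append(out, f(x))
	}
	return out
}

// Filter allocates a new slice holding the elements of xs that satisfy keep.
func Filter[T any](xs []T, keep func(T) bool) []T {
	var out []T
	for _, x := range xs {
		if keep(x) {
			out = append(out, x)
		}
	}
	return out
}

// SumEvenSquaresPipeline builds two intermediate slices before summing.
func SumEvenSquaresPipeline(xs []int) int {
	evens := Filter(xs, func(x int) bool { return x%2 == 0 })
	squares := Map(evens, func(x int) int { return x * x })
	total := 0
	for _, s := range squares {
		total += s
	}
	return total
}

// SumEvenSquaresLoop computes the same sum in a single pass and
// never constructs the intermediate slices.
func SumEvenSquaresLoop(xs []int) int {
	total := 0
	for _, x := range xs {
		if x%2 == 0 {
			total += x * x
		}
	}
	return total
}

func main() {
	xs := []int{1, 2, 3, 4, 5, 6}
	fmt.Println(SumEvenSquaresPipeline(xs), SumEvenSquaresLoop(xs))
}
```

Whether a given compiler fuses the first form into the second is exactly the point in dispute; the sketch only shows what the two forms do, not how fast either runs.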

2 comments

1_person 1869 days ago

I think the pragmatic current realization of a functional data flow pipeline as described in my earlier reply incidentally achieves exactly what you mention here:

> Sure, but even faster is not to loop over intermediate arrays at all, by virtue of never constructing them in the first place when they aren't necessary.

In those examples of this type of pipeline, the vectors you start with are slices of hardware device descriptor queues which reference a DMA region, and by progressive transformation of the receive buffers and composition with additional buffer regions through intermediate decoded states you produce reply buffers.

The hardware is programmed to sample only the packets of interest to this descriptor queue, and copies the packets matching this filter directly to the DMA region referenced in the descriptor it places in the queue.

With sufficiently sophisticated hardware it is possible to offload decode of increasingly large fragments of protocol logic or even entire applications.

Describing the operations in a language of common, composable operations over vectors of buffers lets the share of any given application that is mapped to e.g. general purpose CPU, GPU, or FPGA/ASIC offload features vary continuously over time or API surface at runtime, and pretty neatly. It creates breakpoints in the logic flow that more or less always map exactly to functional hardware boundaries, because everything's pretty much just vectors of buffers all the way down when you think about it. Your process's entire runtime is just another vector of buffers to the kernel.

1_person 1874 days ago

> That starts getting into "I'm sure someone can come up with some solution that meets some of these goals".

The sooner we, as engineers, collectively acknowledge that we're doing things so wrong it's costing us close to 4 orders of magnitude in performance, and agree to stop treating knowledge of the hardware as forbidden knowledge and dismissing acceptance of reality as pointless micro-optimization, the sooner we'll make progress towards accepting one of the many patterns people have been advocating for upwards of a decade, which do solve these problems.

One of the patterns which happens to be able to reclaim most of those missing 4 orders of magnitude performance is functional data flow, which is by unfortunate coincidence essentially the pattern you're denouncing here for its performance. It actually maps almost directly to the ideal implementation because it's ... literally expressing problems in terms of vector operations with no data dependence, which maps perfectly to a parallelized pipeline of the instruction types that achieve near optimal throughput and realized IPC.
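As a rough illustration of "vector operations with no data dependence", here is a minimal Go sketch (Go is an assumption; `ApplyInPlace` and `ApplyInPlaceParallel` are hypothetical names). Each element is transformed in place using only its own value, so the buffer can be cut into chunks and processed in any order with the same result. Goroutines stand in here for whatever parallel hardware is available; the sketch does not claim the Go compiler vectorizes the loop.

```go
package main

import (
	"fmt"
	"sync"
)

// ApplyInPlace overwrites each element of buf with f(element).
// Each iteration reads and writes only its own index, so no
// iteration depends on the result of another.
func ApplyInPlace(buf []uint32, f func(uint32) uint32) {
	for i := range buf {
		buf[i] = f(buf[i])
	}
}

// ApplyInPlaceParallel splits buf into at most `workers` contiguous chunks
// and transforms each chunk in its own goroutine. This is only safe because
// the chunks do not overlap and f looks at one element at a time.
func ApplyInPlaceParallel(buf []uint32, f func(uint32) uint32, workers int) {
	if workers < 1 {
		workers = 1
	}
	chunk := (len(buf) + workers - 1) / workers
	if chunk == 0 {
		return // empty buffer, nothing to do
	}
	var wg sync.WaitGroup
	for start := 0; start < len(buf); start += chunk {
		end := start + chunk
		if end > len(buf) {
			end = len(buf)
		}
		wg.Add(1)
		go func(part []uint32) {
			defer wg.Done()
			ApplyInPlace(part, f)
		}(buf[start:end])
	}
	wg.Wait()
}

func main() {
	a := []uint32{1, 2, 3, 4, 5, 6, 7}
	b := append([]uint32(nil), a...) // independent copy for comparison
	double := func(x uint32) uint32 { return x * 2 }

	ApplyInPlace(a, double)
	ApplyInPlaceParallel(b, double, 3)
	fmt.Println(a, b)
}
```

No new buffer is allocated by either function; the caller's slice is modified directly, which is the "modify it in place" part of the pattern.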

I am not trying to be a dick but I am very passionate about this topic and I disagree very strongly with your assessments of the performance here.