by jerf
1875 days ago
"A program structured as a sequence of short, tight loops over vectors/slices as described is more or less hitting the performance sweet spot of modern microarchitectures." Sure, but even faster is not to loop over intermediate arrays at all, by virtue of never constructing them in the first place when they aren't necessary. "Moreover, there's usually no need to allocate or copy the array in that sort of data flow. I mean, unless you're chasing worst case performance to make a point or something. Slice the buffers out of a pool and allow ownership of the data to follow the execution context, then you're free to modify it in place." That starts getting into "I'm sure someone can come up with some solution that meets some of these goals". Whatever map you're talking about here isn't one that is defined as creating a new slice based on mapping a function over the old slice. I mean, it kinda sounds like you're saying "well, if you just write a conventional for loop you can do all this in one pass" to me? Which is my point? I'm not the one pitching for lots of array creation, it's people who insist on using maps and filters in a language that, of all the major languages, just isn't going to put in the optimization time to convert them back into loops under the hood. |
|
> Sure, but even faster is not to loop over intermediate arrays at all, by virtue of never constructing them in the first place when they aren't necessary.
In those examples of this type of pipeline, the vectors you start with are slices of hardware device descriptor queues that reference a DMA region. By progressively transforming the receive buffers, and composing them with additional buffer regions through intermediate decoded states, you produce reply buffers.
The hardware is programmed to sample only the packets of interest to this descriptor queue, and copies the packets matching this filter directly to the DMA region referenced in the descriptor it places in the queue.
With sufficiently sophisticated hardware it is possible to offload decode of increasingly large fragments of protocol logic or even entire applications.
Describing the operations in a language of common, composable operations over vectors of buffers lets the share of any given application that is mapped to, e.g., general-purpose CPU, GPU, or other FPGA/ASIC offload features vary continuously over time or API surface at runtime, pretty neatly. It creates breakpoints in the logic flow that more or less always map exactly to functional hardware boundaries, because everything's pretty much just vectors of buffers all the way down when you think about it. Your process's entire runtime is just another vector of buffers to the kernel.
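As a rough software-only analogy for the pipeline described above, here is a small Go sketch. Everything in it is an assumption made for illustration: `newBatch` stands in for slicing buffers out of a DMA region or pool, the `stage` type stands in for a composable operation over a vector of buffers, and the "ping" filter and uppercase transform are invented. No real driver or hardware API is involved. The point it shows is that each stage can filter or transform the batch in place, without allocating a new slice per stage.

```go
package main

import (
	"bytes"
	"fmt"
)

// stage is a hypothetical composable operation over a vector of buffers.
// It may modify the batch in place and returns the (possibly shorter) batch.
type stage func(batch [][]byte) [][]byte

// newBatch slices n fixed-size buffers out of one contiguous arena,
// a stand-in for buffers carved out of a DMA region or pool.
func newBatch(n, size int) [][]byte {
	arena := make([]byte, n*size)
	batch := make([][]byte, n)
	for i := range batch {
		// The full slice expression caps capacity so buffers cannot overlap.
		batch[i] = arena[i*size : (i+1)*size : (i+1)*size]
	}
	return batch
}

// keepPrefix filters in place, reusing the batch's own backing array.
func keepPrefix(prefix []byte) stage {
	return func(batch [][]byte) [][]byte {
		kept := batch[:0]
		for _, b := range batch {
			if bytes.HasPrefix(b, prefix) {
				kept = append(kept, b)
			}
		}
		return kept
	}
}

// upperInPlace rewrites ASCII lowercase letters in each buffer without copying.
func upperInPlace(batch [][]byte) [][]byte {
	for _, b := range batch {
		for i, c := range b {
			if c >= 'a' && c <= 'z' {
				b[i] = c - ('a' - 'A')
			}
		}
	}
	return batch
}

// run threads one batch through the stages; ownership follows the call chain.
func run(batch [][]byte, stages ...stage) [][]byte {
	for _, s := range stages {
		batch = s(batch)
	}
	return batch
}

func main() {
	batch := newBatch(3, 16)
	for i, msg := range []string{"ping a", "noise", "ping b"} {
		n := copy(batch[i], msg)
		batch[i] = batch[i][:n]
	}
	for _, b := range run(batch, keepPrefix([]byte("ping")), upperInPlace) {
		fmt.Printf("%s\n", b)
	}
}
```

Because every stage has the same shape, any stage boundary is a candidate place to hand the batch to something else, which is loosely the property the comment attributes to hardware offload boundaries.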