Hacker News new | ask | show | jobs
by dwattttt 728 days ago
You mention the most reasonable trick is to just avoid hammering/read-write the same cache lines; I guess you didn't hit this as a need because your QEMU instruction pipe was fast enough, but would you batch up your events along cache lines, fill up a line, and signal it's free to the consumers instead?
1 comments

Yeah. So I forget the original tuning I did for that project. But, I fill up a buffer which is on its own cache line (Chunk) and then signal that the chunk is ready for ownership on the other side, thus sending it. I’m not sure why the signaling atomics aren’t on their own cache lines, I imagine I tried both and this was faster? There’s also a chance I never tried it because I felt I didn’t need it? I’m not sure!

Edit: Yep, looking at this I think I see some room for improvement in design. I'm getting about 60-70 ns/iter right now on my Intel(R) Core(TM) i9-9900KS CPU @ 4.00GHz with turbo on, and a loaded up system. This code is 2 years old and I've gotten a lot better at designing stuff like this. But, it's still really good. It's something I'd like to revisit at some point. The main bottleneck I think I see is `cur_seq` should be on its own cache line, as that's the only heavily thrashed value between cores. Most of the others are primarily in the shared cache line state, until a writer hits a mailbox. This design searches linearly through mailboxes for work to do, perhaps it would be faster to save the previous index? I'm also not sure I always need to have readers strongly sequenced, such that they can process more than one message a time, but this could have been a requirement of the way I was using it? While in theory storing/caching some information is great, in practice it often means unpredictable and dependent loads to use them. Eg, storing the previous index now means the CPU has to fetch the previous index from memory, and then loop on it. The compiler also has to treat this as a "volatile" value, and thus will struggle to do things like loop unrolling and bounds checking removal. With a low N (~4-16, number of mailboxes), it's probably always faster to just `for _ in 0..16` instead of doing smarts here, as the processor can do all 16 of those effectively in parallel by the time the CPU has even loaded the "cached index". Once again, tuning is extremely specific to a workload and parameters.

For maximum performance, it's pretty much always required to try out some un-ergonomic API. In this case the sequencing is nice, but perhaps I could rewrite the code that uses mempipe to not care and handle sequencing someplace else? I forget. In theory I could have multiple banks of tickets, and switch between them on some level of pressure or timeout. Using the same signal line between the writer and the reader (eg, they both must write to client_owned) is probably a mistake. I think it'd be better to use two indexes, bumped on one side when a writer writes, and bumped on the other side when a reader is done. This would allow better "bursting" of data in my opinion? Who knows, even a data structure as simple as this effectively requires building it and trying to really determine how it performs. So many resources in use in the CPU and it's a tough balance in your head.