Hacker News new | ask | show | jobs
by jnordwick 680 days ago
Yesterday, he posted on Reddit and I expressed some concern with the benchmarks. The benchmarks are claiming 0.36 ns of overhead per call, but only the computing function. There is a second thread running doing the schedule that the overhead numbers don't include. It seems pretty clear he's running on a hyperthreaded 8 core machine (so 16 threads), I was guessing 3 Ghz, so that literally a single cycle of overhead.

Each extra thread adds more overhead from lock contention. At 16 threads overhead is up to 3.6 ns (so 10 times more). I'm guessing, but that would mean the 0.36 ns of overhead included an uncontested lock? That's not possible. There's some other weirdness going on in the benchmark data too. So either I'm not understanding what he's actually timing or maybe there is a bug in the benchmark code.

Also, if you multiply all the values out, I think he's timing in milliseconds (when runtime is calculated and converted to millis, they come out as whole numbers). Don't most benchmarkers have better precision than that? Maybe he's just using `time prog` and the data is just really dirty. Or maybe he's just choosing really, really bad metrics that totally useless for this (this is probably correct, just not sure if there are more issues).

2 comments

I don't understand why it's unbelievable. They walk through the required logic to perform scheduling and dispatch; it looks incredibly simple and extremely suited to hiding on an OoO. The benchmark is summing down a binary tree which is going to give you a lot of space to hide what's at most a handful of instructions. There's obviously no lock contention because, like, look at the algorithm, what locks contending on what?

AFAICT this is just a really cool and really practical algorithm for microthreads with near zero cost per semantic thread.

> There's obviously no lock

What? The threadpool has a shared mutex accessed by each worker thread, which is used for pushing work on heartbeat and for dequeuing work. https://github.com/judofyr/spice/blob/167deba6e4d319f96d9d67...

"Adding more threads to the system will not make your program any slower" is an insane claim to be making for any lightweight task scheduling system, and trivially not true looking at this: if you have 1000 threads as workers and degenerate tree program than each worker will take turns serializing themselves through the mutex. Things like these are massive red flags to anyone reading the README.

The README covers this. Its argument is persuasive. If your point is that the constant is badly tuned for theoretical 1000 core machines that don't exist, I'm not sure I care. A 100ns stall at most every 100us becoming more likely when you approach multiple hundreds of cores is hardly a disaster. In the context of the comment I replied to, the difference between 8 and 16 workers is literally zero, as the wakeups are spaced so the locks will never conflict.

Actually, if you did have a 32k core machine somehow with magical sufficiently-uniform memory for microthreading to be sensible for it, I think it's not even hard to extend the algorithm to work with that. Just put the workers on a 3D torus and only share orthogonally. It means you don't have perfect work sharing, but I'm also pretty sure it doesn't matter.

The lock is acquired every 100 microseconds and it is expected that only one thread accesses it at a time.
I know. But I have no other explanation as to the latency increase that is seen as the number of threads increases. Do you have a better theory?