by judofyr 680 days ago
Author here.

I knew that some people would react negatively to the term, but I can assure you that the intention is to give you a better understanding of exactly how and when you should use Spice and Rayon. I would recommend reading the benchmark document: https://github.com/judofyr/spice/blob/main/bench/README.md.

What people typically do when comparing parallel code is to compare only the sequential baseline with a parallel version running on all threads (16). Let's use the numbers for Rayon that I got for the 100M case:

- Sequential version: 7.48 ns.

- Rayon: 1.64 ns.

Then they go "For this problem Rayon showed a 4.5x speed-up, but uses 16 threads. Oh no, this is a bad fit." That's very true, but you don't learn anything from that. How can I apply this knowledge to other types of problems?
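
For reference, the arithmetic behind that headline figure can be sketched as follows. This is only an illustration: the function names are made up here and are not part of Rayon or Spice, and the inputs are just the two timings quoted above.

```rust
/// Speed-up of a parallel run relative to the sequential baseline.
fn speedup(sequential_ns: f64, parallel_ns: f64) -> f64 {
    sequential_ns / parallel_ns
}

/// Parallel efficiency: speed-up divided by the number of threads used.
fn efficiency(sequential_ns: f64, parallel_ns: f64, threads: u32) -> f64 {
    speedup(sequential_ns, parallel_ns) / f64::from(threads)
}

fn main() {
    // Timings quoted above for the 100M case (hypothetical variable names).
    let sequential_ns = 7.48;
    let rayon_ns = 1.64;
    let threads = 16;

    println!("speed-up:   {:.2}x", speedup(sequential_ns, rayon_ns));
    println!(
        "efficiency: {:.0}%",
        100.0 * efficiency(sequential_ns, rayon_ns, threads)
    );
}
```

A single pair of numbers like this only tells you the end result at 16 threads; it says nothing about where the time goes, which is why the per-thread-count runs are more informative.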

However, if you run the same benchmark on a varying number of threads you learn something more interesting: The scheduler in Rayon is actually pretty good at giving work to separate threads, but the overall work execution mechanism has a ~15 ns overhead. Despite this being an utterly useless program we've learnt something that we can apply later on: Our smallest unit of work should probably be a bit bigger than ~7 ns before we reach for Rayon. (Unless it's more important for us to reduce overall latency at the cost of the throughput of the whole system.)

In comparison, if you read the Rayon documentation they will not attempt to give you any number. They just say "Conceptually, calling join() is similar to spawning two threads, one executing each of the two closures. However, the implementation is quite different and incurs very low overhead": https://docs.rs/rayon/latest/rayon/fn.join.html.

(Also: If I wanted to be misleading I would say "Spice is twice as fast as Rayon since it gets 10x speed-up compared to 4.5x speed-up")

3 comments

Thanks for the answer, this part is particularly interesting indeed:

> Despite this being an utterly useless program we've learnt something that we can apply later on: Our smallest unit of work should probably be a bit bigger than ~7 ns before we reach for Rayon.

That's a very interesting project.

The big limitation I see with the current approach is that the usability of the library is much worth than what Rayon offers.

The true magic of Rayon is that you just replace `iter()` with `par_iter()` in your code and voilà! Now you have parallel execution. But yes, it has some overhead, so maybe Rayon could try to implement this kind of scheduling as an alternative so that people can pick what works best for their use case.

Too late to edit so I'll put it here:

> the usability of the library is much worse than what Rayon

I'm a little bit ashamed to see that this fairly upvoted comment of mine has such a stupid English mistake in it…

Don't worry about it: my brain "autofilled" wor.. as worse.

Adding more cores doesn't change the time per operation. Your graphs are grossly wrong. What you should have done is drop the nanoseconds and just take the total execution time. Whenever you're writing 1.64ns, you should have written 164ms.
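
The conversion between the two ways of reporting is a single multiplication. A minimal sketch, assuming the "100M case" means 100 million operations (the function name is hypothetical):

```rust
/// Total wall-clock time in milliseconds, given an amortized
/// per-operation time in nanoseconds and the number of operations.
fn total_ms(per_op_ns: f64, operations: u64) -> f64 {
    per_op_ns * operations as f64 / 1e6 // 1 ms = 1e6 ns
}

fn main() {
    // Assumption: the 100M case performs 100 million operations.
    let operations: u64 = 100_000_000;

    println!("parallel:   {:.0} ms", total_ms(1.64, operations));
    println!("sequential: {:.0} ms", total_ms(7.48, operations));
}
```

Both forms carry the same information; the per-operation figure is the total time divided by the operation count, not the latency of any individual operation.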

The overhead should be measured as a percentage versus a theoretical base line such as perfect linear speedup. You haven't shown the ideal scenario for each core count, so how are we supposed to know how much overhead there really is?

The single core scenario is 363ms and linear speedup for 32 cores gives us 11.3ms. Your benchmark says you needed 38ms. This means you achieved 31% of the theoretical performance of the CPU, which, mind you, is pretty good, especially since nobody has benchmarked what is actually possible while loading all cores by running 32 single-threaded copies of the original program. But you're advertising a meaningless "sub nanosecond" measure here.

You can just divide the speed-up by the number of cores, and that gives you the parallelization efficiency.
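
As a rough sketch of that calculation, using the figures quoted above (363ms on one core, 38ms on 32 cores); the function names are illustrative only:

```rust
/// Run time that a perfectly linear speed-up would give on `cores` cores.
fn ideal_ms(single_core_ms: f64, cores: u32) -> f64 {
    single_core_ms / f64::from(cores)
}

/// Parallelization efficiency: measured speed-up divided by core count.
/// Equivalent to ideal time divided by measured time.
fn parallel_efficiency(single_core_ms: f64, measured_ms: f64, cores: u32) -> f64 {
    let speedup = single_core_ms / measured_ms;
    speedup / f64::from(cores)
}

fn main() {
    let (single_core_ms, measured_ms, cores) = (363.0, 38.0, 32);

    println!("ideal time: {:.1} ms", ideal_ms(single_core_ms, cores));
    println!(
        "efficiency: {:.1}%",
        100.0 * parallel_efficiency(single_core_ms, measured_ms, cores)
    );
}
```

With these inputs the efficiency comes out at roughly 30%, in the same ballpark as the figure mentioned above.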

I've seen systems that can achieve 99% efficiency on thousands of nodes for real, useful applications that involve non-trivial synchronization. Now that is an impressive feat.

Sure, there is probably some extra latency to get everything running, but for a sufficiently long program run, that is all irrelevant.

> You can just divide the speed-up by the number of cores, and that gives you the parallelization efficiency.

The linked benchmark document does this. 82% for 4 cores on Spice with a small workload, and 69% for 16 cores on Spice with a large workload. Compared to about 25% for Rayon on 4 cores with a small workload and 88% for Rayon on 16 cores with a large workload.

> Sure, there is probably some extra latency to get everything running, but for a sufficiently long program run, that is all irrelevant.

The entire point of the linked benchmark README.md is to deal with insufficiently long program runs. Spice is an attempt to allow parallelization of very small amounts of work by decreasing fixed overhead. Perhaps such a thing is not useful but that doesn't prevent it from being interesting.