| I think you have a good point about sorts, in that case mutation is almost always faster, but it seems to me that a large contributor, is that often much data doesn't need to be moved. So clearly no work is better than some work. Having said that, I hardly see a sort being the majority of time spent in real-world code. > Because thread-barriers are no-ops on a SIMD computer My understanding is that only holds true across the SIMD lane, and the algorithm doesn't hold up so generally to general purpose concurrency. But agreed, a sort is something I'd likely try to optimize on a single core, and express program concurrency outside that specific operation. > What algorithms are you thinking of where copying is faster than mutating? At best, a copy ties mutation in speed. My main point is it's actually the case that a copy ties mutation in speed quite often if you structure the flow of your data correctly. If you're operating on a value, writing it somewhere else is often close to free. Having said that as an example, I've seen filtering a list run faster doing a move, highly dependent on the details of the cache of course. > that can beat the Knuth Shuffle's O(n) computation complexity. I haven't considered shuffles specifically. But getting into the rut of complexity evaluation alone dismisses important aspects of the machine and the data you are actually operating on, and can lead you very far from the fastest solution for the problem you are actually trying to solve. |
So, here's something. I think that a copying-based methodology can be fast, but more importantly, a copying-based methodology can be easier to parallelize.
If a mutating state is too hard to parallelize, I think rewriting things to be copy-based, and then using that copy-based methodology as a basis for parallel code, is a good idea.
What's difficult is that "mutating state" is a local maximum so to speak: probably the fastest that any single-thread can get to. From this perspective, finding parallelism from a "higher level" can be more useful than trying to parallelize a well-optimized single-threaded program.
----------
Case in point: Multithreaded producer-consumer queues are difficult to write and even define. But multithreaded producer-producer queues (and consumer-consumer queues) are very easy: just an obvious "atomic_add(tail, 1)" (Producer) and "atomic_subtract(tail, 1)" (for consumers) kind of thing.
I call it a "producer-producer" queue, because you need assurances that no one is consuming from the queue for it to work. But if you synchronize your threads to all copy to a queue (and no one consumes from it), its really fast, and actually very parallelized. Ditto for the reverse (the consume phase).