

by jfager 4175 days ago

In the cases where your application can benefit from parallelizing simple operations over a large data set stored in a collection, `parallel()` is fine.

It's even fine in the case where you're pulling data from a file or other low-latency sequential data source, assuming that the cost of filling a spliterator buffer is less than your cost of processing.

But there's a list of gotchas, all of which make it more dangerous than the "magic make it faster" button that `.parallel()` implies:

- For the sequential data source case, if the cost of filling the spliterator buffers is higher than the cost of processing, you're just paying a ton of overhead for nothing by trying to use parallel.

- You have to be aware that by default all uses of parallel() run on the same thread pool (the common ForkJoinPool), which makes it a potential timebomb if someone uses it in the context of, say, a webserver where multiple requests might each process streams. This also means blocking operations during stream processing are very dangerous, since one blocked task can stall everyone else sharing the pool.

- Mutating an external variable goes from being fine for a sequential stream to a race condition for a parallel one.

- You can't hand out Streams that you intend to be executed sequentially, b/c your callers can just call parallel() whenever they want.
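To make the external-mutation point concrete, here's a rough sketch (the class and method names are made up for illustration). The same lambda that bumps a captured counter is fine on a sequential stream, but once something flips the stream to parallel the unsynchronized increment can lose updates. The last method shows one way around it, which is to let the stream do the reduction itself.

```java
import java.util.stream.IntStream;

public class SharedStateDemo {
    // Counts even numbers in [0, n) by mutating a captured array cell.
    // Correct when sequential; a data race when parallel.
    static int unsafeEvenCount(int n, boolean parallel) {
        int[] counter = {0};
        IntStream s = IntStream.range(0, n);
        if (parallel) {
            s = s.parallel();
        }
        s.filter(i -> i % 2 == 0).forEach(i -> counter[0]++);
        return counter[0];
    }

    // Same computation with no shared mutable state: the stream reduces for us.
    static long safeEvenCount(int n) {
        return IntStream.range(0, n).parallel().filter(i -> i % 2 == 0).count();
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.println("sequential, mutating: " + unsafeEvenCount(n, false));
        // May come out lower than the true count because increments get lost.
        System.out.println("parallel, mutating:   " + unsafeEvenCount(n, true));
        System.out.println("parallel, reduction:  " + safeEvenCount(n));
    }
}
```

The parallel mutating version won't necessarily be wrong on any given run, which is part of what makes it nasty: it can pass tests and then drop counts under load.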

And, yes, all of these considerations make the API more complicated than one operating over plain old iterators.
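If you want to see the shared-pool point for yourself, a quick sketch like the following records which threads a parallel stream actually runs on (`CommonPoolDemo` and `threadsUsed` are hypothetical names). On the OpenJDK builds I'm aware of, the workers are named something like `ForkJoinPool.commonPool-worker-N`, plus the calling thread itself; the exact naming is an implementation detail, so treat it as an assumption rather than a guarantee.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

public class CommonPoolDemo {
    // Runs a trivial parallel stream and collects the names of the threads
    // that executed its elements.
    static Set<String> threadsUsed(int n) {
        Set<String> names = ConcurrentHashMap.newKeySet();
        IntStream.range(0, n)
                 .parallel()
                 .forEach(i -> names.add(Thread.currentThread().getName()));
        return names;
    }

    public static void main(String[] args) {
        // Two unrelated calls are expected to draw from the same set of workers.
        System.out.println("first stream:  " + threadsUsed(100_000));
        System.out.println("second stream: " + threadsUsed(100_000));
    }
}
```

Every parallel stream in the JVM that doesn't go out of its way to avoid it draws from that one set of workers, so a slow or blocking lambda in one request handler eats into the capacity available to all the others.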