Hacker News new | ask | show | jobs
by vlovich123 356 days ago
We actually do have the abstractions but the problem is that the vast majority of for loops don’t benefit - you need to have so much work that the overhead of coordinating the threads is sufficient. Additionally, you’ve got all sorts of secondary effects like cache write contention that will fight any win you try to extract out of for loops parallelism. What we’ve been learning for a long time as an industry is that you benefit most from task level parallelism with minimal to no synchronization.
2 comments

Granted this probably isn't the parallel application that the other poster was envisioning, but it can be extremely useful when a computation depends on a large number of I/O-bound tasks that may fail, like when you are servicing a request with a high fan-out to other services, and you need to respond in a fixed time with the best information you have.

For example, if you need to respond to a request in 100ms and it depends on 100 service calls, you can make 100 calls with a 80ms timeout; get 90 quick responses, including two transient errors, and immediately retry the errors; get eight more successful responses and two timeouts; and then send the response within the SLA using the 98 responses you received.

That doesn't require parallelism, just concurrency. But yes, you'd use a similar task-local map/reduce construct to express doing a bunch of concurrent I/O in parallel (spawning each I/O on a separate thread would be counter-productive & a hack to enable not adding an event loop / async I/O).
Let the engine or compiler decide if it is small enought to run on one core.
I’m only familiar with Fortran and my openMP is a little rusty. But, I think there are different pragmas for vectorization or threading. So, you have to tell it to do one or the other (is that wrong). Instead of expressing “well we don’t have dependencies here so do what you will.”
That would be great if the engine or compiler had that kind of capability but building that requires solving the halting problem.

Even if you try to do it with heuristics, go ask Itanium how that worked out for them and they tried a much simpler problem than what you’re proposing.

I am not familiar with the Itanium story and I don't know who to ask.

But it seems to me like this would be a safe space to experiment. With heuristics and pragmas as a fallback. Because with the right approach solutions would mostly be better than not doing anything.

And you could do it in runtime when you know the size of the input.

And what about applying the logic to places where you can see that the loop will end?

I believe query planners in for example Trino/BigQuery do this already?

> I am not familiar with the Itanium story and I don't know who to ask.

https://en.wikipedia.org/wiki/Itanium#Market_reception

> A compiler must attempt to find valid combinations of instructions that can be executed at the same time, effectively performing the instruction scheduling that conventional superscalar processors must do in hardware at runtime.

> In an interview, Donald Knuth said "The Itanium approach...was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write."[222]

There were lots of people but the gist of it is that scheduling auto-optimization is not really possible to do statically at compile time and too much overhead to attempt to do generically at run time. In this case it was instruction parallelism, but the same fundamental issues apply to this kind of auto loop idea.

> I believe query planners in for example Trino/BigQuery do this already?

That is completely different from trying it on every for loop in a program. They do it at the broader task level to determine how much parallelism will help.

Well. History is full of good ideas executed with the wrong timing.

I am not convinced that this is not possible, just because a project from the early 2000s failed, and Knuth said it was a bad idea.

I am not talking about a general optimizer of poorly written arbitrary programs. Rather an optimizer for parts of the code written with a certain paradigm. Just like BigQuery does with SQL.

(Thank you for sticking with the thread)

Sure, such "optimizers" exist already and are well deployed in the field but I've not seen anything that makes runtime decisions about every for loops within a program; these are developers applying their domain knowledge to say "it's important to parallelize here" and the framework has capabilities to do so efficiently because the workload can leverage it.

I'm not saying it'll never happen, but to apply it to every for loop is very different from a framework that speeds up a kind of pattern (which as I said is already a thing). The vast majority of loops don't benefit from this which is why you don't see #openmp pragmas on top of every for loop. The gains are likely minimal from doing it for every loop, you can introduce serious performance regressions when you guess wrong because you don't know about the data pattern within the loop (e.g. you introduce cache contention because the multiple threads end up writing to the same cache line), & most loops don't benefit from parallelization in the first place meaning you're doing a lot of work for no benefit. Maybe bad timing but my hunch is that I'm unlikely to see something like this in my lifetime.