Hacker News new | ask | show | jobs
by DarkShikari 5178 days ago
If only it was that easy: change programming languages, change programming models, and poof! Magical parallelism.

But parallelism is harder than that. It's an algorithm problem, a design problem, not a language or code problem. While OpenCL might be harder to write than plain C, for anything except the most embarrassingly parallel problems, that difficulty pales in comparison to making the solution parallel to begin with.

For every problem where you can simply split into masses of shared-nothing tasks, there's a dozen others where you can't. Rate-constrained video compression. Minimax AI search. Emulation. All of these can be parallelized, but it requires parallelizing the algorithm and making sacrifices that are light-years beyond what a dumb compiler that isn't even allowed to change program output (let alone rewrite half the program) could do.

Modifying an algorithm -- possibly even changing its structure entirely, changing its output, or even accepting nondeterminism -- is inherent complexity, to use the terminology of Mythical Man Month. No programming language or tool can eliminate this complexity. Good tools can make it easier to implement an algorithm efficiently, but they really can't take a complicated application and figure out how to change its design, structure, and behavior. Until we have AI-complete compilers that take program specs as "code" to be compiled, that's a human job.

5 comments

One significant issue is that most programming languages are inherently temporally inexpressive, having either full order (imperative/impure functional) or no order (purely functional). Ideally the language would make it easy to organize statement in such a way that a programmer could indicate a partial ordering relation without having to use function composition. Additionally, a big problem with function composition in purely functional programming is that multiple long function compositions can only be ordered as monolithic units, however in many cases being able to order at the individual function level is desirable. If a language had an explicit notion of relative temporal index for statements (which could be adjusted) it would be a nice win for writing concurrent code. That would also let compiler writers lift a lot of the stuff they do up to the program source level as macros (which would be a HUGE win).
A futures library solves the case where, in an imperative programming language, the programmer writes a block to compute A, a block to compute B, then combines A and B. Pure FP works fine in that case, since writing (+ (compute-A) (compute-B)) does the same ordering.

It seems to me pure FP is fine unless computations for A and B have interdependencies, which would result in duplicate computation in pure FP unless you can refactor to a different algorithm. I'm not sure I understand what you mean by wanting to control ordering at a "the individual functional level", unless you mean interdependencies between computations done in "monolithic units".

    A->(loop B until done)->C
    A->(loop D until done)->E
    
    C-\
      +-> F
    E-/
Something like that. Am I one the right track?

No, because I can still express that in pure FP:

    (let [(A (compute-A))]
      (let [(C (compute-C A)
            (E (compute-E A)]
         (compute-F C E)))
Trying again ...

    A->(loop B until (valid B D))->C
    A->(loop D until (valid D B))->E
    
    C-\
      +-> F
    E-/
Is that better? Care to help me understand what you're getting at?
I was thinking more along the lines of working around data-locality issues. For instance, imagine that you have code that requires very high latency fetches, or you are working with a data set that doesn't fit in memory. Typically, you have to develop modified algorithms that for these scenarios, but there is no reason that a minimal fiber scheduler couldn't adapt seamlessly. Even better, if your conditions change (like for instance, mobile devices moving from low throughput to high throughput links) a scheduler can adapt, but the hand rolled algorithm must be re-coded.
Did you mean a data-flow language. A language that is driven by the stream of data? Mozart/OZ implements some of that. Also see here: http://en.wikipedia.org/wiki/Dataflow_programming
I am a big fan of synchronous dataflow as a way to organize programs. It meshes well with the idea of modelling knowledge rather than "programming".
> No programming language or tool can eliminate this complexity.

But some languages can help get you there. Programming languages are tools and these tools come with idioms and paradigms of their own.

Certain paradigms help, often by constraining the developer to think and write in a way that makes the final result more parallelizable.

For example functional languages with immutable data structures help guide the implementation towards that. Actor models also do that.

Take a program in Erlang that someone wrote 5 years ago. They split it into 10 concurrent, cpu-bound, actors. Years ago it ran on a single CPU hardware. So the code was robust and it benefited from process isolation but didn't actually execute in parallel. It might have even been slower than the equivalent C++ or Java code.

Now all of the sudden there is a performance requirement. The site goes "viral" as they say. Now there are millions of requests per day not hundreds. If the program was written idiomatically correct all they have to do is get a big hunkin' box with 16 cpus and lots of ram. And all of the sudden those actors are working in parallel. No change to the code needed.

Next month the site goes "pandemic" so now they have to think about scaling beyond a single machine. No problem, they just move some of the processes (actors) to another node running on another machine. The way to send a message to an actor is still the same Pid ! Msg, just like it was in 2006 when running on a single core CPU. Still minimal code changes needed.

This is an example where languages and toolsets help guide developers in a certain way.

You brought the example of algorithms. The thing is, I believe, not that many programmers these days design algorithms, unless they are involved in research or are in graduate school. Most programmers engineer systems by stitching together components and APIs. I think with the right frame of mind and after some practice, it is not that hard to see how to split a lot of these components into concurrent pieces that can then be parallelized if the correct language and toolsets are used. It is often not the inherent problem but the way the mind of the programmer has been molded by years of applying another paradigm (ex. imperative, OO, mutable state).

You seem to be of the opinion that designing a parallel architecture or algorithm is really the difficult bit, and that what we need is the magical paralleliferizing compiler which as you say, doesn't exist.

Designing a parallel approach to a problem, where such a solution is possible, is always going to be easier than trying to implement it correctly using threads and locks and mutable state.

FP can make some of the basically-impossible possible, with e.g. STM to avoid thread deadlocks and starvation, or even better, by supporting deterministic parallelism so you don't have to mess with threads and concurrency at all.

Designing a parallel approach to a problem, where such a solution is possible, is always going to be easier than trying to implement it correctly using threads and locks and mutable state.

I haven't found this to be true in my (limited) experience. In writing sliced-threading support for x264, threading issues took up only a small percentage of my time. Admittedly, this is a rather simple application: the frame is split up into independent threads which never wait on each other, communicate asynchronously without any attempt at determinism, and all finish before the program can continue.

My experience is limited, but a generalization like "always going to be easier" seems rather at odds with reality. I have never found dealing with mutexes to be difficult in any situation. What I have found to be difficult is trying to design lock-free code, using memory barriers, and other trickiness to avoid slow locks in situations where the overhead of locking is just intolerable. This is definitely an application where things like pre-made lockless datastructures and the like come in handy.

I've been doing some real-time audio programming lately, which requires the same techniques. You can't take a lock or even malloc in this kind of code, so you're forced to implement things like lock-free ring buffers to communicate with the rest of the world. In my experience this is 10x harder than conventional lock-based parallelism and, unfortunately, I don't see any FP lang coming to the rescue in this domain any time soon.
> My experience is limited, but a generalization like "always going to be easier" seems rather at odds with reality.

You're probably better at reasoning about nondeterminism than I am. In my also limited experience, even trivial parallel algorithms are tricky to get right. So it seems to follow that an algorithm that took more thought (say, something using speculative parallelism) would likewise be even trickier to implement correctly.

What you say is true (http://en.wikipedia.org/wiki/Amdahls_law), but it's not really the point. Most programming is not that hard, it's just moving data from one place to another, operating on that data, and then moving it somewhere else. Using functional languages trivially allows you parallelize these types of operations, and forces you to think about problems in a way that will allow you to parallelize other things more easily.

So you're right, good design is a human's job, but when you're just doing things the grunt work that most of software development is, having tools that enforce The Right Thing is very helpful.

This isn't about Amdahl's law; this is about the fact that many problems are simply hard. Let's look at some examples.

One common example is the need to split a larger task into smaller subcomponents, but where all of the components combined need to obey some global constraints. A specific example: we're compressing a video frame, the result must fit within a certain size, and we can't parallelize over multiple frames for latency reasons. This means we need to split the frame into chunks, but somehow all the chunks have to communicate with each other in real-time, as they work, in order to ensure they obey the global constraint. Do you have a "master" thread that manages them all and makes decisions? Do you use some sort of algorithm where they act as separate agents, asking each other questions? Suddenly this is a lot more complicated than what you started with.

Another example is a search algorithm. Whether you're performing a minimax search of a game tree or simplex optimization, you're implementing algorithms that are normally not parallel. Parallelizing minimax is actually incredibly difficult and requires making a lot of hard decisions about where you branch threads, where a thread gives up, how to avoid duplication, and so forth. Fancy programming tools can give you features like thread-safe hash tables that help you, but they don't solve the actual problem. See any multithreaded chess engine for an example of this problem. Note particularly that the engines don't get perfect speedup from multithreading -- but it's NOT because of Amdahl's Law! It's because the searches between threads unavoidably duplicate at least some work.

"Moving data, operating on it" would be grossly oversimplifying real-world, complex programs like these, and there's nothing a functional language would do to "trivially parallelize" them. Dependencies in calculations are often so tangling that you cannot naively parallelize them without making dramatic, possibly sacrificial, changes.

Tools like FP can be useful, but they don't solve problems of inherent complexity. There is no silver bullet.

I have a little hunch that any parallelism "answer" is going to be derived primarily from ever-more-grand exploitation of its strengths, not patching of its weaknesses.

We know that in a formal reasoning sense, an asynchronous system is less powerful than a synchronous one, because you can implement any async design in a sync system, but not the other way around. Yet in practical use, asynchronous, "hard" decoupled systems are the ones that scale better and are easier to maintain. So we keep telling ourselves, "decouple everything" and have invented a gross array of patterns and techniques that add "decoupling pixie dust."

We know this, but usually don't extrapolate it to its natural conclusions: we're using brute-force engineering to attack our needs from the wrong direction - trying to break arbitrary sequential computations into parallel ones via sufficient smartness, instead of folding parallel ones back into sequential ordering where it produces wins, and breaking out the "sequential toolbox" only if the problem really taxes our abilities of algorithm design, as in the cases you have outlined.

Of course, adhering to parallel designs from the beginning is hardly a silver bullet either, but there are real wins to be had by exploring the possibility, and a good number of them can be experienced today simply by working that style into "ordinary" imperative or functional code with abstractions like actors and message-passing.

Correct me if I'm wrong, but in a formal sense you get the synchronous system inside the asynchronous one trivially. So the two are of equivalent power.

If you imagine that in the case of threading systems, you can implement a multi-threaded system with your own green threads library, which is async on top of sync. You can get sync on top of async with native threads by simply using one thread. Or, if you want to be complicated, by using multiple threads with locks that serialize to a single synchronous ordering.

> We know that in a formal reasoning sense, an asynchronous system is less powerful than a synchronous one, because you can implement any async design in a sync system, but not the other way around.

As someone else pointed out as well, I just don't see it. If components of an asynchronous system can wait on an event you can make it synchronous by driving it with deterministic synchronous events. Imagine a bunch of independent actors that always wait for a clock signal and then processing one piece of data then waiting for next and so on.

So I actually see a synchronous system a restricted case of an asynchronous system.

If you think about it, the world is inherently asynchronous. If you have 2 agents in the real world. They process and do things asynchronously, there is no global event or clock system that drives everything.

When you start talking about latency, you are inherently talking about Amdahl's Law because the question that gets begged is, "can we make amount of work X complete in less time with more processing power Y" ? And in your problem domain of video coding, esp if it is for real-time face to face communication or telematics, latency matters. So Amdahl's Law matters because it dictates the bound on latency.

Doing more work X (making the pie bigger) in about the same time as Y is http://en.wikipedia.org/wiki/Gustafson%27s_Law, latency, like the time of light it is one of those physical universal properties we won't be able to tech our way out of. Latency is inherently a serial problem. What changes when we could try every solution at the same time?

Nearly all hard problems we are solving right now don't need to have exact solution. They can be time bounded or quality bounded or both. What we lack is a clarity in being able to create algorithms that are probabilistic and progressive. Algorithms that refine and converge towards a solution over time. Better languages can help with that.

Here are some examples of probabilistic parallel minimax algorithms.

http://www.top-5000.nl/ps/Parallel%20randomized%20best-first... http://www.math.md/files/basm/y2010-n1/y2010-n1-(pp33-46).pd...

I don't think value of FP is that it solves the complexity of parallelism. Rather it is that once one has solved the problem (designed the algorithm), FP can help assure the solution is implemented correctly. This in and of itself is a large advantage.
I use functional programming because I lack the mental ability to hold so much state in my head. A friend of mine is a badass at programming in C++ using threads. I don't and will not have the ability he does. So I use FP and my programs are correct and by being functional the options to speed up my code are also easily proved correct.

Being able to handle large amounts of complexity and being ok with it is not necessarily a good thing. If I could reason like a compiler I would probably just use a flat memory model and address each byte individually.