| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by suvinay 3275 days ago

Hi all. This is Suvinay Subramanian from MIT. I am one of the authors on the Fractal paper. I wanted to provide some context for this work, and briefly summarize our paper. I hope this will address some of the comments on this thread. And I am happy to answer any further queries you may have.

You may also be interested in the slides of this work presented at the International Symposium on Computer Architecture (ISCA) last week: http://bit.ly/fractal-slides

What motivated this work? While multi-cores are pervasive today, programming these systems and extracting performance on a range of real world applications remains challenging. Systems that support speculative parallelization, such as Transactional Memory (supported in hardware in commercial processors today, such as Intel Haswell, IBM's Power 8 etc.) help simplify parallel programming, by allowing programmers to designate blocks of code as atomic, in contrast to using complex synchronization using locks, semaphores etc.

However, their performance on real world applications remains poor. This is because real world applications are complex, and often have large atomic tasks either due to necessity or convenience. For example, in database workloads, transactions have to remain atomic (i.e. all actions within a transaction have to "appear as a single unit"). But each transaction can be long (~few million cycles) -- such large tasks are prone to aborts, and are expensive to track (in software or hardware).

Now, there is often abundant parallelism inside such large tasks (as has been noted in prior work). But the challenge is how to extract this parallelism while also keeping the entire unit atomic. Prior approaches, including nested Transactional Memory, and Thread-Level-Speculation (TLS) have been unsuccessful in extracting this parallelism -- section 7 of our paper explains why (or see slides).

Fractal provides a new execution model that addresses the limitations of prior approaches, and proposes a hardware architecture that can successfully scale several challenging workloads -- some of which have eluded successful parallelization for decades. Importantly, Fractal allows us to operate at the granularity of "small" tasks that are easy to track. Further Fractal enables programmers to seamlessly compose speculative parallel algorithms.

What are the key ideas of Fractal? There are two key insights: 1. Decouple atomicity from parallelism -- We introduce the notion of a domain, where all tasks belonging to a domain appear to execute as a single atomic unit. By doing this, we can execute the "subtasks" in parallel, while ensuring atomicity of the "collection of subtasks" via other mechanisms.

2. Ordering to enforce atomicity -- Unlike prior techniques that "merge" subtasks to enforce atomicity, we leverage ordering to enforce atomicity.

Tell me more about the hardware, aka the Swarm architecture. Conventional wisdom dictates that ordering limits parallelism. Now ordered parallelism is hard -- there have been several attempts at taming this beast, including Thread-Level Speculation (TLS), as alluded to in one of the comments.

In our prior work, Swarm, we demonstrated that it is possible to extract parallelism even in the presence of ordering constraints, with a more expressive execution model, and some nifty hardware tricks.

At a high level, the Swarm hardware executes tasks speculatively, and elides ordering constraints. Only true data dependences force an "ordered" execution. On the hardware side, Swarm has large hardware task queues to manage tasks, and a high-throughput commit protocol and speculation mechanisms that allow it to scale.

You can read more on Swarm at: http://bit.ly/swarmarch

Fractal relies on the Swarm hardware to enforce ordering guarantees while extracting parallelism. The key challenges were how to assign a suitable ordering to the subtasks that also maintains the atomicity of "domains" of subtasks, how to assign this ordering dynamically, and how to support "unbounded nesting", eg: if your subtask makes a library call, that itself could be parallelized (but must still remain atomic), and so on (supporting unbounded nesting is essential for software composition of algorithms).

Is Fractal / Swarm just an accelerator? We believe that Fractal and Swarm are a more general solution to some of the hardest parallel computing problems. We have shown that the ideas presented in these works are applicable in a wide range of domains ranging from databases, to discrete-event simulation, to machine learning, and graph analytics.

When will we see this in hardware? Currently, we evaluate our ideas in simulation. But our group is working on prototyping this on an FPGA in the coming year. So stay tuned!

1 comments

deepnotderp 3275 days ago

Awesome! I have a couple of questions if you don't mind:

1) Does this require the programmer to rewrite software or make annotations? Even if that's the optimal way, can a compiler do this automatically, at least to some extent?

2) how general of a workload can this accelerate? You mention speculative multithreading and HTM in both papers, both of which are fairly general purpose techniques.

link

suvinay 3275 days ago

1. Currently the programmer needs to delineate code into tasks. A couple of points to note: -- Often the programmer can start with the serial code, and determine how to delineate tasks. In our experience this is often easier than writing more complex parallelized versions of the same algorithm. Compared to the serial code, our programs are just a few percent longer. -- We have implemented high-level interfaces along the lines of OpenMP or OpenTM to make it easier for the programmer to break straight-line serial code into fine-grain tasks. For example, if you have a for loop to parallelize, you can use the fractal::forall(...) construct. Table 1 of the Fractal paper lists some of these high-level interface functions.

We are currently working on a compiler that can take serial code and automatically annotate tasks and domains. Keep watching our webpage for updates on that.

2. Our execution model subsumes both Transactional Memory (TM) and Thread-Level Speculation (TLS). Any program that can be expressed in those paradigms can be expressed in Fractal. And as we show in the paper, Fractal can express parallelism that was not possible in TLS or TM. So we believe it is general purpose. Further. as indicated above, we have seen these ideas to be applicable to applications in a wide range of domains.

That being said, we are always on the lookout for applications that reveal deficiencies in our execution model, to help improve it :)

link

deepnotderp 3275 days ago

1. Yes ,I've seen that, and I think that an OpenMP-esque annotation system specified by the programmer would be optimal, but I don't see any reason that a "sufficiently smart compiler" [ ;) ] can't produce an approximation. Given the SpMT-like "thread" (task) squashing, it shouldn't cause any semantic issues.

2. Absolutely, SpMT has been proposed as a general purpose technique, that's why I found it odd that the benchmarks weren't on a general purpose benchmark like SpecINT, do you have any benchmarks on SpecINT or any plan to do so?

link

suvinay 3270 days ago

2: We are currently working on a compiler to automatically delineate tasks, assign domains etc. given sequential code. As part of that effort, we are looking at SPEC workloads, yes.

link