by DeepDuh 4982 days ago

I did my master's thesis on GPGPU, so maybe I can help out a bit. I'm not yet too familiar with Epiphany's design, however. From what I could grasp, what sets it apart the most is a memory architecture that differs from multicore CPUs: the individual cores seem to be optimized for accessing adjacent memory locations as well as the memory of directly neighboring cores. This is one point where the architecture seems similar to GPUs, although GPUs have a very different memory architecture again - for the programmer it might look similar, however, especially when using OpenCL.

The main point where Epiphany diverges from GPUs is that the individual cores are complete RISC environments. This could be a big plus when it comes to branching and subprocedure calls (although NVIDIA is catching up on the latter point with Kepler 2). On GPUs, kernel subprocedures currently all need to be inlined, and a branch means that the cores not executing the current branch just sleep - Epiphany cores seem to be more independent in that regard. I still expect an efficient programming model for Epiphany to be along the same lines as CUDA/OpenCL, however - which is a good thing, by the way: this model has been very successful in the high-performance community and it's actually quite easy to understand - much easier than cache optimizing for CPUs, for example.

If we compare Epiphany to a CPU, what's mainly missing is the CPU's cache architecture, hyperthreading, long pipelines per core, SSE on each core, and possibly out-of-order execution and intricate branch prediction (not sure on those last ones). The missing caches might be a bit of a problem. The memory bandwidth they specify seems pretty good to me, but from personal experience I'd add another 20-30% to the achievable bandwidth if you have a good cache (which GPUs have had since Fermi, for example). The other simplifications I actually like a lot - to me it makes much more sense to have a massively parallel system where you can just specify everything as scalar instead of jumping through all the SSE and hyperthreading hoops like on CPUs - optimizing for CPUs is quite a pain compared to those new models.

varelse 4982 days ago

Assuming you're programming it in OpenCL, it's effectively a GPU with many more SMs but with a narrower SIMD width. If they were to give it, say, 16-way predicated SIMD with incomplete IEEE compliance on par with the Cell (~4M transistors per core plus a wider internal bus), it would become a very interesting processor IMO, with ~1.4 TFLOPS per 64-core Epiphany board. At the very least, they'd get bought out if they built such a beast and undercut NVIDIA, AMD, and Intel. Just sayin'...
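For anyone wanting to sanity-check the ~1.4 TFLOPS figure, here is a rough peak-throughput sketch. The 0.7 GHz clock and the 2 flops per lane per cycle (a fused multiply-add) are assumptions picked for illustration, not numbers given in the comment, and `peak_gflops` is a made-up helper name.

```python
def peak_gflops(cores, simd_lanes, flops_per_lane_per_cycle, clock_ghz):
    """Peak GFLOPS if every lane of every core retires work each cycle."""
    return cores * simd_lanes * flops_per_lane_per_cycle * clock_ghz


# Assumed values (not from the thread): 0.7 GHz clock, FMA counted as 2 flops.
CLOCK_GHZ = 0.7
FLOPS_PER_LANE = 2

scalar_board = peak_gflops(64, 1, FLOPS_PER_LANE, CLOCK_GHZ)
simd_board = peak_gflops(64, 16, FLOPS_PER_LANE, CLOCK_GHZ)

print(f"64 scalar cores:      {scalar_board:.1f} GFLOPS")
print(f"64 cores, 16-way SIMD: {simd_board / 1000:.2f} TFLOPS")
```

Under those assumptions the 16-way variant lands in the neighborhood of the quoted figure; a different clock or flop count per cycle shifts it proportionally.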

In the meantime, leave the fast atomic ops, ECC, and full IEEE compliance to the GPUs and Xeon Phis of the world until you have the transistor budget to go after them...

All IMO of course...

vidarh 4982 days ago

> If they were to give it, say, 16-way predicated SIMD

I think that would completely defeat the purpose of the architecture, as it'd massively bloat the transistor count per core. Their roadmap is for 1000+ independent cores on a single chip, not stopping at 64 per board.

varelse 4982 days ago

And there's the problem: my personal bias from years and years of GPU programming is that I'd rather target 4 cores with 16-way SIMD than 64 cores each with scalar, or to quote Seymour Cray - "If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?"

Besides, this is 28 nm technology and 15x15 mm, no? That's 225 mm^2. AMD's 28 nm Tahiti is 365 mm^2 with 4.3B transistors, making this thing ~2.7B transistors give or take or ~41M transistors per core. Adding 4M transistors (source: it's about 1M transistors on a Cell chip per 4-way SIMD unit) is <10% larger in exchange for 16x the floating-point power. Unless I'm missing something, I'd build that chip in a minute...
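The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a sketch of the same reasoning: it assumes transistor density scales linearly with die area at the same process node, using the Tahiti figures quoted in the comment. The function and constant names are made up for illustration.

```python
# Figures quoted in the comment above.
TAHITI_AREA_MM2 = 365.0
TAHITI_TRANSISTORS = 4.3e9
SIMD_UNIT_TRANSISTORS = 4e6  # 16-way SIMD, from ~1M per 4-way unit on Cell


def transistors_for_area(area_mm2):
    """Scale Tahiti's transistor density to another die area (same node assumed)."""
    return area_mm2 * TAHITI_TRANSISTORS / TAHITI_AREA_MM2


die_area = 15 * 15                       # the assumed 15x15 mm die
total = transistors_for_area(die_area)   # roughly 2.65e9
per_core = total / 64                    # roughly 41e6
simd_overhead = SIMD_UNIT_TRANSISTORS / per_core

print(f"total: {total / 1e9:.2f}B, per core: {per_core / 1e6:.1f}M, "
      f"SIMD overhead: {simd_overhead:.1%}")
```

The whole estimate hinges on the 15x15 mm input, so a different die size changes every number downstream.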

Which is to say I don't want 1000+ wimpy cores - it'll get smashed by Amdahl's Law - when I can have ~900 brawny cores. NVIDIA and AMD have been exploring this space for almost a decade now and to start over without considering what they may have gotten right and what they have learned while doing so seems a little daft to me.

vidarh 4982 days ago

> I'd rather target 4 cores with 16-way SIMD than 64 cores each with scalar

You're assuming problems that are suitable for SIMD. If you have problems suitable for SIMD, use a GPU. Lots of problems are NOT suitable for SIMD.

If those 64 data streams all happen to require branches regularly, for example, your 4x 16-way SIMD is going to be fucked.
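A toy model of that effect, as a sketch only: it assumes every branch path costs one time unit, that a lockstep SIMD group has to run each distinct path taken by its lanes one after another, and that groups run in parallel with each other. `simd_steps` and the input layout are invented for the example.

```python
import random


def simd_steps(branch_choices, width):
    """Time units to run all streams when `width` streams share one lockstep group.

    branch_choices[s][t] is the path stream s takes at step t. Per step, a
    group executes each distinct path among its lanes serially; the slowest
    group determines the step time. width=1 models independent scalar cores.
    """
    n_streams = len(branch_choices)
    n_steps = len(branch_choices[0])
    total = 0
    for step in range(n_steps):
        worst = 0
        for start in range(0, n_streams, width):
            group = branch_choices[start:start + width]
            distinct_paths = len({stream[step] for stream in group})
            worst = max(worst, distinct_paths)
        total += worst
    return total


rng = random.Random(0)
# 64 streams, 100 steps, each step picks one of 4 paths at random.
choices = [[rng.randrange(4) for _ in range(100)] for _ in range(64)]

print("64 scalar cores:  ", simd_steps(choices, 1))
print("4 x 16-way SIMD:  ", simd_steps(choices, 16))
```

With uniform branching (all streams taking the same path) both configurations take the same time in this model; the gap only opens up as the streams diverge.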

> Besides, this is 28 nm technology and 15x15 mm, no?

Where did you get that idea? Their site states 2.05mm^2 at 28nm for the 16 core version. 0.5mm^2 per core.

So by your math, more like ~26M transistors, or ~1.6M per core. Your estimated die size is 70% larger than what they project for their future 1024 core version...
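Redoing the arithmetic with the 2.05 mm^2 figure, as a sketch using the same Tahiti density scaling as the parent comment. Note that dividing 2.05 mm^2 by 16 cores gives roughly 0.13 mm^2 per core, and straight scaling gives a total a little under the ~26M quoted, so treat all of these as order-of-magnitude numbers. Variable names are made up for the example.

```python
# Figures quoted in the thread.
TAHITI_AREA_MM2 = 365.0
TAHITI_TRANSISTORS = 4.3e9
AREA_16_CORES_MM2 = 2.05
SIMD_UNIT_TRANSISTORS = 4e6   # the proposed 16-way SIMD unit
ASSUMED_DIE_MM2 = 15 * 15     # the parent's 15x15 mm guess

area_per_core = AREA_16_CORES_MM2 / 16
projected_1024_area = area_per_core * 1024
die_ratio = ASSUMED_DIE_MM2 / projected_1024_area

transistors_16 = AREA_16_CORES_MM2 * TAHITI_TRANSISTORS / TAHITI_AREA_MM2
transistors_per_core = transistors_16 / 16
simd_growth = SIMD_UNIT_TRANSISTORS / transistors_per_core

print(f"area per core:        {area_per_core:.3f} mm^2")
print(f"1024-core projection: {projected_1024_area:.0f} mm^2")
print(f"225 mm^2 vs that:     {die_ratio:.2f}x")
print(f"transistors per core: {transistors_per_core / 1e6:.2f}M")
print(f"4M SIMD unit vs core: {simd_growth:.1f}x the core itself")
```

On these numbers a 4M-transistor SIMD unit would be larger than the core it is attached to, rather than a <10% addition.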

Source: http://www.adapteva.com/products/epiphany-ip/epiphany-archit...

> it'll get smashed by Amdahl's Law

This is a ludicrous argument when arguing for a GPU architecture instead. A GPU architecture gets affected far worse for many types of problems, because what is parallelizable on a system with 64 general-purpose cores may degenerate to 4 parallel streams on your example 4-core 16-way SIMD.

There are plenty of problems that do really badly on GPUs because of data dependencies.
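For reference, Amdahl's Law bounds the speedup at 1 / ((1 - p) + p / n) for a parallel fraction p spread over n execution units. The sketch below compares 64 independent streams against the degenerate 4-stream case described above; the chosen fractions are arbitrary examples, and `amdahl_speedup` is just an illustrative name.

```python
def amdahl_speedup(parallel_fraction, n_units):
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


for p in (0.5, 0.9, 0.99):
    row = ", ".join(
        f"n={n}: {amdahl_speedup(p, n):6.2f}x" for n in (4, 64, 1024)
    )
    print(f"p={p:.2f} -> {row}")
```

The formula cuts both ways: the serial fraction caps what 1000+ small cores can deliver, and a workload that only exposes 4 usable streams is capped at n=4 no matter how wide each stream's SIMD unit is.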

> when I can have ~900 brawny cores

Except you can't. Not at that transistor count, and die size, anyway.

> NVIDIA and AMD have been exploring this space for almost a decade now and to start over without considering what they may have gotten right and what they have learned while doing so seems a little daft to me.

Have they? Really? They've targeted the embarrassingly parallel problems with their GPUs, rather than even trying to address the multitude of problems that their GPUs will simply sit mostly idle on, leaving those to CPUs with massive, power-hungry cores and low core counts. I see no evidence they've tried to address the type of problems this architecture is trying to accelerate.

Maybe the type of problem this architecture is trying to accelerate will turn out to be better served by traditional CPUs after all, but we know that problems that don't execute the same operations on a wide data path very often are not well served by GPUs.

varelse 4979 days ago

Mea culpa on the die size...

That said, this is where the R&D done by AMD and NVIDIA has expanded what is amenable to running on a GPU. Specifically, instructions like vote and fast atomic ops can alleviate a lot of branching in algorithms that would otherwise be divergent. It's not a panacea, but it works surprisingly well, and it's causing the universe of algorithms that run well on GPUs to grow IMO.
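To give a flavor of what that looks like, here is a plain-Python emulation of one common pattern: replacing a per-lane branch plus atomic increment with a single warp-wide vote followed by one aggregated update. This is a sketch of the general idea, not vendor code; `ballot`, the two counting functions, and the 32-lane warp size are assumptions made for the example.

```python
def ballot(predicates):
    """Emulate a warp vote: pack one predicate bit per lane into an integer mask."""
    mask = 0
    for lane, flag in enumerate(predicates):
        if flag:
            mask |= 1 << lane
    return mask


def count_per_lane(values, threshold):
    """Each lane branches on its own value and issues its own 'atomic' add."""
    counter = 0
    atomic_ops = 0
    for v in values:
        if v > threshold:      # divergent per-lane branch
            counter += 1       # stands in for an atomic increment
            atomic_ops += 1
    return counter, atomic_ops


def count_with_vote(values, threshold, warp_size=32):
    """One vote per warp, then at most one aggregated 'atomic' add per warp."""
    counter = 0
    atomic_ops = 0
    for start in range(0, len(values), warp_size):
        warp = values[start:start + warp_size]
        mask = ballot(v > threshold for v in warp)
        if mask:                              # uniform across the warp
            counter += bin(mask).count("1")   # population count of the mask
            atomic_ops += 1
    return counter, atomic_ops


values = list(range(64))
print(count_per_lane(values, 31))
print(count_with_vote(values, 31))
```

Both versions arrive at the same count; the vote-based one just gets there with far fewer contended updates and no lane-by-lane branching.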

What I worry about with Parallella is that by having only scalar cores, and lots of them, it has solved issues with branch divergence in exchange for potential collisions when reading data from and writing data to memory. The ideal balance of SIMD width versus core count is a question AMD, Intel, and NVIDIA are all investigating right now. But again, ~26M transistors - no room for SIMD...

daniel-cussen 4982 days ago

Why plow a field with 1024 chickens, when you can plow it with 1M worms?

The GA144's F18 core has ~20 thousand transistors and is asynchronous. If you make the die the size of an Opteron's and wait until you can pack 20B transistors on a die, you get one million cores.

varelse 4979 days ago

That chip would be so cool if only its native internal representation were 32-bit... Sigh...

But it's way better than this monstrosity: http://web.media.mit.edu/~bates/Summary_files/BatesTalk.pdf

DeepDuh 4982 days ago

There is certainly something to what you say. The advantage of the GPU model is that the ALUs can occupy a much higher percentage of your die if each core is less independent. Independent threads are not necessarily what you need on an accelerator card - that's what you have CPUs for anyways.