by mbenjaminsmith 4995 days ago
I don't really have any comment on the project itself (not something I would ever use and I don't know the value of what they're proposing).

But on purely geek terms this thing seems to warrant a "holy shit":

http://www.adapteva.com/products/silicon-devices/e64g401/

Again I don't know how (un)common that sort of thing is but I wasn't expecting to see 64 cores in that tiny form factor. Does anyone here know how cutting edge this thing is if at all?

[Edit]

Also does anyone here want to address use cases for this thing?

5 comments

Well, NVIDIA's Kepler GPUs have 1536 cores on something like 320mm^2. I can't really find the die size of that Adapteva product, but I'd say it comes out in a similar range.

Having looked at the data a bit more: I like their specs concerning system balance. 100 GFLOPS over 6.4GB/s gives you a system balance of 15.625 FLOPs per byte of memory traffic, which is about the same balance as a Westmere Xeon - pretty good for real-world algorithms.

For comparison: NVIDIA Fermi has a system balance of about 20. Meaning: Fermi becomes memory-bandwidth bound sooner, and memory bandwidth is very often the limiting factor in real-world computations.
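The balance figures above are just peak FLOPS divided by memory bandwidth. Here is a minimal sketch of that arithmetic, assuming the quoted numbers are right. The function names and the 18 FLOPs-per-byte kernel are made up for illustration, and the bandwidth-bound test is a simplified roofline-style rule of thumb rather than a real performance model.

```python
def system_balance(peak_flops, bandwidth_bytes_per_s):
    """Peak FLOPs available per byte moved to or from memory."""
    return peak_flops / bandwidth_bytes_per_s


def is_bandwidth_bound(kernel_flops_per_byte, balance):
    """A kernel doing fewer FLOPs per byte than the balance waits on memory."""
    return kernel_flops_per_byte < balance


epiphany_balance = system_balance(100e9, 6.4e9)  # figures quoted above
fermi_balance = 20.0                             # "about 20", as quoted above

# Hypothetical kernel that performs 18 FLOPs for every byte it moves.
kernel_intensity = 18.0

print(epiphany_balance)
print(is_bandwidth_bound(kernel_intensity, epiphany_balance))
print(is_bandwidth_bound(kernel_intensity, fermi_balance))
```

The lower the balance, the fewer FLOPs per byte a kernel needs before the chip stops waiting on memory, which is why a balance of about 15.6 looks better than about 20 here.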

One thing though: High Performance Computing is all about software / tooling support. If this company comes out with OpenCL support in C (or even better, Fortran 90+), then we're talking.

Edit: By similar 'range' I meant core per mm^2 ratio.

Prepare to be surprised. The die size estimate for the Epiphany IV is 10mm^2 according to Adapteva. It is more appropriate to compare it to embedded GPUs than desktop GPUs in die size, power and performance.

For example, one particular embedded 40nm GPU design that I know about can deliver about 25 GFLOPS or so in the same die area.

Some of that GPU die area is used for graphics features that compute programs don't need. But a lot of it is providing performance even though it's not providing FLOPS. GPUs have caches and multithreading for a reason; if you could get better performance with an ultra-simple architecture then ATI/Nvidia would have done that already.
GPUs are primarily designed to be good for graphics, which implies a completely different internal architecture. While GPUs have some graphics-oriented functional units, the main factor is that all these cores have to access a pretty large chunk of shared memory (textures, frame buffer...) and do that uniformly fast (and also support some weird addressing modes and access patterns). I suspect that a large part of the die area of a modern GPU is interconnect, and that there are really only a few very wide cores (something like VLIW+SIMD, possibly with UltraSPARC IV style hyperthreading, though that can be faked by the compiler given a sufficiently large register set) that are made to look like a large number of simple cores by magic in the compiler (which seems consistent with the CUDA programming model).

So: you can get large amounts of performance with a simple architecture, but only for some problems, and graphics is not one of them.

Sorry, but I have to correct a little bit here. Today's GPUs are

- not simple SIMD. NVIDIA calls it SIMT (single instruction multiple thread), mostly since you can branch a subset of them, so for the programmer it does feel somewhat like threads.

- not just optimized for graphics anymore. E.g. since Fermi, the Tesla cards have DP performance = 50% of SP, which was introduced specifically for HPC purposes. They have also constantly improved the schedulers to move further toward general-purpose computing, e.g. Kepler 2 seems to support arbitrary call graphs on the device. Again, that's useless for graphics.

- suitable for pretty much all stencil computations. Even for heavily bandwidth-bound problems GPUs are generally ahead of CPUs since they have very high memory bandwidth. The performance estimate I use for my master's thesis comes out at 5x for Fermi over a six-core Westmere Xeon for bandwidth-bound problems and 7.5x for compute-bound problems.
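To make the SIMT point in the first bullet concrete, here is a toy model of lock-step execution with a branch. It is only a sketch: the group size of 4 is picked for readability (NVIDIA warps are 32 threads wide), the function name is made up, and real hardware scheduling is far more involved. The idea is that when the lanes of a group disagree on a branch, the group runs both sides one after the other and masks off the lanes that did not take the current side.

```python
def simt_execute(values, group_size=4):
    """Compute y = x * 2 if x is even else x + 1, one lock-step group at a time.

    Returns the results and the number of branch passes issued. A group
    whose lanes all agree needs one pass; a divergent group needs two.
    """
    results = []
    passes = 0
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        takes_then = [x % 2 == 0 for x in group]
        out = [None] * len(group)

        # Pass for the "then" side: only lanes with the mask set do work.
        if any(takes_then):
            passes += 1
            for lane, x in enumerate(group):
                if takes_then[lane]:
                    out[lane] = x * 2

        # Pass for the "else" side: the remaining lanes do work.
        if not all(takes_then):
            passes += 1
            for lane, x in enumerate(group):
                if not takes_then[lane]:
                    out[lane] = x + 1

        results.extend(out)
    return results, passes


uniform_results, uniform_passes = simt_execute([0, 2, 4, 6])
divergent_results, divergent_passes = simt_execute([0, 1, 2, 3])
print(uniform_passes, divergent_passes)
```

The divergent group does the same amount of useful work as the uniform one but takes twice as many passes, which is the cost behind "you can branch a subset of them".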

HPC is all about performance per dollar, performance per watt - and (sadly) sometimes Linpack results because some institution wants to be at the top of some arbitrary list. In all of these aspects GPUs come out ahead of x86, which has been very dominant since the '90s. Which is why GPUs are now in 4 of the top 20 systems - each of which represents hundreds of millions of dollars in investment. That wouldn't be done if they weren't suitable for most computational problems.

My point is that GPUs have a significantly different architecture from most of these "many cores on a chip" designs. The original reason for that was clearly that such an architecture was necessary for graphics; coincidentally, it also works better for many interesting HPC workloads. It's clear that manufacturers are introducing technologies that are not required for graphics, but they cannot be expected to make modifications that would render their GPUs unusable for graphics.

And as for SIMD/SIMT, I mentioned SIMD mostly in relation to operations on short vectors done by one thread, which is mostly irrelevant to the overall architecture of the core, as it can very well be implemented by pure combinational logic in one cycle given enough space. My mental model of how a modern GPU core (physical, not logical) actually works is essentially some kind of simplistic RISC/VLIW design with a large number of registers, with the compiler and/or hardware interleaving instructions of multiple threads into one pipeline. That may or may not be how it actually works, but it looks probable to me.

In my opinion most chips like Epiphany IV or XMOS or whatever, in contrast to GPUs, are useful for only limited classes of workloads as they tend to be memory starved.

Ok, so that's 6.4 cores per mm^2 while Kepler has 4.8. Not bad, considering NVIDIA has already shrunk the scheduling resources, register blocks and cache sizes per core to a bare minimum (something I don't agree with, btw).
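For anyone checking those density figures, this is the arithmetic, using only the numbers quoted in this thread (64 cores on roughly 10mm^2, 1536 cores on roughly 320mm^2). The helper name is made up, and the comparison is only as good as the assumption that a "core" means the same thing on both chips.

```python
def cores_per_mm2(cores, die_area_mm2):
    """Core count divided by die area in mm^2."""
    return cores / die_area_mm2


epiphany_density = cores_per_mm2(64, 10)     # Epiphany IV, figures from this thread
kepler_density = cores_per_mm2(1536, 320)    # Kepler, figures from this thread

print(round(epiphany_density, 1), round(kepler_density, 1))
```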
What do you consider a Kepler core? The Epiphany cores are dual issue RISC processors that run independent C/C++ tasks.
Kepler has 8 "SMX" with 192 parallel SP threads each. For me the number of cores = the number of parallel threads, although on a GPU they are not as independent, i.e. each "core" of an SMX either executes the same instruction on adjacent data or does a nop. By dual issue, do you mean a two-stage pipeline or two threads in parallel, both performing FLOPs?
"Dual issue" means "can issue two instructions per cycle", independent of pipeline depth or multithreading. In the case of the Epiphany, it can issue an ALU instruction and a floating point instruction each cycle.
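A toy way to see what that buys you: count the cycles needed to issue an instruction stream in order, with and without the ability to pair one ALU instruction with one floating point instruction per cycle. This is a sketch under heavy assumptions. It ignores data dependencies, memory operations and whatever pairing rules the real Epiphany has, and the function name and instruction stream are made up.

```python
def issue_cycles(stream, dual_issue):
    """Count cycles to issue `stream` in order.

    `stream` is a list of "alu" / "fp" tags. With dual_issue, two adjacent
    instructions go out in the same cycle when one is ALU and the other FP.
    """
    cycles = 0
    i = 0
    while i < len(stream):
        cycles += 1
        can_pair = (
            dual_issue
            and i + 1 < len(stream)
            and stream[i] != stream[i + 1]  # one "alu" and one "fp"
        )
        i += 2 if can_pair else 1
    return cycles


# Hypothetical instruction mix: two pairable ALU/FP pairs, then two FP ops.
stream = ["alu", "fp", "alu", "fp", "fp", "fp"]

single = issue_cycles(stream, dual_issue=False)
dual = issue_cycles(stream, dual_issue=True)
print(single, dual)
```

The two trailing FP instructions cannot pair with each other, so the gain depends on how well the code mixes integer and floating point work.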
OpenCL SDK support is provided. They also have a C/C++ compiler with OpenMP support.

edit: No OpenMP support.

Sounds pretty good, thanks for the heads up. Now I'm curious to see some benchmarks as soon as someone puts 20 or 30 of these on a board with lots of GDDR3 RAM :).
Sorry, apparently I was wrong about OpenMP support. It's not there right now, but there is a C/C++ compiler with OpenCL.
OpenMP support would be interesting, and should be possible by extending what we did for the OpenCL support. The basic machinery is very similar. Also, someone mentioned Fortran. There are Fortran bindings for the STDCL API that is built on top of OpenCL, so this could help interface to existing Fortran codes and provide a partial solution for Fortran programmers.
http://en.wikipedia.org/wiki/TILE64

Tilera did a very similar-looking 64-core chip in 2007, which is the oldest instance I know of off the top of my head. Their devices cost (or at least used to cost) a few grand though. Tilera has bumped it up to around 100 cores per chip these days. I don't know anything about either architecture, so it is hard to say how 64 1GHz Adapteva cores compare with 64 1.5GHz Tilera cores.

So not quite cutting edge, just an underexplored side channel.

I haven't looked at the specs, yet, but this is what I've had in mind: Roll one out to help with deep packet inspection of some of my network traffic. Spam filtering might also be offloaded to one of these guys.

Dedicated machines to host backend applications -- SQL servers, Apache, nginx, etc.

You'll need this if you want energy-efficiency when solving parallelizable problems. Use one chip in energy-limited systems, like battery-powered robots. Use multiple chips in power/heat-limited systems, like supercomputers.
Don't modern GPUs have essentially thousands of cores?
It really depends on what you consider a core.

http://www.anandtech.com/show/2918/2

That first picture shows 4 cores made of 4 sub cores with 32 processing elements each. Now Nvidia would claim each of those 32 processing elements is a core, but those cores cannot act independently. So it is more like a very wide, heavily hyperthreaded 16-core processor.

I think NVIDIA's definition of a 'core' has some merit. First of all, the cores have some independence in that you can introduce branches over a subset of them, so they're not just SIMD vector units. Secondly, their threaded programming model is pretty well suited for many computational tasks. Executing the same operations over a whole 2D or 3D region of data is a pretty common thing in computing. If you can't parallelize your task that way, chances are it's not even parallelizable on N x86 cores. If you compare this to x86, however, you'd have to count n cores times the SSE vector length on each core to be fair. GPUs still come out ahead for most heavy computational tasks though - which is why Intel is now fighting back with their Xeon Phi stuff (which sounds very promising btw., looking forward to playing with our prerelease model that's coming soon ;) ).
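As a rough illustration of that "cores times vector length" counting: SSE registers are 128 bits wide, so each x86 core handles 4 single-precision floats per vector instruction. The sketch below applies that to a six-core Xeon and compares it with the 4 x 4 x 32 processing elements from the picture linked above. The helper name is made up, and lane counts say nothing about clock speed, issue width or memory bandwidth, so this is a count of parallel lanes and not a performance estimate.

```python
SSE_REGISTER_BITS = 128   # width of an SSE vector register
SP_FLOAT_BITS = 32        # single-precision float


def simd_lanes(cores, vector_bits=SSE_REGISTER_BITS, element_bits=SP_FLOAT_BITS):
    """Parallel single-precision lanes: cores times vector length."""
    return cores * (vector_bits // element_bits)


xeon_lanes = simd_lanes(6)           # six-core x86 part with SSE
gpu_processing_elements = 4 * 4 * 32  # layout described in the comment above

print(xeon_lanes, gpu_processing_elements)
```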
Yes, but modern GPUs also use upwards of 100W of power, and their cores operate mostly in lock-step, which means they aren't fast for all kinds of tasks.