Hacker News
by trsohmers 3621 days ago
The vast majority of the dynamic parts of a program that matter for scheduling (both when it comes to ILP/avoiding hazards within a core and when it comes to handling memory management for our scratchpad based memory system) are due to indeterminate latencies for memory accesses and executing instructions (due to variable length pipelines). Throw in horrible (for determinism) things like out of order execution and branch prediction and no wonder a compiler can't determine things statically! While we are not really targeting general purpose (though I would say we have the capability to evolve to it in the future), it seems painfully obvious to me where these issues have been in any general-leaning VLIW attempts in the past, and I can't understand the clinging to bad architectural decisions made 30 years ago by hardware folks who could not imagine the ability of software in the future. </rant>

Targeting general purpose from the get go is a bad idea, but it is NOT impossible to do efficiently and without sacrificing performance. You just need a well defined and constrained architecture, and a clean way to describe it.
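
To make the determinism point concrete, here is a minimal sketch of static scheduling when every latency is known at compile time. The latency table, the instruction names, and the assumption of unlimited issue width are all hypothetical and chosen for illustration; this is not the architecture or toolchain discussed in this thread.

```python
# Hypothetical fixed latencies in cycles; a real machine's numbers would differ.
LATENCY = {"load": 3, "mul": 2, "add": 1}

def static_schedule(program):
    """Return the earliest issue cycle for each instruction.

    program: list of (name, op, deps) in program order, where deps lists
    the names of earlier instructions whose results are consumed.
    Assumes unlimited issue width, so only data dependencies matter.
    """
    ready = {}  # name -> cycle at which its result becomes available
    issue = {}  # name -> cycle at which the instruction is issued
    for name, op, deps in program:
        start = max((ready[d] for d in deps), default=0)
        issue[name] = start
        ready[name] = start + LATENCY[op]
    return issue

program = [
    ("a", "load", []),
    ("b", "load", []),
    ("c", "mul", ["a", "b"]),
    ("d", "add", ["c", "a"]),
]
schedule = static_schedule(program)
print(schedule)
```

The schedule is only valid if the load really takes the stated number of cycles every time. If a load can take a variable number of cycles (a cache miss, say), the issue cycles computed for `c` and `d` are no longer safe, and the hardware has to stall or reorder dynamically.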


You have your causality relations reversed: the reason that branch prediction and dynamic caches exist is that jump targets and working sets are hard or impossible to compute statically.

Even in the restricted world of HPC, GPGPUs have been moving from statically scheduled, exposed-pipeline VLIW machines to more conventional SIMD with caches, virtual memory and branch prediction (no meaningful OoO yet, as the large amount of thread parallelism can hide the memory latency).

Also, GPGPUs have the benefit of having the large, lucrative GPU gaming market to pay for their development. How can a pure HPC machine be competitive in this market? Even for Intel, Xeon Phi is more of a prestige project than something actually meant to make money.

I've spent a long time debating with VLIW haters (whom I presume you are with), but I'd love to see any citations you have for your claim that my causality is reversed, as I have a ton of evidence (to be fair, not published yet) going for my side. While not as generally applicable as our architecture, you can take a look at basically any DSP from the past 15 years and see that VLIW works great from a performance and efficiency standpoint when your data is in a constrained form. We're showing that a compiler can structure a lot of different types of data (and the code required to actually operate on it) effectively if there are enough constraints on the hardware. It's fairly pointless to try to convince you without documentation on hand for all parties, but I hope you'll take a look in a couple of months.

As far as the market goes, we are going after a decent sized market where the customers care the most about efficiency and performance, and are not only willing but very eager to switch their current solutions for whatever is best. As the typical startup claims, we are able to do it for a fraction of the cost and in a fraction of the time it would take one of the big guys, and have a solution that is 10x better than what is out there. NVIDIA boasts that they spent $1 Billion developing the Pascal architecture, with them selling the Tesla series GPUs for it at $5,000+ a unit. We've shown we can prototype something that can theoretically beat it for under $2 million, and our hope/bet is that we can take it to market (and actually beat it by an order of magnitude) for less than $25 million. That's just HPC, which doesn't include the very interesting high end DSP area that is now using very expensive and power hungry FPGAs for wireless baseband solutions, which we think is a very good fit for us.

Just to clarify: are you trying to compete with Nvidia, or with Intel? If you're going against GPUs, is your chip something that can run neural networks (better than Nvidia)?
Short answer: If we were to implement SIMD FP16 support similarly to how we have a planned dual FP32 in our FP64 FPU, we would be able to easily match GPU performance by throwing more cores at the problem, while still being more efficient. While neural nets/machine learning is interesting, and we could potentially enable it in new forms as we can provide a desktop GPU's capability in a much smaller/lower power form factor, it is not our main focus. As the other commenter noted, there are ASICs that do a good job at that, though since we are more generally programmable than those sorts of ASICs, we would be able to handle changes in algorithms over time while some of them may not be able to.

The more interesting problems for us are things that GPUs can't do well, such as level 1 (vector) and level 2 (matrix-vector) BLAS operations. While most GPUs (and CPUs when utilizing SIMD instructions) only get a couple of percent of the performance on level 1 and level 2 BLAS compared to level 3 (matrix-matrix), we are equally performant across all three (and at a very high percentage of theoretical peak).
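
For context, one common way to see why level 1 and level 2 BLAS behave so differently from level 3 on cache-based machines is arithmetic intensity (floating point operations per byte moved). The sketch below is a rough illustration under idealized assumptions: FP64 operands, square n x n matrices, and each operand moving to or from memory exactly once. The function names are made up for this example, and it says nothing about how the architecture described above achieves its numbers.

```python
BYTES_PER_ELEMENT = 8  # FP64

def axpy_intensity(n):
    """Level 1: y = a*x + y on length-n vectors."""
    flops = 2 * n                          # one multiply and one add per element
    traffic = 3 * n * BYTES_PER_ELEMENT    # read x, read y, write y
    return flops / traffic

def gemv_intensity(n):
    """Level 2: y = A @ x with an n x n matrix."""
    flops = 2 * n * n
    traffic = (n * n + 2 * n) * BYTES_PER_ELEMENT  # read A and x, write y
    return flops / traffic

def gemm_intensity(n):
    """Level 3: C = A @ B with n x n matrices."""
    flops = 2 * n ** 3
    traffic = 3 * n * n * BYTES_PER_ELEMENT  # read A and B, write C
    return flops / traffic

for n in (100, 1000, 10000):
    print(n, axpy_intensity(n), gemv_intensity(n), gemm_intensity(n))
```

Under these assumptions the level 1 and level 2 ratios stay below a fixed small constant no matter how large n gets, while the level 3 ratio grows linearly with n, which is why matrix-matrix kernels can be compute bound while the other two are typically limited by memory bandwidth.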

Interesting. Which applications require vector-vector or matrix-vector operations as opposed to matrix-matrix?
Also, custom ASICs are the current state of the art for NN.

edit: missing word

Which custom ASICs are you talking about?
I'm referring to Google's TPU.
VLIWs have been used very successfully as DSPs for a long time; I do not think anybody is debating that. It is outside that niche that they have repeatedly been found lacking.

I'm sure your architecture would work fine for a subset of HPC problems like those that are currently run on a traditional GPGPU, but even in the HPC world many problems are ill-suited for a GPU (think particle transport).