by Nomentatus 3622 days ago
This is a 2015 story that I remember reading, then. Google news search shows only a couple articles this year about Rex Computing and only one tiny bit of news, that they're at tapeout. That's probably par for the course for a startup creating product (or prototype) one. http://semiengineering.com/power-centric-chip-architectures/

also a speaking engagement: http://insidehpc.com/2016/01/call-for-papers-supercomputing-...

and a comment elsewhere that mentions another approach: the "Mill CPU of Mill Computing"

As I recollect (perhaps quite wrongly) Itanium (VLIW) failed because compiler-writers couldn't really be bothered or couldn't mount the learning curve. So I'm most curious about what progress is being made on the compiler side.

3 comments

You are correct that we have already taped out. We haven't made any announcements yet, but we will be talking publicly about it in the future, with a big focus on the "magic" on the software side.

You can read my comments on the Mill architecture elsewhere on HN (not a fan of stack machines), but my biggest disappointment in them is the fact that they have been working on Mill for ~10 years with a team ranging from 5 to 20 (from what I have heard) and have yet to get to silicon, while we have gone from a complete custom architectural idea to tapeout in ~11 months from closing our first seed funding.

The big technical failure point for Itanium (in my opinion) is the fact that Intel took the relatively pure VLIW research by Josh Fisher @ HP Labs and tried to add a ridiculous number of features (and attempted x86 compatibility) that impacted the ability to statically schedule instructions. The resulting bastard architecture Intel called "EPIC" (rather than VLIW) had a very difficult job in getting the compiler to generate instruction parallel code since Intel added a huge amount of indeterminism into the architecture that goes against the original VLIW tenets. If your compiler has to assume the worst case latency for all instructions and memory operations, you are going to have a bad time.
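To make the worst-case latency point concrete, here is a minimal sketch of a static scheduler that only tracks dependencies. The operation list and the latency numbers (a 1-cycle scratchpad-style load versus a 100-cycle planned-for cache miss) are made up for illustration and do not describe Itanium or any real chip.

```python
# Hypothetical illustration: how the latency a compiler must ASSUME
# changes the length of a statically scheduled dependency chain.

def schedule_length(ops, latency):
    """Return the cycle at which the last result is ready.

    ops: list of (name, kind, deps) in program order; deps name earlier ops.
    latency: dict mapping kind -> cycles the compiler assumes for that kind.
    Issue width is treated as unlimited, so only dependencies constrain it.
    """
    ready = {}  # op name -> cycle at which its result is available
    for name, kind, deps in ops:
        start = max((ready[d] for d in deps), default=0)
        ready[name] = start + latency[kind]
    return max(ready.values())

# Two independent loads feed an add; the sum is used as an address for
# another load, whose result feeds a final ALU op.
ops = [
    ("a", "load", []),
    ("b", "load", []),
    ("c", "alu", ["a", "b"]),
    ("d", "load", ["c"]),
    ("e", "alu", ["d"]),
]

deterministic = {"load": 1, "alu": 1}    # assumed fixed-latency memory
worst_case = {"load": 100, "alu": 1}     # assumed "plan for a miss" latency

print(schedule_length(ops, deterministic))  # 4
print(schedule_length(ops, worst_case))     # 202
```

The static schedule built on the pessimistic numbers stays 202 cycles long even when every load would actually have returned quickly, which is the "bad time" described above.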

> while we have gone from a complete custom architectural idea to tapeout in ~11 months from closing our first seed funding.

To my understanding, the Mill project is not financed. They're enthusiasts working for sweat equity, and are likely going to seek (non-controlling?) investment to finally hit silicon when they're ready.

For the scope of what they're doing, I think it's a defensible enough approach. It's not something that can be created in evolutionary stages; all designs of all parts need to be working together properly for there to be benefit from any part, and it's quite complex while also trying out tons of novel designs.

(and the Mill isn't stack-based or stack-related. It's basically a crossbar of recent ALU/Load results being fed into further ALU/Store inputs in parallel. The belt is just some way to represent the set of recent results.)
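A rough way to picture "the set of recent results" is a fixed-length queue whose entries are addressed by position rather than by register name. This is a simplified toy model, not the Mill's actual specification: the `Belt` class, its length of 4, and the operations below are all hypothetical, and it ignores the parallel issue that the crossbar description implies.

```python
from collections import deque

class Belt:
    """Toy model: newest result sits at position 0, oldest falls off the end."""

    def __init__(self, length=4):
        # A bounded deque discards from the opposite end when full.
        self._slots = deque(maxlen=length)

    def drop(self, value):
        """Place a new result at the front of the belt."""
        self._slots.appendleft(value)

    def __getitem__(self, pos):
        """Read an operand by how recently it was produced."""
        return self._slots[pos]

    def contents(self):
        return list(self._slots)

belt = Belt(length=4)
belt.drop(3)                    # belt: [3]
belt.drop(4)                    # belt: [4, 3]
belt.drop(belt[0] + belt[1])    # 4 + 3 = 7    -> [7, 4, 3]
belt.drop(belt[0] * belt[2])    # 7 * 3 = 21   -> [21, 7, 4, 3]
belt.drop(belt[0] - belt[1])    # 21 - 7 = 14  -> [14, 21, 7, 4]; the 3 falls off

print(belt.contents())  # [14, 21, 7, 4]
```

Operands are named by position ("the result from two operations ago"), and nothing is pushed or popped in stack order, which is the distinction the comment above is drawing.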

Itanium failed for the same reason every other VLIW failed as a general purpose CPU: there just isn't enough information at compile time to model the dynamic properties of a program. In fact many of Itanium's additions (strange instruction packing, alias disambiguation hardware) were attempts at overcoming this issue.

The only moderately successful general purpose VLIWs are Crusoe and the related Denver, and they use a runtime translation layer to collect the required dynamic information.

The vast majority of the dynamic parts of a program that matter for scheduling (both when it comes to ILP/avoiding hazards within a core and when it comes to handling memory management for our scratchpad based memory system) are due to indeterminate latencies for memory accesses and executing instructions (due to variable length pipelines). Throw in horrible (for determinism) things like out of order execution and branch prediction and no wonder a compiler can't determine things statically! While we are not really targeting general purpose (though I would say we have the capability to evolve to it in the future) it seems painfully obvious to me where these issues have been in any general-leaning VLIW attempts in the past, and I can't understand the clinging to bad architectural decisions made 30 years ago by hardware folks who could not imagine the ability of software in the future. </rant>

Targeting general purpose from the get go is a bad idea, but it is NOT impossible to do efficiently and without sacrificing performance. You just need a well defined and constrained architecture, and a clean way to describe it.

You have your causality relations reversed: the reason that branch prediction and dynamic caches exist is that jump targets and working sets are hard to impossible to compute statically.

Even in the restricted world of HPC, GPGPUs have been moving from statically scheduled exposed pipeline VLIW machines to more conventional SIMD with caches, virtual memory and branch prediction (no meaningful OoO yet as the large amount of thread parallelism can hide the memory latency).

Also, GPGPUs have the benefit of having the large, lucrative GPU gaming market to pay for their development. How can a pure HPC machine be competitive in this market? Even for Intel, Xeon Phi is more of a prestige project than something actually meant to make money.

I've spent a long time debating with VLIW haters (a camp I presume you belong to), but I'd love to see any citations you have for your claim that my causality is reversed, as I have a ton of evidence (to be fair not published yet) going for my side. While not as generally applicable as our architecture, you can take a look at basically any DSP from the past 15 years and see that VLIW works great from a performance and efficiency standpoint when your data is in a constrained form. We're showing that a compiler can structure a lot of different types of data (and the code required to actually operate on it) effectively if there are enough constraints on the hardware. Fairly pointless to try to convince you without documentation on hand for all parties, but hope you'll take a look in a couple of months.

As far as market, we are going after a decent sized market where the customers care the most about efficiency and performance, and are not only willing but very eager to switch their current solutions for whatever is best. As the typical startup claims, we are able to do it for a fraction of the cost and in a fraction of the time as one of the big guys, and have a solution that is 10x better than what is out there. NVIDIA boasts that they spent $1 Billion developing the Pascal architecture, with them selling the Tesla series GPUs for it at $5,000+ a unit. We've shown we can prototype something that can theoretically beat it for under $2 million, and our hope/bet is that we can take it to market (and actually beat it by an order of magnitude) for less than $25 million. That's just HPC, which doesn't include the very interesting high end DSP area that is now using very expensive and power hungry FPGAs for wireless baseband solutions which we think are a very good fit for us.

Just to clarify: are you trying to compete with Nvidia, or with Intel? If you're going against GPUs, is your chip something that can run neural networks (better than Nvidia)?

VLIWs have been used very successfully as DSPs for a long time; I do not think anybody is debating that. It is outside that niche that they have repeatedly been found lacking.

I'm sure your architecture would work fine for a subset of HPC problems like those that are currently run on a traditional GPGPU, but even in the HPC world many problems are ill suited for a GPU (think particle transport).

Yeah, something like this is very much needed, but it's not the hard part. The software is the hard part. The software is the reason we have the multiple levels of cache we have now. Without solving the software challenges, there can be no challenger for the existing architectures.

It's interesting to note that convolutional neural nets (CNNs) are one solution to the software challenge. It's an imperfect solution, in the sense that CNNs are not as general purpose (at the same efficiency) and have strict data requirements for training, but it is a solution, and the big N are investing heavily to the point of designing ASICs.

Eventually, though, we need to solve the software problem. That will require rethinking programming languages.

Having written programs for this iteration of the REX Neo architecture, the architecture is not so dramatically different that programming languages will have to be rewritten. I'm not the smartest programmer in the world and I was able to figure out the assembly language fairly easily.

Some concepts, like how to manage concurrent data processing and thread communications, need to be handled carefully, but that's more at the level of 'standard library' than the compiler. There is a clear pathway to getting C working on the architecture, and a reasonable direction (that will need some fleshing out) to getting performance-enhancing optimization of something like LLVM IR.

I wouldn't expect the assembly language level to be too far off from the common paradigms. Where I'd expect the software challenges to be would be in managing large amounts of memory, if the application programmer must manage shuffling data between the local scratchpad, specific locations in foreign scratchpads that must be (manually?) DMA'd around, and DRAMs.

Our whole goal, as talked about in the software section of our website (and the ACM paper linked in it), is to have the scratchpads be entirely automated by our toolchain. While we want to allow for especially adventurous programmers to have full freedom with the scratchpads, existing and future programs written in C/C++/other languages supported in the future will handle memory allocation identically (from the programmer's perspective) to existing architectures.

One other thing to point out is that our actual addressing of a core's local scratchpad, as well as "foreign" scratchpads of other cores on the same chip and/or any other attached chip, is handled exactly the same. All memory operations are handled through the exact same load/store instructions as part of a global flat address map that is the same for all cores in a system (one or multiple chips interconnected).
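As a rough sketch of what a global flat address map can look like, the snippet below packs a chip id, core id, and scratchpad offset into one integer address and routes every access through the same `load`/`store` pair. The field widths, the `System` class, and all names here are assumptions made for illustration; the thread does not give the real layout.

```python
# Hypothetical field widths; NOT the real REX Neo address layout.
OFFSET_BITS = 16   # assumed: 64 KiB of scratchpad per core
CORE_BITS = 8      # assumed: up to 256 cores per chip

def encode(chip, core, offset):
    """Pack (chip, core, offset) into a single flat address."""
    return (chip << (CORE_BITS + OFFSET_BITS)) | (core << OFFSET_BITS) | offset

def decode(addr):
    """Split a flat address back into (chip, core, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    core = (addr >> OFFSET_BITS) & ((1 << CORE_BITS) - 1)
    chip = addr >> (CORE_BITS + OFFSET_BITS)
    return chip, core, offset

class System:
    """One shared address space: every core sees the same map."""

    def __init__(self):
        self._mem = {}  # sparse backing store keyed by flat address

    def store(self, addr, value):
        self._mem[addr] = value

    def load(self, addr):
        return self._mem.get(addr, 0)

system = System()
local = encode(chip=0, core=3, offset=0x10)    # "my own" scratchpad
foreign = encode(chip=1, core=5, offset=0x20)  # a core on another chip

# The same two operations serve both cases; only the address differs.
system.store(local, 111)
system.store(foreign, 222)
print(system.load(local), system.load(foreign))  # 111 222
print(decode(foreign))                           # (1, 5, 32)
```

In this model the programmer (or toolchain) never issues a separate "remote access" call; locality is just a property of the address bits.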

True. I shoulda added "so to speak", since this is a still more extreme approach and might simply break any compiler/language combination we have, as you say.

While we have been exploring some ideas on how to have better programming approaches to address the unique features of our architecture, we have from the beginning thought that we would be required to have some level of portability for existing applications. As of right now, we support standard C/C++ that runs through our Clang+LLVM backend, with the ability to support any language that has an LLVM frontend.

Personally, I find the actor model to be the easiest existing way to take advantage of things like our network on chip and having hard time guarantees on memory movement. That being said, right now our focus is on C and C++ along with our API and custom library ports.
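For readers unfamiliar with the actor model, here is a minimal sketch using plain threads and queues. Each actor owns a private mailbox and handles one message at a time, and actors interact only by sending messages. Reading an actor as a core and a mailbox as network-on-chip traffic is only a loose analogy; the `Actor` class is hypothetical and is not REX's API.

```python
import queue
import threading

class Actor(threading.Thread):
    """An actor: private mailbox, messages handled one at a time, in order."""

    def __init__(self, handler):
        super().__init__()
        self.mailbox = queue.Queue()
        self.handler = handler

    def send(self, msg):
        self.mailbox.put(msg)

    def run(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:      # sentinel: shut this actor down
                break
            self.handler(msg)

results = []
collector = Actor(results.append)                 # gathers final values
doubler = Actor(lambda x: collector.send(2 * x))  # transforms and forwards

collector.start()
doubler.start()

for n in [1, 2, 3]:
    doubler.send(n)

# Stop the pipeline front to back so no message is lost.
doubler.send(None)
doubler.join()
collector.send(None)
collector.join()

print(results)  # [2, 4, 6]
```

Only `collector` ever touches `results`, and `doubler` shares no state with it beyond the messages it sends, which is the property that makes the model a natural fit for cores with private memory.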

I recall an interview with someone formerly in upper management for the Itanium development project where he acknowledged that the most significant factor in the demise of Itanium was the exclusionary pricing structure Intel imposed on them.