Hacker News
by iskander 5524 days ago
>Olofsson has a new idea - or, specifically, a variation on an old one...it was common for a central processor to have a 'math co-processor' chip alongside it - a secondary processor which was designed specifically to carry out floating point arithmetic at speeds significantly faster than the main processor

This is exactly how people are using GPUs right now. How is this architecture better than a Fermi?

>"A guy straight out of college who's done a course in C programming can take a program and run it on our machine. There's no new constructs to run - you can take a program with legacy code and run it straight out of the box on our machine, and you can't do that on GPU."

If they're using only static compilation, this is very unlikely to be true. A few thousand Ph.D. theses have been sunk into parallelizing imperative programs. Despite the accumulation of sophisticated compiler techniques, it doesn't really work without extensive annotations and cooperation from the programmer. The programmer often ruins potential parallelism by accidentally creating dependencies between loop iterations. Even when analyzing ideal code, the program text doesn't contain sufficient information about the data size to create a good partition.
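To make the loop-dependency point concrete, here is a minimal sketch in C. The function names `scale` and `running_sum` are made up for illustration and have nothing to do with the article's hardware or toolchain. The first loop has independent iterations; the second looks just as innocent but carries a dependency from one iteration to the next, which is the kind of thing that defeats naive automatic parallelization.

```c
#include <stddef.h>

/* Independent iterations: out[i] depends only on in[i], so the
   iterations could run in any order, or on different cores. */
void scale(const double *in, double *out, size_t n, double k) {
    for (size_t i = 0; i < n; i++)
        out[i] = k * in[i];
}

/* Loop-carried dependency: iteration i reads a[i - 1], which was
   written by iteration i - 1. Splitting this loop across cores as
   written would produce wrong results. */
void running_sum(double *a, size_t n) {
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```

A running sum can be parallelized, but only by restructuring it as a parallel scan, which is a different algorithm and not something a compiler can be counted on to discover from the loop above.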

However, there's some small chance this isn't empty hype and they've actually made some cool breakthrough in runtime parallelization of imperative code. In that case, though, why would they be hyping vaporous hardware rather than just applying their fancy JIT compiler to existing multicore systems?

2 comments

"A few thousand Ph.D. theses have been sunk into parallelizing imperative programs."

This waste of talent continues to piss me off to no end. Why would people willingly spend time on this problem?

Because the reward of a breakthrough is extremely high.
I too doubt that they have anything new with regard to parallelization of imperative programs. If the only way to utilize all cores is either by writing functionally or by manual parallelization, then they don't really have a significant advantage over FPGA coprocessors. I do agree with them, however, when they say that this approach is better than creating lots of general-purpose cores.
What no one has mentioned is that 4000 stacks is a lot of memory.
Not if they're done as split/segmented stacks (http://gcc.gnu.org/wiki/SplitStacks). Basically, you have a collection of 4 KB stack pages for each thread instead of one large up-front allocation, and you grow it as needed. It costs a few instructions per function entry/exit, but the overall cost is negligible and it lets you run thousands of coroutines without issues.
If you allocate the stacks contiguously using mmap, then memory is only used as it is accessed. That's not the problem. The problem is that 4000 concurrent non-trivial threads is a resource hog no matter how the stacks are allocated.
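For anyone unfamiliar with the mmap approach, a rough sketch in C follows, assuming a Linux-like system with anonymous mappings. The helper names `reserve_stacks` and `stack_base` are hypothetical, and `MAP_NORESERVE` is an extra assumption here to avoid reserving swap up front. The call only reserves address space; physical pages get assigned when a stack is first touched.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Reserve one contiguous region holding `count` stacks of
   `stack_size` bytes each. Returns NULL on failure. Only virtual
   address space is reserved; the kernel assigns physical pages
   lazily, on first access. */
void *reserve_stacks(size_t count, size_t stack_size) {
    void *base = mmap(NULL, count * stack_size,
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                      -1, 0);
    return base == MAP_FAILED ? NULL : base;
}

/* Start address of the stack slot with the given index. */
char *stack_base(void *region, size_t index, size_t stack_size) {
    return (char *)region + index * stack_size;
}
```

A real implementation would also want guard pages between slots and would need to account for stacks growing downward. None of this addresses the scheduling and cache cost of thousands of live threads, which is the actual objection above.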