Hacker News
by zackmorris 1040 days ago
Since nobody asked, I'm reiterating my position that computers that can effectively utilize parallelism simply aren't available today. I've always wanted a computer with at least 256 cores and local content-addressable memories beside each core to send data where it's needed. By Moore's Law, we could have had MIPS machines with 1000 cores around 2010, and 100,000 to 1 million cores today, for under $1000.

Contrast that with GPU shaders, where one C-style loop operates on buffers separate from system memory and can't access system services like network sockets or files. GPUs have around 32 or 64 physical cores, so theoretically that many shaders could run simultaneously, although we rarely see that in practice. And we'd need bare-metal drivers to access the GPU cores directly. Does anyone know of any?

The closest thing now is Apple's M1 line, but it has specialized NN and GPU cores, so it missed out on the potential of true symmetric multiprocessing.

The reason I care about this so much is that with this amount of computing power, kids could run genetic algorithms and other "embarrassingly parallel" code that solves problems about as well as NNs in many cases. Instead we're going to end up with yet another billion-dollar bubble that locks us into whatever AI status quo the tech industry manages to come up with. And everyone seems to love it. It reminds me of the scene in Star Wars: Episode III when Padmé notes how liberty dies with thunderous applause.
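As a rough illustration of why genetic algorithms count as "embarrassingly parallel", here is a minimal sketch. Everything in it is an assumption made for illustration: the `fitness` and `evolve` names, the toy OneMax objective (count the 1 bits), and the parameter values. The point is only that each fitness evaluation is independent, so that step maps cleanly onto however many cores are available.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fitness(bits):
    """Toy objective (OneMax): the score is the number of 1 bits."""
    return sum(bits)


def evolve(pop_size=40, length=32, generations=60,
           mutation_rate=0.02, seed=0, workers=4):
    """Run a tiny genetic algorithm and return the best individual found."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(generations):
            # Each fitness call is independent of the others:
            # this is the embarrassingly parallel step.
            scores = list(pool.map(fitness, population))
            ranked = [ind for _, ind in sorted(zip(scores, population),
                                               key=lambda pair: pair[0],
                                               reverse=True)]
            parents = ranked[: pop_size // 2]  # keep the better half
            children = []
            while len(children) < pop_size - len(parents):
                mom, dad = rng.sample(parents, 2)
                cut = rng.randrange(1, length)  # one-point crossover
                child = mom[:cut] + dad[cut:]
                # Flip each bit with a small probability.
                child = [bit ^ 1 if rng.random() < mutation_rate else bit
                         for bit in child]
                children.append(child)
            population = parents + children
    return max(population, key=fitness)


if __name__ == "__main__":
    best = evolve()
    print(fitness(best), best)
```

Note that threads in CPython will not actually speed up CPU-bound work like this because of the global interpreter lock. Swapping in `ProcessPoolExecutor` (same `map` interface) is the usual way to spread the evaluations across real cores.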

8 comments

1) Amdahl's law means it's not useful to have hundreds of cores for general purpose computing. There's not that much parallel work to do in typical applications. Increasing the proportion of work that's parallelizable for a given application pays dividends when you have more cores - that's why Servo is so exciting. In some cases, picking an O(n^2) algorithm that's easy to parallelize will be faster than a less parallelizable O(n log n) algorithm - this is true for problems like Single-Source Shortest Paths (SSSP).

2) Shared resources (in-memory mutable data, hardware devices) mean the ratio of contention to CPU work goes up when you have more cores.

3) Cores on a single die need to share the same constraints - thermal limits and transistor count. So you're best off having enough powerful cores to get you to a sweet spot of single-core performance vs multi-core parallelism.

4) It's hard to provide a performant and useful many-core machine model. Cache coherence makes it easier to program a many-core machine, but limits performance. Without it, you're stuck with distributed systems-style problems.
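To put numbers on point 1, Amdahl's law gives the ideal speedup on n cores as 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel. The sketch below just evaluates that formula. The function name and the sample values of p are illustrative assumptions, and the model ignores the contention and thermal effects from points 2 and 3, so real results would be worse.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup under Amdahl's law: 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Sample parallel fractions, chosen only for illustration.
    for p in (0.50, 0.90, 0.99):
        row = ", ".join(f"{n} cores: {amdahl_speedup(p, n):.1f}x"
                        for n in (4, 8, 256, 1000))
        print(f"p={p:.2f} -> {row}")
```

With 90% of the work parallelizable, the speedup stays under 10x no matter how many cores are added, which is why raising p matters more than raising the core count.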

This exists now. Some AI accelerators are a grid of independent compute units with their own memory and message passing between them. Graphcore's IPU is an instance.

An AMD GPU is a grid of independent compute units on a memory hierarchy. At the fine grain, it's a scalar integer unit (branches, arithmetic) and a predicated vector unit, with an instruction pointer. Roughly 80 of those can be on a given compute unit at the same time, executed in some order and partially simultaneously by the scheduler. A GPU has on the order of 100 compute units, so that's ~8k completely independent programs running at the same time.

You've got a variety of programming languages available to work with that. There's a shared address space with other GPUs and the system processors, direct access to system and GPU local memory. Also some other memory you can use for fast coordination between small numbers of programs.

There's a bit of a disconnect between graphics shaders, the ROCm compute stack and what you can build on the hardware if so inclined. The future you want is here today, it just has a different name to what you expected.

OK, if I can transpile C/C++, Rust or TypeScript to that and have full access to memory, threads, system APIs, network sockets, etc, then that would work for the use cases I have in mind. Running MIMD processes on SIMD hardware is something I'm definitely interested in.

If there's no straightforward way to do that, then I'm afraid that hardware represents a huge investment in the wrong direction.

Because a GPU can be built from the general-purpose multicore CPU I'm talking about. But a CPU can't be built from a GPU.

What I'm getting at is that if I have to "drop down" to an orthodox way of solving problems, rather than being able to solve them in the freeform way that my instincts lead me, then I will always be stifled.

1000 cores?? I don't have 100 cores! What do you even need 10 cores for? Well, here's 4 cores. Give 2 to your brother. Don't go wasting all those hyper threads all at once!

Intel ca. 2010, probably

Also Intel: ECC memory support? In this economy?

Sorry, but we do have computers with 256 cores. I used to have this excuse back when processors only had 4 cores. When you consider that processors lower their turbo boost frequency as you use more cores and that there is overhead from synchronization, your 4 core processor may only give you a 2x performance benefit, at the expense of your code becoming difficult to reason about (depending on the problem at hand). Nowadays 8 core processors are quite cheap, below 200€. At a 4x performance boost, and easily 12x or more if you are willing to spend the money, it is definitely worth it.

The caveat of course is that there aren't actually that many programs that need the full power of your processor. The most common exception is a video game that was developed for a limited number of players, or even single player, but then the multiplayer version of the game becomes extremely popular and you get servers with 60 or even a hundred players, way beyond what the developers planned to support. Supporting multiple cores was not a priority, and then very suddenly it becomes the biggest bottleneck.

The real problem we are facing is that our programming models aren't parallel by default.

>By Moore's Law, we could have had MIPS machines with 1000 cores around 2010, and 100,000 to 1 million cores today, for under $1000.

https://corescore.store/

You can have 10000 RISC-V cores on an FPGA but nobody cares. Why? Because even a bit-serial processor (that means it processes one bit per clock cycle, or 32 clock cycles for a 32-bit addition) runs into memory bandwidth limitations very quickly if you have enough of them. Main memory is very slow compared to registers and caches. The only way to utilize this many cores is by having a workload that is entirely latency bound, meaning your memory access pattern is perfectly unpredictable. The moment you add caching, the number of cores you can have shrinks dramatically, and companies like AMD are not slimming down their CPUs; they are adding more and more cache. Their highest end processors have almost a gigabyte of cache.
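For anyone unfamiliar with the term, here is a minimal software sketch of what "bit-serial" means: a single 1-bit full adder reused once per clock cycle, so a 32-bit addition costs 32 cycles. The function name and the cycle counter are illustrative assumptions, and this simulates only the arithmetic, not any particular core.

```python
def bit_serial_add(a: int, b: int, width: int = 32) -> tuple[int, int]:
    """Add two unsigned integers one bit per simulated clock cycle.

    Returns (sum modulo 2**width, number of cycles used).
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    carry = 0
    result = 0
    cycles = 0
    for i in range(width):           # least significant bit first
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        total = bit_a + bit_b + carry  # 1-bit full adder
        result |= (total & 1) << i     # sum bit for this position
        carry = total >> 1             # carry into the next cycle
        cycles += 1
    return result, cycles


if __name__ == "__main__":
    print(bit_serial_add(5, 7))           # small values still take 32 cycles
    print(bit_serial_add(0xFFFFFFFF, 1))  # wraps around, carry out is dropped
```

The datapath is tiny, which is how thousands of such cores fit on one FPGA, but every one of them still has to be fed operands from the same memory.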

That's really awesome, thank you!

I agree about the programming models not being parallel by default, and that's one of the things that I specifically rail against in most of my comments. MATLAB/Octave is a good introduction to what parallel programming could be. I also rail against the endless doubling down on large caches, because the multicore design I have in mind would mostly eliminate cache and use that die area for cores and local memories.

I think we're slightly talking past each other here though. The CPU I want to build would have around 10-256 cores on 90s tech. So the same transistors holding 1 Pentium Pro would allow for 1-2 orders of magnitude more MIPS or RISC-V cores and local memories. The design is so simple that I think that's why it was missed by the big fabs.

Today there's little demand for 1000+ cores, but that's partly because nobody can see what they could do. But we can't design the thing, because the status quo has us all working pedal to the metal in first gear to make rent. It's a chicken and egg problem that has a lower likelihood of being solved as time goes on. Which is why I think we're on the wrong timeline, because if the system worked then actual innovation would become more accessible over time.

Intel Labs experimented with many low power cores vs. fewer, faster high power cores back around 2009-2010.

https://www.zdnet.com/article/experimental-intel-chip-could-...

Programmability is always the biggest issue, and that's not really a chicken-and-egg problem because decades of research have gone into writing compilers and languages for massively parallel machines -- it's just hard, some would say intractable (and local memories tend to make programmability issues worse). There are niche or embarrassingly parallel problems that will run great. But it's hard to sell hardware that will solve only some of your problems well. And GPUs have taken over for many of those very regular problems as well.

Arguing about where we should be based on a projection of an empirical exponential curve seems pretty irrational. Nothing in reality is exponential forever.

Typical GPUs easily have 6000+ shaders (aka kinda-sorta like cores) on the more expensive end.

At least, 6000+ 32-bit multiplies per clock tick on ~2GHz+ clocks. Even cheap GPUs easily have 2000+ shaders.

> GPUs have around 32 or 64 physical cores

NVidia SMs and AMD WGPs are not "cores", they are... weird things. They have many shaders inside of them and have huge amounts of parallelism.

As far as grunt-work goes, a "multiplier unit" (literally A x B) is perhaps the most accurate count to compare CPU cores vs GPU "cores", because the concept of CPU-core vs GPU WGP / SM is too weird and different to directly compare.

Split up that WGP / SM into individual multipliers... and also split up the ~3 64-bit multipliers or ~48 CPU SIMD multipliers per core (3x 512-bit on Intel AVX512 cores), and it's perhaps a more fair comparison point.

---------

Back 20 years ago, you'd only have 1x multiplier on a CPU core like a Pentium 4, maybe as many as 4x with the 128-bit SSE instructions.

But today, even 1x core from Intel (3x 512-bit SIMD) or 1x core from AMD (4x 256-bit SIMD) has many, many, many more parallel elements compared to a 2004-era CPU core.
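The multiplier counting above can be written out as a short calculation. This is only a sketch using the figures quoted in this comment (3x 512-bit and 4x 256-bit SIMD units, 128-bit SSE, 6000 shaders at 2 GHz) and assuming 32-bit elements. The function and variable names are made up for illustration, and CPU clock speed differences are ignored.

```python
def simd_lanes(units: int, vector_bits: int, element_bits: int = 32) -> int:
    """Number of parallel multiplier lanes across a core's SIMD units."""
    return units * (vector_bits // element_bits)


# Figures taken from the comment above, with 32-bit elements assumed.
intel_lanes = simd_lanes(3, 512)   # 3x 512-bit units
amd_lanes = simd_lanes(4, 256)     # 4x 256-bit units
sse_lanes = simd_lanes(1, 128)     # one 128-bit SSE unit, circa 2004

gpu_shaders = 6000
gpu_clock_hz = 2e9
gpu_multiplies_per_second = gpu_shaders * gpu_clock_hz

# CPU cores needed to match the GPU's lane count, ignoring clock speed.
intel_cores_to_match = gpu_shaders / intel_lanes

if __name__ == "__main__":
    print("Intel lanes per core:", intel_lanes)
    print("AMD lanes per core:", amd_lanes)
    print("SSE lanes per core:", sse_lanes)
    print("GPU multiplies per second:", gpu_multiplies_per_second)
    print("Intel cores to match 6000 shaders:", intel_cores_to_match)
```

By this count, one modern CPU core already holds tens of multiplier lanes, and the high-end GPU's lane count corresponds to roughly a hundred such cores rather than thousands.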

>NVidia SMs and AMD WGPs are not "cores", they are... weird things. They have many shaders inside of them and have huge amounts of parallelism.

They aren't weird things. They are the equivalent of CPU cores. By your logic CPU cores aren't CPU cores, "they are... weird things" because of SMT.

There is more weirdness here than just SMT.

The full crossbar, allowing each shader to individually issue a fetch from memory. The shared memory space is not like cache but instead is a shader-to-shader communication scratchpad.

Atomics support, coalescing atomics together.

-------

I mean hell: what is a core? Do remember that on SMs, every single shader (not SM) has its own instruction pointer.

Is the shader a core? No, not really. But SMs aren't a core either.

I wouldn't compare GPU and CPU architecture at all. They're just different. What I did above, breaking both down into individual multipliers and then counting them, seems like the best way forward, especially as we remain multiplier bound in practice.

Read a lot of this kind of post. Years ago I recall someone bleating for 8 cores when 1 or 2 was the norm. Now you want 256. Next generation will ask for thousands. All for nothing, because you have no idea what to do with it except give the handwaviest justifications. A computer's a tool to do an actual job. You can and probably do have more computing power on your desktop than all the world's supercomputers put together from the 1970s.

https://en.wikipedia.org/wiki/Cray_X-MP

   Price:    US$7.9 million in 1977 (equivalent to $38.2 million in 2022)
   Weight:   5.5 tons (Cray-1A)
   Power:    115 kW @ 208 V 400 Hz
   CPU:      64-bit processor @ 80 MHz
   Memory:   8.39 Megabytes (up to 1 048 576 words)
   Storage:  303 Megabytes (DD19 Unit)
   FLOPS:    160 MFLOPS

In 2070 it still won't be enough for you. It never will be enough.

Have you considered finding a Connection Machine?