Hacker News new | ask | show | jobs
by gravypod 10 days ago
What is more important than core count is how the caching architecture is laid out. They could lay out those 6k cuda cores in a layout which provides much larger blocks of cache to smaller number of cores. That would increase the memory bandwidth which would be better for inference.
1 comments

Sounds like the memory bandwidth is worse though;

> The memory is not as fast as dedicated GPU memory, but it is cheap enough while delivering enough bandwidth to run AI models locally.

Also "cheap while delivering enough" certainly sounds like someone is trying to temper expectations. It sounds like something sitting in-between GPU+VRAM inference and CPU+RAM one, not as a step above/besides GPU+VRAM.

Having slower memory may not actually lead to lower memory bandwidth. The cuda cores can be broken up into compute complexes which larger blocks of memory directly attached to the cores. These could be filled with read operations from the bulk system memory. You can start executing and then page the next batch of data in while compute is working. For LLMs you don't have much random memory access, you can sequence your accesses in blocks.

If these chips become popular I am sure you will see LLM architectures taking advantage of the parallelism.

> The cuda cores can be broken up into compute complexes which larger blocks of memory directly attached to the cores.

Perhaps in theory, but for the gb10 stuff the memory is all on the CPU die and connected to the GPU die via nvlink-c2c