by nickpsecurity 852 days ago
Distributed, shared memory machines used to do exactly that in HPC space. They were a NUMA alternative. It works if the processing plus high-speed interconnect are collectively faster than the request rate. The 8x setups with NVLink are kind of like that model.

You may have meant that nobody has a stack that uses clustering or DSM with low-latency interconnects. If so, then that might be worth developing given prior results in other low-latency domains.
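The "faster than the request rate" condition above can be sketched as a simple capacity check. This is only an illustration: the function name, parameters, and the numbers in the usage note are hypothetical, and it assumes each request costs one compute step plus one interconnect hop with no queueing effects.

```python
def keeps_up(request_rate_hz, compute_s, interconnect_s, parallel_units=1):
    """Return True if the cluster can serve requests faster than they arrive.

    Simplifying assumptions (not from the thread): every request pays a fixed
    compute time plus a fixed interconnect time, and parallel_units identical
    units work on independent requests.
    """
    service_time_s = compute_s + interconnect_s      # time to finish one request
    service_rate_hz = parallel_units / service_time_s  # requests finished per second
    return service_rate_hz > request_rate_hz
```

For example, with 0.5 ms of compute and 0.2 ms of interconnect time per request, a single unit handles 1,000 requests/s but not 2,000 requests/s; eight units in parallel (in the spirit of the 8x NVLink setups) would handle both.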


> Distributed, shared memory machines used to do exactly that in HPC space.

reformed HPC person here.

Yes, but those machines weren't optimised for latency, which is what matters here. HPC is normally designed for throughput. Accessing memory from outside your $locality is normally horrifically expensive, so it's only done when you can't avoid it.

For most serving cases, you'd be much happier having a bunch of servers with a number of groqs in them, than managing a massive HPC cluster and trying to keep it both up and secure. The connection access model is much more traditional.

Shared memory clusters are not really compatible with secure end-user access. It is possible to partition memory access, but it's something that's not off the shelf (well, that might have changed recently). Also, shared memory means shared fuckups.

I do get what you're hinting at, but if you want to serve low-latency, high-compute "messages" then discrete "APU" cards are a really good way to do it simply (assuming you can afford it). HPCs are fun, but it's not fun trying to keep them up with public traffic on them.

It would probably be a cluster of thin nodes with GPUs or low-cost accelerators over a low-latency interconnect. The DSM would be layered on top of that. The AI cluster would handle processing, with security, etc. done more by other components. They’re usually layered.

I agree it’s harder to manage, with less fine-grained security. People were posting Groq chips at $20k each, though. With that, we’re talking about whether the management of it is worth it for installations costing six or more digits. That might be more justifiable if an alternative saves them a good chunk of six or more digits.

Their main advantage is a solution that’s ready to go :)

I think existing players will have trouble developing a low latency solution like us whilst they are still running on non-deterministic hardware.
While you’re here, I have a quick, off-topic question. We’ve seen incredible results with GPT-3 175B (Davinci) and GPT-4 (MoE). Making attempts at open models that reuse their architectural strategies could have a high impact on everyone. Those models took 2,500-25,000 GPUs to train, though. It would be great to have a low-cost option for pretraining Davinci-class models.

It would be great if a company or others with AI hardware were willing to do production runs of chips sold at cost specifically to make open, permissive-licensed models. As in, since you’d lose profit, the cluster owner and users would be legally required to only make permissive models. Maybe at least one in each category (e.g. text, visual).

Do you think your company or any other hardware supplier would do that? Or that someone would sell 2,500 GPUs at cost for open models?

(Note to anyone involved in CHIPS Act: please fund a cluster or accelerator specifically for this.)

Great idea, but Groq doesn't have a product suitable for training at the moment. Our LPUs shine in inference.
What do you mean by non-deterministic hardware? cuBLAS on a laptop GPU was deterministic the last time I tried it, IIRC.
Tip of the iceberg.

DRAM needs to be refreshed every X cycles.

This means you don't know the time it takes to read from memory. You could be reading at a refresh cycle. This circuitry also adds latency.

OP says SRAM, which doesn't decay, so it needs no refreshing.
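The DRAM vs. SRAM point above can be sketched with a toy timing model. Everything here is an assumption for illustration: the constants are only roughly in the range of typical DDR4 refresh figures, the same base read time is used for both memories to isolate the refresh effect, and the function names are hypothetical. It is not a model of any real controller.

```python
import random

# Illustrative constants in nanoseconds (assumed, not taken from the thread).
T_REFI_NS = 7800    # a refresh starts every T_REFI_NS
T_RFC_NS = 350      # each refresh blocks reads for T_RFC_NS
BASE_READ_NS = 50   # read time when nothing is in the way

def dram_read_latency_ns(arrival_ns):
    """Latency of a read that arrives at arrival_ns; it stalls if a refresh is in progress."""
    phase = arrival_ns % T_REFI_NS
    stall = T_RFC_NS - phase if phase < T_RFC_NS else 0
    return BASE_READ_NS + stall

def sram_read_latency_ns(arrival_ns):
    """SRAM has no refresh, so in this model every read takes the same time."""
    return BASE_READ_NS

def latency_spread(read_fn, n=10_000, seed=0):
    """Return (min, max) latency over n reads at pseudo-random arrival times."""
    rng = random.Random(seed)
    samples = [read_fn(rng.randrange(0, 1_000_000)) for _ in range(n)]
    return min(samples), max(samples)
```

In this model the SRAM spread is a single value, while the DRAM spread depends on whether a read happens to land inside a refresh window, which is the kind of timing uncertainty being described.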
Timing can also simply refer to the FETs that make up the logic circuits of a chip. Each transition from high to low or low to high has a minimum safe time to register properly...
Non-deterministic timing characteristics.