


	by matanyall 856 days ago
	Groq Engineer here, I'm not seeing why being able to scale compute outside of a single card/node is somehow a problem. My preferred analogy is to a car factory: Yes, you could build a car with say only one or two drills, but a modern automated factory has hundreds of drills! With a single drill, you could probably build all sorts of cars, but a factory assembly line is only able to make specific cars in that configuration. Does that mean that factories are inefficient? You also say that H200s work reasonably well, and that's reasonable (but debatable) for synchronous, human interaction use cases. Show me a 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia.

8 comments

pbalcer 856 days ago

Just curious, how does this work out in terms of TCO (even assuming the price of a Groq LPU is $0)? What you say makes sense, but I'm wondering how you strike a balance between massive horizontal scaling vs vertical scaling. Sometimes (quite often in my experience) having a few beefy servers is much simpler/cheaper/faster than scaling horizontally across many small nodes.

Or did I get this completely wrong, and your solution enables use cases that are simply unattainable on mainstream (Nvidia/AMD) hardware, making the TCO argument less relevant?

tome 856 days ago

We're providing by far the lowest latency LLM engine on the planet. You can't reduce latency by scaling horizontally.

nickpsecurity 856 days ago

Distributed, shared memory machines used to do exactly that in HPC space. They were a NUMA alternative. It works if the processing plus high-speed interconnect are collectively faster than the request rate. The 8x setups with NVLink are kind of like that model.
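As a rough sketch of that condition (my own simplification, assuming one request is handled at a time and that compute and interconnect time simply add up; the function names are made up for illustration):

```python
def service_time_s(compute_s: float, interconnect_s: float) -> float:
    """Time one request occupies the cluster in this simplified model."""
    return compute_s + interconnect_s


def utilization(compute_s: float, interconnect_s: float,
                requests_per_s: float) -> float:
    """Fraction of each second spent serving requests (>= 1.0 means overload)."""
    return service_time_s(compute_s, interconnect_s) * requests_per_s


def keeps_up(compute_s: float, interconnect_s: float,
             requests_per_s: float) -> bool:
    """True if processing plus interconnect is faster than the request rate."""
    return utilization(compute_s, interconnect_s, requests_per_s) < 1.0


if __name__ == "__main__":
    # Illustrative numbers only: 4 ms compute, 1 ms interconnect, 100 req/s.
    print(keeps_up(0.004, 0.001, 100.0))
    # Same rate, but 8 ms compute and 4 ms interconnect.
    print(keeps_up(0.008, 0.004, 100.0))
```

A real system would pipeline and batch, so treat this as a sanity check on the inequality, not a capacity model.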

You may have meant that nobody has a stack that uses clustering or DSM with low-latency interconnects. If so, then that might be worth developing given prior results in other low-latency domains.

KaiserPro 855 days ago

> Distributed, shared memory machines used to do exactly that in HPC space.

reformed HPC person here.

Yes, but not latency optimised in the case here. HPC is normally designed for throughput. Accessing memory from outside your $locality is normally horrifically expensive, so only done when you can't avoid it.

For most serving cases, you'd be much happier having a bunch of servers with a number of groqs in them, than managing a massive HPC cluster and trying to keep it both up and secure. The connection access model is much more traditional.

Shared memory clusters are not really compatible with secure end-user access. It is possible to partition memory access, but it's something that's not off the shelf (well, that might have changed recently). Also, shared memory means shared fuckups.

I do get what you're hinting at, but if you want to serve low latency, high compute "messages" then discrete "APU" cards are a really good way to do it simply (assuming you can afford it). HPCs are fun, but it's not fun trying to keep them up with public traffic on them.

nickpsecurity 855 days ago

It would probably be a cluster of thin nodes with GPUs or low-cost accelerators over a low-latency interconnect. The DSM would be layered on top of that. The AI cluster would handle processing, with security, etc. done more by other components. They’re usually layered.

I agree it’s harder to manage, with less fine-grained security. People were posting Groq chips at $20k each, though. With that, we’re talking about whether the management of it is worth it for installations costing six or more digits. That might be more justifiable if an alternative saves them a good chunk of six or more digits.

Their main advantage is a solution that’s ready to go :)

tome 856 days ago

I think existing players will have trouble developing a low latency solution like us whilst they are still running on non-deterministic hardware.

nickpsecurity 856 days ago

While you’re here, I have a quick, off-topic question. We’ve seen incredible results with GPT3-175B (Davinci) and GPT4 (MoE). Making attempts at open models that reuse their architectural strategies could have a high impact on everyone. Those models took 2500-25000 GPUs to train, though. It would be great to have a low-cost option for pre-training Davinci-class models.

It would be great if a company or others with AI hardware were willing to do production runs of chips sold at cost specifically to make open, permissive-licensed models. As in, since you’d lose profit, the cluster owner and users would be legally required to only make permissive models. Maybe at least one in each category (e.g. text, visual).

Do you think your company or any other hardware supplier would do that? Or would someone sell 2500 GPUs at cost for open models?

(Note to anyone involved in CHIPS Act: please fund a cluster or accelerator specifically for this.)

tome 856 days ago

Great idea, but Groq doesn't have a product suitable for training at the moment. Our LPUs shine in inference.

WanderPanda 856 days ago

What do you mean by non-deterministic hardware? cuBLAS on a laptop GPU was deterministic when I tried it last iirc

frozenport 856 days ago

Tip of the iceberg.

DRAM needs to be refreshed every X cycles.

This means you don't know the time it takes to read from memory. You could be reading at a refresh cycle. This circuitry also adds latency.
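A toy model of that effect, with made-up cycle counts (the interval, refresh duration, and base latency below are illustrative assumptions, not taken from any datasheet): a read that arrives during a refresh window waits for it to finish, so the latency of any single read depends on when it arrives.

```python
import random

# Hypothetical, illustrative numbers (not from any real DRAM part).
REFRESH_INTERVAL = 7800   # cycles between the start of two refreshes
REFRESH_DURATION = 350    # cycles the bank is busy refreshing
BASE_READ_LATENCY = 50    # cycles for a read that hits no refresh


def read_latency(arrival_cycle: int) -> int:
    """Latency in cycles of a read arriving at `arrival_cycle`."""
    phase = arrival_cycle % REFRESH_INTERVAL
    if phase < REFRESH_DURATION:
        # The read must wait for the rest of the refresh window first.
        return (REFRESH_DURATION - phase) + BASE_READ_LATENCY
    return BASE_READ_LATENCY


def sample_latencies(n: int, seed: int = 0) -> list:
    """Latencies of `n` reads arriving at pseudo-random cycles."""
    rng = random.Random(seed)
    return [read_latency(rng.randrange(10**9)) for _ in range(n)]


if __name__ == "__main__":
    latencies = sample_latencies(100_000)
    print("min:", min(latencies), "max:", max(latencies))
```

Most reads see the base latency, but the worst case is several times larger, which is the kind of timing spread being described.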

tome 856 days ago

Non-deterministic timing characteristics.

huac 856 days ago

> 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia.

I believe that this is doable - my pipeline is generally closer to 400ms without RAG and with Mixtral, with a lot of non-ML hacks to get there. It would also definitely be doable with a joint speech-language model that removes the transcription step.

For these use cases, time to first byte is the most important metric, not total throughput.
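To make the distinction concrete, here is a minimal sketch of measuring time to first token separately from throughput on any token stream. `measure_stream` and `fake_stream` are names invented for this example, and the injectable `clock` parameter is only there to make it testable.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and report first-token latency and throughput."""
    start = clock()
    first: Optional[float] = None
    count = 0
    for _ in tokens:
        if first is None:
            # Time until the first token arrived (what the user perceives).
            first = clock() - start
        count += 1
    total = clock() - start
    rate = count / total if total > 0 else float("inf")
    return {"ttft_s": first, "total_s": total,
            "tokens": count, "tokens_per_s": rate}


def fake_stream(n: int, first_delay_s: float, gap_s: float) -> Iterator[str]:
    """Stand-in for a model: slow first token, then a steady stream."""
    time.sleep(first_delay_s)
    for i in range(n):
        if i:
            time.sleep(gap_s)
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(fake_stream(20, first_delay_s=0.05, gap_s=0.005)))
```

Two systems with the same tokens per second can feel very different in conversation if one has a much lower `ttft_s`.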

qeternity 856 days ago

It’s important…if you’re building a chatbot.

The most interesting applications of LLMs are not chatbots.

chasd00 856 days ago

> The most interesting applications of LLMs are not chatbots.

What are they then? Every use case I’ve seen is either a chatbot or like a copy editor which is just a long form chatbot.

jasonjmcghee 856 days ago

Obviously not op, but these days LLMs can be fuzzy functions with reliably structured output, and are multi-modal.

Think about the implications of that. I bet you can come up with some pretty cool use cases that don't involve you talking to something over chat.

One example:

I think we'll be seeing a lot of "general detectors" soon. Without training or predefined categories, get pinged when (whatever you specify) happens. Whether it's a security camera, web search, event data, etc
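A rough sketch of what such a "fuzzy function" could look like. Everything here is hypothetical: `ask_model` stands in for whatever model call you use (it is not a real library API), and the JSON shape is just one possible convention.

```python
import json
from typing import Callable


def detect(event: str, condition: str,
           ask_model: Callable[[str], str]) -> bool:
    """Ask a model whether `event` satisfies a free-form `condition`.

    `ask_model` is a hypothetical callable: prompt text in, raw reply out.
    The reply is expected to be JSON of the form {"match": true|false}.
    """
    prompt = (
        "Answer only with JSON of the form {\"match\": true} or "
        "{\"match\": false}.\n"
        f"Condition: {condition}\n"
        f"Event: {event}\n"
    )
    raw = ask_model(prompt)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model reply was not JSON: {raw!r}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("match"), bool):
        raise ValueError(f"model reply had the wrong shape: {raw!r}")
    return data["match"]


if __name__ == "__main__":
    # Stub in place of a real model, so the sketch runs offline.
    def always_yes(prompt: str) -> str:
        return '{"match": true}'

    print(detect("Person climbs over the fence at 02:14",
                 "someone enters the yard at night", always_yes))
```

The validation step is what makes the output usable as a function result: anything that is not the agreed shape is rejected instead of being passed along.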

nycdatasci 856 days ago

Complex data tagging/enrichment tasks.

throwaway2037 856 days ago

> The most interesting applications of LLMs are not chatbots.

In your opinion, what are the most interesting?

treprinum 856 days ago

> Show me a 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia

I built one, should be live soon ;-)

tome 856 days ago

Exciting! Looking forward to seeing it.

startupsfail 856 days ago

I have one, with 13B, on a 5-year-old 48GB Q8000 GPU. It can also see; it’s LLaVA. And it is very important that it is local, as privacy is important and streaming images to the cloud is time consuming.

You only need a few tokens, not the full 500-token response, to run TTS. And you can pre-generate responses online while ASR is still in progress. With a bit of clever engineering the response starts with virtually no delay, the moment it's natural to start the response.
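A minimal sketch of the first idea, assuming the model yields text tokens one at a time: group them into short phrases and hand each phrase to TTS as soon as it is complete, instead of waiting for the whole response. `speakable_chunks` and its parameters are made up for this illustration, and the TTS call itself is left out.

```python
from typing import Iterable, Iterator, List


def speakable_chunks(tokens: Iterable[str], min_words: int = 4,
                     breakers: str = ".,;:!?") -> Iterator[str]:
    """Yield phrases as soon as they end on punctuation and are long enough."""
    buf: List[str] = []
    for tok in tokens:
        buf.append(tok)
        text = "".join(buf).strip()
        long_enough = len(text.split()) >= min_words
        if text and text[-1] in breakers and long_enough:
            yield text          # TTS could start speaking this right away
            buf = []
    tail = "".join(buf).strip()
    if tail:
        yield tail              # whatever is left when the stream ends


if __name__ == "__main__":
    stream = ["Sure", ",", " I", " can", " help", " with", " that", ".",
              " What", " is", " it", "?"]
    for phrase in speakable_chunks(stream):
        print(repr(phrase))
```

Lowering `min_words` starts speech earlier at the cost of choppier phrasing; the right trade-off depends on the TTS engine.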

yaknh 856 days ago

Did you find anything cheaper for local installation?

jrflowers 856 days ago

>Show me a 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia.

Is your version of that on a different page from this chat bot?

mlazos 856 days ago

You can’t scale horizontally forever because of communication. I think HBM would provide a lot more flexibility with the number of chips you need.

fennecbutt 856 days ago

Are there voice responses in the demo? I couldn't find em?

tome 856 days ago

Here's a live demo on CNN of Groq plugged into a voice API

https://www.youtube.com/watch?v=pRUddK6sxDg&t=235s

fennecbutt 855 days ago

Thanks, that's pretty impressive. I suppose with blazing fast token generation now things like diarisation and the actual model are holding us back.

Once it flawlessly understands when it is being spoken to/if it should speak based on the topic at hand (like we do) then it'll be amazing.

I wonder if ML models can feel that feeling of wanting to say something so bad but having to wait for someone else to stop talking first ha ha.

sinuhe69 855 days ago

Wow! Absolutely astounding!

chaunnyong 854 days ago

Hi Matanyal, we worked with Groq for a project last year. Would you be open to connecting on LinkedIn? :)