by kgeist 25 days ago

I administer a simple AI server in the office, which just uses a single RTX 5090 but is able to serve ~80 people throughout the day. I'm impressed by Qwen3.6-27b's capabilities in agentic coding/tasks so far. Devs say it's not much different from Sonnet 4.6 on many tasks (sometimes it even outperformed it), 40-60 tok/sec, up to 260k context. The server cost about $10k with all the bells and whistles.

I spent a lot of time researching/adding/benchmarking many custom modifications to the software stack and its settings to make the server optimally handle the load with just 1 RTX 5090 without losing quality, but it's still not enough, and the wait times in the queue are getting longer. We're at the limits of the hardware, and I'm out of tricks.

The experiment was kind of a success, and the CTO agrees we should scale it. With our own infra, we could run agents 24/7 on everything. Currently, a lot of use cases are completely blocked on the cloud providers by PII/trade secret concerns (our infosec department doesn't buy the "zero retention" promise). Plus, with our own hardware you don't have to think about billing/budgets/etc. anymore.

Now I can't decide how to scale it. On one hand, I'd like to run larger models. And we have the budget to buy, say, 8xH200. But in many benchmarks, the larger models that fit comfortably in 8xH200 and can serve many parallel requests with acceptable speed/quality don't seem to outperform Qwen3.6 in agentic coding/tasks by enough to justify the price.

So another option is just to buy a bunch of RTX 6000s and scale horizontally instead: run a copy of a midrange LLM like Qwen3.6 on each GPU. It's cheaper and easier to scale/replace, but then we'll run into problems running larger models in the future if we have to, because of no NVLink support (say, if Alibaba & Co. stop releasing ~30b models and/or ~30b models start falling behind 400b+ models considerably)
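To make the horizontal option concrete, here is a minimal sketch of how requests could be spread across per-GPU copies of the model: each request goes to the replica with the fewest requests in flight. The `Replica` and `LeastBusyRouter` names and the `gpuN.internal` URLs are made up for illustration; this is not how any particular inference server or proxy does it.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One model copy pinned to one GPU (hypothetical endpoint)."""
    url: str
    in_flight: int = 0


@dataclass
class LeastBusyRouter:
    """Hypothetical router: send each request to the least-loaded replica."""
    replicas: list = field(default_factory=list)

    def acquire(self) -> Replica:
        # min() returns the first replica with the lowest in-flight count,
        # so ties are broken by list order.
        replica = min(self.replicas, key=lambda r: r.in_flight)
        replica.in_flight += 1
        return replica

    def release(self, replica: Replica) -> None:
        # Call this when the request finishes (or fails).
        replica.in_flight -= 1


# Four GPUs, one model copy each (assumed hostnames).
router = LeastBusyRouter([Replica(f"http://gpu{i}.internal:8000") for i in range(4)])

first = router.acquire()    # all idle, so the first replica is picked
second = router.acquire()   # first is busy, so the next idle one is picked
router.release(first)       # first request done
third = router.acquire()    # first replica is idle again and gets reused
```

With long agentic contexts, a real setup would probably also want some session affinity so that a conversation keeps hitting the replica that already holds its cached prefix; plain least-busy routing ignores that.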

Does anyone here have experience running large models in a multi-GPU setup with several RTX 6000s in a high-concurrency regime and with large context lengths? (something like Deepseek 4 Flash, Minimax 2.7 etc.)

11 comments

anon373839 25 days ago

> our infosec department doesn't buy the "zero retention" promise

They are wise to be skeptical! It is neither a promise nor zero data retention.

Look at Anthropic's Zero Data Retention policy -- and remember, this is the policy that applies only to the enterprise partners who can even qualify for a ZDR agreement with Anthropic:

> When ZDR is enabled, prompts and model responses generated during Claude Code sessions are processed in real time and not stored by Anthropic after the response is returned, *except where needed to comply with law or combat misuse*.

> Even with ZDR enabled, Anthropic may retain data where required by law or to address Usage Policy violations. If a session is flagged for a policy violation, *Anthropic may retain the associated inputs and outputs for up to 2 years*....

This means that Anthropic is actively inspecting all of your data with machine learning classifiers. When the usage is flagged for whatever reason as violating any aspect of Anthropic's Usage Policy, then they get to keep your data for 2 years, with no apparent limitation on what they can then use it for.

Crucially, you have ZERO guarantees about the sensitivity or specificity of these classifiers. For all anyone knows, Anthropic is silently flagging 75% of queries and retaining the data.
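To see why the specificity matters so much, here is a rough sketch of the arithmetic. The share of traffic that gets flagged is the true positives plus the false positives. Every number below is hypothetical and picked only to show the range; none of it comes from Anthropic or from any published figure.

```python
def flagged_fraction(violation_rate: float, sensitivity: float, specificity: float) -> float:
    """Share of all requests a classifier flags.

    violation_rate: fraction of requests that really violate the policy
    sensitivity:    fraction of real violations that get flagged
    specificity:    fraction of benign requests that are NOT flagged
    """
    true_positives = sensitivity * violation_rate
    false_positives = (1.0 - specificity) * (1.0 - violation_rate)
    return true_positives + false_positives


# Hypothetical inputs: 1% of requests are real violations, 95% of those are caught.
strict = flagged_fraction(violation_rate=0.01, sensitivity=0.95, specificity=0.99)
loose = flagged_fraction(violation_rate=0.01, sensitivity=0.95, specificity=0.25)

print(f"specificity 0.99 -> {strict:.1%} of requests flagged")
print(f"specificity 0.25 -> {loose:.1%} of requests flagged")
```

With the same assumed violation rate and sensitivity, the flagged share moves from roughly 2% to roughly 75% depending only on the specificity, which is exactly the number customers are not told.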