by simplyluke 23 days ago
> that you can run locally

That's doing a lot of work here.

The future I see isn't most companies buying hundreds of thousands of dollars in hardware to run models, it's them adding a line item to their AWS bill. Inference costs on the larger hosted open source models are dramatically lower than the frontier labs' API pricing.


The future I'm seeing is AI coprocessors running inference locally in most devices that today have a CPU. Just look at how powerful your mobile phone has become compared to your desktop computer 15 years ago and compared to a mainframe 30 years ago.

The days of requiring a data center to run anything resembling opus 4.6 are already counted. (But the industry will fight hard to get people to keep paying the Claude tax.)

I'm already running a Google TPU over USB on an otherwise very cheap board to do local computer vision on a front-door camera, since I wanted to get away from Ring and other cloud services for that use case.

And yeah, that may be the world a decade or so from now, but we're in the mainframe era of the frontier models. It's going to be more economical for basically any consumer, and most businesses, to pay someone else to host models for quite a while.

A gaming PC can already host models that perfectly serve casual users who just want recipes, todo tracking, picture identification, etc. E.g. Qwen 3.6 35b which will run on a $650 GPU at 75 t/s (Nvidia 1660 ti 16GB).

Said model will also run as a tool-calling coding model excellently (it's no Opus, but for a thing that once set up is just the cost of energy, it's incredible). It can type faster than you can, probably 10x faster, so with guidance it'll make you faster. And it's free.

It's here. If folks want ChatGPT without a subscription, they can have it today on their computer. The only money to be made is in the high end models doing "serious business" work spanning 1M+ token contexts and massive uncertainty. Everything else is already set to be eaten by today's local models.

The problem with models like Qwen 3.6 35B (which really is an excellent model) is that my expectations of what a model can do have gone SO high now.

Here's a prompt I just ran against Claude Opus 4.7:

> Use python3 to experiment with whether the SQLite3 authorizer mechanism can be used to detect an INSERT OR REPLACE based just on running an explain query without examining the SQL string itself

Opus nailed it: https://claude.ai/share/c4212606-3fee-4b7c-bc97-505e0348ccac

I tried the same thing against qwen/qwen3.5-35b-a3b running locally in lmstudio, with the Pi coding agent. At first it looked like it was going to do great! And then it fell apart over the course of several tool calls: https://gisthost.github.io/?8ae2f842df619fb7fd8f1ccd82fe41c7

I'm used to GPT-5.5 and Opus 4.7 handling that kind of prompt without any problems at all.

Something is definitely going wrong with your Qwen setup. In the link you posted, the session starts and ends with a compaction step due to a 4k token context limit. Qwen 35b supports, I think, a context of 200k+ tokens (though I only run with 128k), so the 4k limit seems to be a major source of the problem.
Good call, I need to check if LM Studio is misconfigured.
This worked for me with qwen3.6-36b-a3b even at a q4 quant. I ran pi in a Docker container and it had to figure out how to install python as well. I used the same initial prompt you had, without anything additional. You talked about Qwen 3.6, but then said you tried Qwen 3.5 in lmstudio. Not sure if you meant Qwen 3.6. I ran with llama.cpp llama-server with the recommended settings from unsloth.

I'm not an expert in SQLite so I can't say if this is 100% correct, but it seemed directionally similar to the conclusion from Claude.

  ### TL;DR
  
  - Authorizer + EXPLAIN:  No — authorizer only sees SQLITE_INSERT, not VDBE opcodes
  - EXPLAIN opcode analysis alone:  Yes — Delete opcode at position 10 is the unique signature of INSERT OR REPLACE / REPLACE
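
For anyone who wants to poke at this themselves, here is a minimal sketch of the experiment in python3. It is not the transcript from either model. The table `t`, its schema, and the helper name `probe` are made up for illustration. It records which authorizer action codes fire while an `EXPLAIN` is prepared, and collects the opcode column of the `EXPLAIN` output so the two statements can be compared. Whether a `Delete` opcode shows up, and at which address, may depend on the schema and the SQLite build, so treat the printed result as something to inspect rather than a guarantee.

```python
import sqlite3


def probe(sql):
    """Run EXPLAIN on `sql`; return (authorizer action codes, VDBE opcode names)."""
    # isolation_level=None avoids implicit BEGIN statements, which would
    # add unrelated transaction callbacks to the authorizer log.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    # Hypothetical schema: a rowid table with an extra UNIQUE column.
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

    actions = []

    def authorizer(action, arg1, arg2, db_name, source):
        # Record every action code and allow everything.
        actions.append(action)
        return sqlite3.SQLITE_OK

    # Installed after CREATE TABLE so only the EXPLAIN statement is logged.
    conn.set_authorizer(authorizer)
    rows = conn.execute("EXPLAIN " + sql).fetchall()
    conn.close()

    # EXPLAIN rows are (addr, opcode, p1, p2, p3, p4, p5, comment).
    opcodes = [row[1] for row in rows]
    return actions, opcodes


plain_actions, plain_ops = probe("INSERT INTO t (id, name) VALUES (1, 'a')")
replace_actions, replace_ops = probe(
    "INSERT OR REPLACE INTO t (id, name) VALUES (1, 'a')"
)

print("authorizer saw INSERT (plain):  ", sqlite3.SQLITE_INSERT in plain_actions)
print("authorizer saw INSERT (replace):", sqlite3.SQLITE_INSERT in replace_actions)
print("authorizer saw DELETE (replace):", sqlite3.SQLITE_DELETE in replace_actions)
print("Delete opcode (plain):  ", "Delete" in plain_ops)
print("Delete opcode (replace):", "Delete" in replace_ops)
```

If the authorizer log looks the same for both statements while the opcode lists differ, that matches the TL;DR above: the signal is in the bytecode, not in the authorizer callbacks.
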
I can't help but think the not-so-distant future will see language models expected on commodity personal computing devices.
OK that's a very good answer! Do you mind sharing the transcript?
So one of the prominent LLM advocates known for testing every model shared a prompt intended to exhibit Opus 4.7 capabilities, and Qwen 3.6 sorted it out okay? Interesting.

Not saying they're equivalent, local models still decohere much quicker as the context grows in my experience. But... Interesting.

That's when you build a better Ralph loop around your LLM so it converges to an answer instead of relying on one-shots.
> a thing that once set up is just the cost of energy

I don't think we can discount this, frankly. Newer electronics are energy efficient, but older devices are more energy-intensive, and unless configured well, a gaming PC can easily use a few dollars a month in electricity, so now you're approaching subscription territory. A subscription comes with no upfront cost, higher reliability, no wasted space in your home, mobile apps, etc. (and less privacy).
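
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The wattage, daily hours, and electricity price are illustrative assumptions, not measurements from anyone in this thread, and `monthly_energy_cost` is a made-up helper name. Plug in your own figures.

```python
def monthly_energy_cost(watts, hours_per_day, price_per_kwh, days=30):
    """Estimated electricity cost for one month of use.

    watts:          average draw while the machine is running (assumed)
    hours_per_day:  hours of use per day (assumed)
    price_per_kwh:  electricity price in your currency per kWh (assumed)
    """
    kwh = watts / 1000 * hours_per_day * days
    return kwh * price_per_kwh


# Assumed example: 300 W average draw, 4 hours a day, 0.15 per kWh.
example_cost = monthly_energy_cost(watts=300, hours_per_day=4, price_per_kwh=0.15)
print(f"{example_cost:.2f} per month")
```

A machine left idling around the clock changes the picture quickly, since the hours term dominates.
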

Curious why you went for a custom solution. I am aware of at least one company that seems to ship devices with local computer vision (Reolink).
My experience over the past decade has been getting burned again and again by relying on one provider's ecosystem after another. This is great until Reolink starts doing something shady to pad the bottom line, and then it's on to the next.

I wanted the ability to run whatever cameras on a VLAN and own the stack.

I'm guessing that they are using Frigate, which is an OSS NVR. It supports a little add-on USB stick you can buy for about $30 that will run common computer vision tasks for object detection. Stuff that we've been able to do with WebAssembly and Canvas for a long time now.
> But the industry will fight hard to get people to keep paying the Claude tax.

I bet this will ironically be couched in "safety" reasons or regulation to get anti-AI folks on board, even if it favors the large incumbents.

Counted but not yet numbered?
Even when run on datacenters, it would be like current day webhosting. It is hyper competitive and it will be a race to the bottom. There is money to be made but not as much as investors hope. There will be datacenters in random countries like Kazakhstan because some oligarchs have found a free energy glitch (like with bitcoin mining).
Magical thinking. I guess if your phone is going to have 128gb of DDR5 then sure. You people fundamentally don't understand the memory requirements for running inference. Your cute local models seem good enough because you have no standards and anything an LLM produces seems like magic to you.
> Magical thinking. I guess if your phone is going to have 128gb of DDR5 then sure.

Why would it not? The typical new phone today has 16gb of RAM. 20 years ago that was somewhere around 32mb. Factor 512. It's not hard to see that we'll get there rather soon, especially if there is an application that provides demand.

> You people fundamentally don't understand the memory requirements for running inference.

You seem to be overlooking how fast things change in this industry, especially if tons of money can be made as a consequence.

> Your cute local models seem good enough because you have no standards and anything an LLM produces seems like magic to you.

Please don't generalize. I'm an avowed AI skeptic and have to deal with the bad consequences of AI slop every day. But you can't deny that there are enough application areas where people have use cases, and those will be much easier if things don't need a few round trips to a data center that sucks all the electricity and water out of neighboring communities.

Eh, you're off by an order of magnitude or so on both ends.

The iPhone 17 has like 8 gb, the Pixel 10 12.

The original iPhone was 128mb, and the iPhone 6 from 2016-2018 was around 1gb; that puts the iPhone at around 8x RAM per decade, and puts us at 128gb in our pockets at around 2036 or so.

(Incidentally, the big news in phone RAM is that a lot of new phones are dropping back to 4gb because of RAM shortages.)
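
The extrapolation above is easy to redo with different starting points. A small sketch, assuming the 8x-per-decade growth rate holds and taking 2025 as the base year for the iPhone 17 and Pixel 10 figures (the base year and the helper name `year_reaching` are my assumptions, not from the comment):

```python
import math


def year_reaching(target_gb, base_gb, base_year, growth_per_decade=8.0):
    """Year at which RAM hits target_gb under constant exponential growth."""
    decades = math.log(target_gb / base_gb) / math.log(growth_per_decade)
    return base_year + 10 * decades


# Assumed starting points: 8 GB and 12 GB phones in 2025.
from_8gb = year_reaching(128, base_gb=8, base_year=2025)
from_12gb = year_reaching(128, base_gb=12, base_year=2025)
print(f"from 8 GB:  {from_8gb:.1f}")
print(f"from 12 GB: {from_12gb:.1f}")
```

Starting from 12 GB this lands in the mid-2030s, and starting from 8 GB it lands a couple of years later, so the estimate is sensitive to which phone you treat as typical.
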

> it's them adding a line item to their AWS bill

That's the future Amazon sees too. We just had a week long session with the AWS team and they pushed that to us multiple times.

Buying "hundreds of thousands in hardware" sounds like a lot but many companies - especially software companies - already do that if they have 100+ employees.

Running software in the cloud gives you certain reliability and scaling advantages that would be very hard to replicate locally. For code agents, though, the comparison breaks the other way once local hardware gets "good enough": offline usage alone would be hugely valuable to many people and companies.

It'd be very interesting to see where various players would decide to make a call "local is good enough" though. Buying the hardware isn't a small bet, if it's not something that ends up as part of your standard computer.