| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by reisse 35 days ago

> They will be, and that moment is not that far off.

It's here, right now. I'm running quantized Qwen and Gemma on a decent, but three years old gaming rig (think RTX 3080 12GB and 32 GB RAM). Yes, it's slow, it has a small context window. But it can (given a proper harness) run through my trip photos and categorize them. It can OCR receipts and summarize spendings. It can answer simple questions, analyze code and even write code when little context is required. Probably I could get a half-decent autocomplete out of it, if I bother with VS Code integration. "128 GB VRAM on a MacBook Pro or a Strix Halo" is already a minimum viable setup for agentic coding, I think.

> And then we'll have the equilibrium we already have with the "classic cloud": you either self-host or pay for flexibility and speed.

Currently, it works exactly the other way. The cloud versions are orders of magnitude cheaper than self hosting, because sharing can utilize servers much more efficiently. Company can spend half a million bucks on a rig running GLM 5.1, and get data security, flexibility and lack of censorship, but oh it's so expensive compared to Anthropic per-seat plans.

12 comments

pbgcp2026 34 days ago

I'm sorry to spoil it for you, but Perl script was able to do all of that like ... 10 years ago? The out-of-the-box Shotwell manages photos quite well without any intelligence. The problem, as people mentioned above, is SOTA models cognitive and tooling abilities. Also, have you noticed as top-end Mac Studios got downgraded recently? They don't want you to have access to frontier models. And you will not have it. See Mythos as Exibit A.

jclardy 34 days ago

The Mac Studio's disappearance is related to the fact that people now want them for the purpose of running local models. Supply and demand. That plus Apple doesn't shift prices for released products, and it essentially became underpriced when large RAM quantities exploded in price. For the price of 512GB of RAM alone you could get an M3 Ultra with 512GB of unified memory in a nice, quiet, and power efficient package. With the RAM you still need to spend a few thousand more on CPU/GPU, power supplies, storage and case.

Also the fact that an M5 version will be coming, and they likely know they are going to sell out on day one (I expect we'll see a price correction from Apple for higher end configs of M5 studios, base price will probably stay the same), so they need to build up stock reserves.

lowbloodsugar 34 days ago

512GB of ram with I think 600GB/s access. It’s the bandwidth that makes the studio killer for inference.

zigzag312 34 days ago

> The out-of-the-box Shotwell manages photos quite well without any intelligence.

This piqued my interest on how it does it and after briefly checking the project it seems it only has two features for automatic photo categorization. 1) it can group photos by date and 2) It has face detection and recognition that uses trained weights (so ML "intelligence").

mystraline 34 days ago

Immich (server) also has a whole host of ML features for classification as well.

I got away from google images and upload to my own Immich instance.

I also use an open source camera app on fdroid to degoogle that whole path.

IMTDb 34 days ago

> They don't want you to have access to frontier models. And you will not have it. See Mythos as Exibit A.

"They" fully well know that they current frontier model are maybe 6 month ahead of what people will have access to without their control. See Deepseek as Exibit B

The reason you can't run these locally are more with the fact that those mythos sized models require extreme amount of memory and processing power to run at acceptable speeds. And neither you, nor I can afford to pay for those resources to run those models locally. A big reason is that "running locally" means running on your own hardware. And for almost everyone this means "running on hardware that will spent a big portion of its time just sleeping". Because data center and providers have higher utilization rates, they can easily outpace you. That and the fact that when they place an order it's usually for hundreds of thousands of units.

PeterStuer 34 days ago

I am convinced the (mainly chinese) open weights models are the only reason OpenAI and Anthropic release at the pace they do. Without them being on their heels, we would have seen a stagnant duopoly in terms of public releases.

That is why the huge lobby machine is grinding away to make those models illegal.

bee_rider 34 days ago

Although, I wonder how many orders of magnitude in terms of affordability the utilization rate actually gets them. Realistically if you use a self-hosted LLM for your job, you might be using it, what, a solid 6 hours per day? Assuming you can keep it actually fed, while working (so, some agentic thing might be necessary, I guess it will need to be more than VSCode autocomplete and responding to individual prompts). Anyway, that starts you out at 1/4’th the utilization, a 4X price increase might be worth paying for privacy and stability (no sudden change in model behavior, no price changes, no days when the system is over-utilized for reasons outside your control).

Rather I think it is just hard for local LLMs to compete in this early stage when the cloud providers are allowed by investors to be unprofitable.

zozbot234 34 days ago

> Realistically if you use a self-hosted LLM for your job, you might be using it, what, a solid 6 hours per day?

You can grow the utilization rate well beyond that if you don't always care about getting a quick, real-time response. (And if you do, then maybe the cloud model was the better deal after all!)

hedora 34 days ago

Isn't Mythos that screw up where Anthropic failed to ship something that was no better than the product OpenAI launched a few weeks later?

And, assuming the allegations are true, don't things like Deepseek and Qwen offer existence proofs that frontier models are (and will forever be) trivially distilled down to run domain-specific tasks on boxes that cost a few months of Claude Max subscription?

Hamuko 34 days ago

>Also, have you noticed as top-end Mac Studios got downgraded recently? They don't want you to have access to frontier models. And you will not have it.

Isn't that a function of RAM supply not being available now?

aceazzameen 34 days ago

OpenAI did buy out the RAM supply to block competition. Arguably local models are one of its (smaller) competitors.

Even if that weren't the case, every corp _needs_ you to be on a subscription.

Hamuko 34 days ago

They didn't really even buy the RAM. But there's pretty significant demand for RAM in general with data centers being planned left and right.

tjoff 34 days ago

Do we even have decent OCR nowadays? Any free solutions?

Farmadupe 34 days ago

The latest rounds of open weights vision language models are incredibly good. Like, massively good. Open weights vision capabilities trade blows with frontier models. Over the last few months I'd roughly rank capabilities as Gemini -> {chatgpt and SoTa open weights models} -> Claude.

qwen3.5-2b and qwen3.5-4b are great at document parsing. They can run on CPU

qwen3.6-27b and gemma4-31b are borderline better than the human eye in some cases. Their OCR isn't perfect, but they're seriously good. They can still run on the CPU but you'll be waiting minutes per document.

You can demand JSON, YAML, MD, or freeform text just by varying the prompt. Even if you have a custom template, you can just put that in the prompt and they'll do an OK-ish job.

There's also models that aren't in the r/locallama zeitgeist. IBM released a new 4b parameter model for structured text extraction last week, and there's a sea of recent chinese OCR models too.

IMO the open wights models are so good that in a lot of cases it's not worth paying frontier labs for OCR purposes. The only barrier to entry is the effort to set up a pipeline, and havin the spare CPU/GPU capacity.

adrian_b 34 days ago

Many of the open-weights LLMs accept either text or images as input.

Besides those, there are a few smaller open-weights models that are dedicated for OCR tasks, for instance DeepSeek-OCR-2 and IBM granite-vision-4.1-4b. (They can be found on huggingface.co)

The dedicated vision models can be run on much cheaper hardware, including smartphones, than the big models that can process images besides text.

Similarly, besides bigger multimodal models, that can accept audio, images or text as imput, there are smaller open-weights models that are dedicated for speech recognition, e.g. Xiaomi MiMo-V2.5-ASR and IBM granite-speech-4.1-2b.

PeterStuer 34 days ago

Depends on your use case. My procuction runs satisfactory on a local docling-serve ( https://github.com/docling-project/docling-serve ), but that is mostly easy relatively clean scans of decently typeset documents with some typical scanning artefacts.

lrvick 34 days ago

The qwen models not only have good OCR, they will describe pictures to you.

rurban 28 days ago

They not not only describe pictures. They can analyze pictures. Detect anomalies. Create 3d models out of it.

mapt 34 days ago

Anyone wanna do a quick offline MVP on a general vision assistant for the blind? We've had things like Google Lens for a while, but it's a bit vision and touchscreen-centric.

woctordho 34 days ago

API for Mythos and GPT Cyber are circulating in the market (That's also why we can use Claude and GPT in China). The open source community has been advancing subscription engineering for a long time, and I don't think Anthropic or OpenAI have any technical advantage in this field.

JonGretarB 34 days ago

Huh? Why would Apple not want you to be able to run local models? They have very deliberately stayed the hell away from this space.

ubercore 34 days ago

The conspiracy angle here is not really relevant. Ram is expensive and they're gearing up for M5 studios. Not the illuminati keeping better LLM models out of your hands.

lkjdsklf 34 days ago

They did decrease the memory bandwidth for.... reasons... which didn't make much sense.. but yeah this is some pretty weird conspiracy stuff.

Apple doesn't even sell a model. They just have a deal to use Googles. They can't "protect" their cloud version of a model they don't have.

raincole 34 days ago

You think Apple doesn't want you to use local models?

That's an interesting way to view the world. I mean, utterly stupid as it is, but interesting.

But the previous sentence is even stupider (a Perl script 10 years ago could write code like Qwen does now?), so I guess at least it's consistent.

digitaltrees 35 days ago

I built my own IDE and run my own model specifically to have private agentic coding. I can still access model APIs but I can be purely local if I want too. It’s amazing.

manmal 35 days ago

Curious, why did Zed with ACP not work for you?

digitaltrees 34 days ago

Because I wanted the full ide on my iPhone so I can code while away from my laptop doing fun stuff with my kids. And I don’t like the Claude codex fire and forget approach.

The ide I built has a full terminal, file system, git integration and AI agent. It uses a private cloud Linux container that is persistent so I can install packages and do anything I want from any phone, computer or browser. It’s amazing that we live in a time where we can build custom software for ourselves just for fun. I will never have to worry about cursor or vs changing getting bought and moth balled like Atom (my favorite ide). I now own my tool and will forever.

fud101 33 days ago

Literally will break overnight when some key dependency changes. Your LLM might not be able to fix it. Then i guess you regenerate it all from scratch? Sounds exhausting tbh.

digitaltrees 33 days ago

I’ve built enterprise software for 10 years with multiple upgrades over that time. With good test coverage and the right abstractions maintenance is feasible.

Also, because I wrote and own the code I don’t have to update if I don’t want to. I could choose instead to build around the dependency. That’s much more control over than when Microsoft bought GitHub and destroyed the Atom ide which I loved in favor of vscode which I still hate

Fokamul 34 days ago

I'm just guessing, but IDE which is using 3D acceleration just for stupid UI to run "smoothly", that is ridiculous.

Who runs IDE with LLM agents accessing your local filesystem, on bare metal?

Or am I alone to run everything LLM related on my VM just for development work. Then because of ZED genius decision, you need to share your GPU to VM, then some important features will not work, like snapshots. So you also need workaround for this, etc.

Too much hassle, Zed is not for me.

But I'm anti-Apple, so maybe that's the reason :)

Btw, even "ImHex" devs realized this and they're providing version without acceleration for VM use. They're using ImGui. Using it for local desktop app UI is also ridiculous, imho. Whatever.

oslem 34 days ago

I would imagine running a local LLM for development isn’t as popular as using a hosted provider. I don’t personally host a local model, but I have shared GPUs and storage volumes with VMs and I didn’t see it as that much of a hassle. What kinds of problems are you running into?

Doesn’t ghostty also use graphics acceleration? I was under the impression that rendering text is a relatively challenging graphics compute task.

digitaltrees 34 days ago

I run local LLM on my MacBook together with frontier models for different tasks. I am in the process of setting up a 3 Mac studio system to serve AI to my team.

hedora 34 days ago

What's wrong with using a 3d accelerator and falling back to CPU graphics if needed? Pixels / joule is orders of magnitude better on an iGPU than on the CPU. (Which can matter over a 8-12 hour editing session, maybe.)

zozbot234 34 days ago

Modern IDEs don't use 3D at all, nor do they use the sprite-like 2D graphics that GPUs excel at and that can accelerate, e.g. mobile touch- and swipe-based UX. The main thing they do is font rendering, and accelerating that on GPU while keeping visual quality unchanged is quite complicated. The graphics pipeline doesn't really help all that much.

manmal 34 days ago

Agents are read-only per default in Zed. You should really get off your high horse.

DrewADesign 35 days ago

Multiple gazillion dollar companies each seem to be spending to ensure that they alone pretty much dominate all knowledge work, with customers eating up their tokens like Cookie Monster. I wonder if the any of them could survive as LLM providers if they not only failed to do that, but the entire industry ended up selling what the current Cookie Monster would call a “sometimes snack,” for very special occasions?

datadrivenangel 35 days ago

In my experience once you get to ~30 gigs of ram for a model like Gemma4, the rest of the 128g of memory is simply nice to have. The speed and costs are what make it tough though, because its slower and more expensive than the same model served on a big accelerator card, and is going to be worse than a frontier model.

digitaltrees 35 days ago

I wonder if it really needs to be worse. I am playing with the idea of fine tuning a model on my exact stack and coding patterns. I suspect I could get better performance by training “taste” into a model rather than breadth.

epicureanideal 35 days ago

I also wonder about JS only, Python only, etc models.

Maybe the future is a selection of local, specific stack trained models?

robrenaud 34 days ago

There is some recent work on modularizing knowledge in LLMs.

https://arxiv.org/html/2605.06663v1

It might be possible to train a big generalist that is a composition of modules, some of which can be dropped dynamically at inference time, depending on the prompt.

digitaltrees 29 days ago

Cool. Thanks for sharing. I am thinking about creating a series of smaller models for specific purposes and then orchestrating them so they mirror the human brain which is a bunch of subsystems that give multiple opinions about the same stimulus

shailendra_sis 29 days ago

Interesting direction. I’ve also been thinking about modular / subsystem-based approaches for specialized tasks in small AI systems.

andy_ppp 35 days ago

These models being able to generalise at coding will likely get worse if you remove high quality training data like all of python.

jimbokun 34 days ago

That approach has its advantages, but sometimes I want to generate code for a language or kind of project I’m not experienced with using the accepted best practices.

andy_ppp 35 days ago

Fine tuning these models (at least with PPO or equivalent) requires even more VRAM than inference does, potentially 2-3 times more.

rusk 34 days ago

You could use PEFT? Operating on only a subset of weights is fairly standard practice nowadays …

andy_ppp 34 days ago

Yes I used LoRA and it’s fine but I’m not convinced the model doesn’t end up more stupid and less general

ElectricalUnion 33 days ago

You need the rest of the ram for the context. If you don't want to end up with a toy context or quantized lossy context, is pretty easy to end up having to spend up 50+GB just for the KV cache, per simutaneous inference slot.

sanderjd 34 days ago

Are there any harnesses that are attempting to optimize for using local models like this? Unsurprisingly, my naive attempts to integrate with harnesses designed for frontier models have gone poorly. But it seems like a harness that understands the capabilities and limitations better could perform significantly better.

fennecfoxy 34 days ago

>It's here, right now.

I mean I've been forcing my good old 1080ti to run local models since a short while after llama was first leaked.

But I wouldn't say "local models are here" in the same way as "year of the Linux desktop!111"

Until someone can just go out and buy some sort of "AI pod" that they can take home, plug in and hit one button on a mobile app to select a model (or even just hide models behind various personas) then I wouldn't say it's quite there yet.

It's important that the average consumer can do it, I think the limitations for that are: things are changing too quickly, ram+compute components are exceedingly expensive now, we're still waiting on better controls/harnesses for this stuff to stop consumers not just from shooting themselves in the foot, but blowing their foot clean off.

Would be interesting to see a Taalas-like chip in a product, albeit there's so many changes going on atm with diffusion based models, Google's Turboquant (which as someone who has had to almost always run quantized models, makes a lot of sense to me).

skillina 34 days ago

What is the use case you see for non-technical users self-hosting? I think it’s important that tools remain available but I don’t expect it to be adopted by “average consumers.”

I’m interested in self-hosting for privacy and control. I already owned the hardware I’m testing with, so my spend is limited to time and electricity.

The “LLM pods” you describe will be loaded with spyware and adware (see: Smart TVs), and average consumers won’t max their compute around the clock so naturally data centers are able to make more efficient use of hardware by maximizing utilization.

fennecfoxy 34 days ago

Agree with your point on them being loaded up with spyware etc because that's just how it is now I suppose.

In terms of maximising compute I kind of agree but also kinda not - people's laptops and phones aren't burning at 100% 24/7 either. Sure AI requires so much more compute...but not _that_ much more, especially as technology marches on.

For the general use case; I could be wrong but I'd see it sort of like a GPU/NAS/etc. "Pay once" rather than a subscription (to a service offered by a datacenter).

But tbf, the way things are now _is_ all subscription models and consumers just kinda let it happen. I would love to be able to pay a one-off fee for lightroom...but I can't because they want a subscription to "pay for all the updating we're doing". They barely update shit.

kelnos 34 days ago

And on top of that, I'm sure the "LLM pod" will still be sold on a subscription model so you get model updates etc.

But I wish we could actually have nice things. I imagine there's a niche for a middle ground: a privacy-preserving device that uses local-only models and doesn't spy on the user, and sells for a one-time payment with no subscription. It'll be expensive, though, likely more expensive than using a cloud-hosted model.

cl0ckt0wer 34 days ago

There are local ai pods. They're like 2k for a low end.

yieldcrv 35 days ago

I need to see these proper harnesses

I tried oMLX and OpenCode a few weeks ago and the 65k context window was useless, it tried to analyze a very small codebase before going full on agentic and ran out of context window immediately

I don't have time to tweak 1,000 permutations of settings just re-prove that its not as smart as Opus 4.6

I need out the box multimodal behavior as similar as typing claude in the command line and its so not there yet

but I'm open to seeing what people's workflows are

cyberax 35 days ago

I'm playing with a tape drive for backups, so I asked a local model to rewrite LTFS ( https://github.com/LinearTapeFileSystem/ltfs ) in Go.

I gave it the reference C implementation, the LTFS spec from SNIA, and asked it to use the C implementation to verify the correctness of the Go code.

LTFS is a pretty straightforward spec, so it made a very reasonable port within about 2 days. It's now working on implementing the iSCSI initiator (client) to speak with my tape drive directly, without involving the kernel.

Edit: the model is Qwen3.6-35B

phamilton 35 days ago

I'm running opencode with qwen3.6-35b-a3b at a 3-bit quant. I also have qwen3.5-0.8b used for context compaction. I run with 128k context.

It's usable. I set it loose on the postgres codebase, told it to find or build a performance benchmark for the bloom filter index and then identify a performance improvement. It took a long time (overnight), but eventually presented an alternate hashing algorithm with experimental data on false positive rate, insertion speed and lookup speed. There wasn't a clear winner, but it was a reasonable find with rigorous data.

Balinares 35 days ago

Do you encounter looping issues at such low quants? How do you deal with those?

nullsanity 35 days ago

Hey man, you can just say "I'm lazy, so I'm staying with the cloud. if I wanted to use my brain, I wouldn't be using AI, gosh" - it's much shorter.

fennecfoxy 34 days ago

Personal attacks are against the rules, by the way.

yieldcrv 34 days ago

all the money and clout is in considering people’s reported problems as valid and solving them

so when I encounter a common but invalidated friction, I explain it like I’m 5, understanding that many of the engineering and entrepreneurial problem solvers have the emotional intelligence of a 5 year old

jimbokun 34 days ago

Has anyone tried to calculate the break even cost of buying a PC to run an LLM locally, vs the amount of tokens you could get from an AI provider?

zozbot234 34 days ago

The basic answer: very much not worth it at face value, becomes arguably worth it once you start worrying about future rug pulls from the big AI providers. (And that does include the market for third-party inference, at least at present.) It's also worth it if you have existing hardware to repurpose, but that's obvious and not what you were asking about.

thot_experiment 34 days ago

Also you can feed it ALL of your data willy nilly without ever worrying about safety because you can just do it with the LAN cable unplugged, for applications that demand data hygiene it's a cheat code that guarantees safety without any sort of data sanitization.

nsvd2 34 days ago

I run Gemma locally on a 3090, it's amazing how useful it is to be able to call out to ollama in a bash script or cron job.

winocm 35 days ago

Perhaps I am the odd one out here, but a small part of me wants to see what happens when you run a proprietary SOTA model on a laptop.

pianopatrick 35 days ago

Currently I'm testing something like this just to see what happens. I have an old laptop with 4GB of RAM. I attached a USB drive with Gemma 4 31B model (which is 32.6 GB). Currently the laptop is running llama.cpp and trying to respond to a prompt by streaming the model from disk.

The USB drive light is flickering, showing something is happening. It's been about 8 hours since I entered the prompt and I've gotten about 10 tokens back so far. I'm going to leave it running overnight and see what happens.

zozbot234 34 days ago

Wow, that's a true worst case scenario especially if the USB is just plain old USB 2.0 (max 480 Mbps) and/or if the drive is a spinning disk. How's the CPU doing, though? Is there any headroom given the USB bottleneck?

pianopatrick 34 days ago

running top shows the process llama-cli taking 29% of CPU and 88% of memory, while process usb-storage is taking 9% of cpu and 0% of memory

stuaxo 34 days ago

Nice.

What did you use to do this, something standard like llamacpp or something else like vllm or your own contraption ?

pianopatrick 34 days ago

llama.cpp

It's now spit out about 40 tokens after maybe 18 hours and has not finished the "thinking" stage of responding to the prompt. I'll let it keep running to see what happens

SilentM68 35 days ago

Not sure if this is exactly the scenario you envision but I run ComfyUI on an Acer Helio 300 laptop, from four years ago. Has 16GB RAM, NVIDIA GeForce RTX 2060 w/6144MiB of VRAM and have generated a few images using "NetaYumev35_pretrained_all_in_one.safetensors" @ 10.6GB checkpoint, (well beyond the 6GB capacity of the RTX 2060 card). That being said, it takes more than 10 minutes to complete the task. Of course, I have to turn off all other apps, and browser tabs or hibernate them. If I don't, the laptop's fans begin to spin up like an airplane propeller. It's worth mentioning that I've tried to do this with other IDEs and all seem to fail with some error or another, usually out of VRAM issue. I've only gotten it to work with ComfyUI.

I use an anaconda environment, though would have preferred an "uv" environment, on Linux and automate the startup sequence using the following script (start_comfy.sh) from the term rather than manually starting the environment from same said term:

#!/bin/bash

#

# temporary shell version

eval "$(conda shell.bash hook)"

conda activate comfy-env

comfy launch -- --lowvram --cpu-vae

Here are some of the images: https://imgbox.com/nqjYhdx3 https://imgbox.com/93vSWFic https://imgbox.com/qs1898dz

I'm hesitant to increase the sizes of the renders as that will surely stress my laptop's components.

t_mahmood 35 days ago

I'm not running local for exactly the same reason, to not stress my components. As it seems we are in for a long haul due to this AI bubble (can't wait for it to pop) so need to make sure I survive this madness, as for sure I can't afford to replace anything right now.

SilentM68 34 days ago

I don't know that any AI bubble will pop. AI can be used to accelerate therapies, cures, make scientific advancements. Add to that, quantum science technology which if successful, should accelerate things, depending on who's the one at the wheel. Problem is the gap between now and then (e.g. age abundance). It's going to be a difficult road for good number of the population until that day comes. I'm scouting potential locations of bridges, to live under, so that I can find and claim one when homeless day arrives.

I can't help but feel that companies using AI, engaging in employee layoffs, are shooting themselves in the foot. The endgame for them will be zero profits, since displaced workers translates to no money to pay for goods and services :|

306bobby 33 days ago

Both the bubble popping and it's legitimate use cases can exist at the same time.

For example, the www bubble popped, but the Internet didn't go away

SilentM68 33 days ago

True

woctordho 34 days ago

I'm using ROG Phantom laptop with Strix Halo iGPU that has a whopper of 128 GB VRAM. Next year there will be the rumored Medusa Halo with 256 GB VRAM, which is more than enough to run DeepSeek V4 Flash.

kelnos 34 days ago

I don't think you're the odd one out. I would be very curious to try to run Opus 4.7 on a (high end) laptop. I'd also like to see how it runs on a high-end workstation rig built for it.

amelius 35 days ago

You burn your lap?

reisse 35 days ago

Nothing special?

I mean, inference engine might need to get some tweaks, to support whatever compute is available. But then, if you put a few terabytes of disk for swap, and replace RAM to bigger sticks if possible, it should work? Slowly, of course, but there is no reason it should not to.

reverius42 35 days ago

The big difference will be measuring seconds per token instead of tokens per second.

martijnvds 35 days ago

Seconds per token is just fractional tokens per second ;)

degamad 35 days ago

> fractional

Reciprocal?

yfw 35 days ago

You can if you have enough ram slots?

dust1n 35 days ago

Can you share how you use it to categorize trip photos!

Farmadupe 34 days ago

I'm not sure there's a one-stop shop for this at the moment. I think the process is:

* Have a box with sufficient spare (V)RAM -- probably 8G for simple categorization with qwen3.5-4b, and 24G or more for more intelligent categorization with qwen3.6-27b or gemma4-31b.

* Download or compile llama.cpp. Choose a model, then choose one of the "quantized" builds that will actually fit on your hardware. There are literally hundreds to thousands of these per model on Hugging Face.

* Spend half a day tuning command-line parameters until llama.cpp doesn't crash.

* Watch llama.cpp regularly OOM itself, then put it in a systemd service with a memory limit so it doesn't take the entire machine down when it dies.

* Download all your photos to a folder.

* Start vibing a Python script to categorize your images by repeatedly prompting the LLM with each image in turn.

* Spend days tweaking/refining the prompt to try to get the LLM to actually do what you want.

The endgame is one of:

* The local model categorizes your images. Yay.

* The local model is too slow and you give up. Boo.

* The local model is too slow, so you spend $1k-$10k on hardware. Your image categorization task becomes a cover story for buying new gear. Yay.

* The local model can't understand your categorization metric, so you give up. Boo.

* You eagerly await news of the next open model being released. Yay?

* You consider replacing your local model with a frontier model, but then you realize you'd be spending $500 to categorize your photos. Boo.

* You refuse to allow Google/Gemini/Anthropic to train on your nudes. Boo.

creativeSlumber 34 days ago

this is one of the most popular options. Self hosted. https://immich.app/

Mario9382 34 days ago

I'm also interested on how to do this

antidamage 35 days ago

This is my exact setup as well and dear lord gemma is absolutely batshit insane. I'm trying to get a self-reflection and confidence loop going now, but it does feel like it's not the local resources, it's the limits of the training. Dedicated coding or dedicated real-world task models would be a good optimisation.