Hacker News
by winocm 32 days ago
Perhaps I am the odd one out here, but a small part of me wants to see what happens when you run a proprietary SOTA model on a laptop.
7 comments

I'm currently testing something like this just to see what happens. I have an old laptop with 4 GB of RAM, and I attached a USB drive holding the Gemma 4 31B model (which is 32.6 GB). The laptop is running llama.cpp and trying to respond to a prompt by streaming the model from disk.

The USB drive light is flickering, showing something is happening. It's been about 8 hours since I entered the prompt and I've gotten about 10 tokens back so far. I'm going to leave it running overnight and see what happens.

Wow, that's a true worst-case scenario, especially if the USB is just plain old USB 2.0 (max 480 Mbps) and/or if the drive is a spinning disk. How's the CPU doing, though? Is there any headroom given the USB bottleneck?
Running top shows the llama-cli process taking 29% of CPU and 88% of memory, while the usb-storage process is taking 9% of CPU and 0% of memory.
Nice.

What did you use to do this: something standard like llama.cpp, something else like vLLM, or your own contraption?

llama.cpp

It has now spit out about 40 tokens after maybe 18 hours and has not finished the "thinking" stage of responding to the prompt. I'll let it keep running to see what happens.
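For a rough sense of scale, here is a back-of-the-envelope sketch in Python using only the numbers quoted in this thread (32.6 GB model, 480 Mbps USB 2.0 ceiling, 10 tokens in 8 hours, 40 tokens in 18 hours). It assumes a dense model whose weights all have to be read once per generated token, that 1 GB means 1e9 bytes, and that nothing useful stays cached in the 4 GB of RAM. Those are my assumptions, not measurements, and real USB 2.0 throughput is normally well below the nominal rate, so treat the "floor" as optimistic.

```python
# Figures quoted in the thread; everything else here is an assumption.
MODEL_BYTES = 32.6e9        # model size, assuming 1 GB = 1e9 bytes
USB2_BITS_PER_SEC = 480e6   # nominal USB 2.0 rate (real throughput is lower)


def seconds_per_token_floor(model_bytes: float, link_bits_per_sec: float) -> float:
    """Best-case seconds per token if every token streams the whole model once."""
    return model_bytes * 8 / link_bits_per_sec


def observed_seconds_per_token(hours: float, tokens: int) -> float:
    """Average seconds per token from elapsed wall-clock time and token count."""
    return hours * 3600 / tokens


floor = seconds_per_token_floor(MODEL_BYTES, USB2_BITS_PER_SEC)
first = observed_seconds_per_token(8, 10)    # "about 10 tokens" after ~8 hours
later = observed_seconds_per_token(18, 40)   # "about 40 tokens" after ~18 hours

print(f"USB 2.0 floor: {floor / 60:.1f} min/token")
print(f"after 8 h:     {first / 60:.1f} min/token")
print(f"after 18 h:    {later / 60:.1f} min/token")
```

Under those assumptions the link alone costs roughly 9 minutes per token, and the reported progress works out to about 48 and then 27 minutes per token, so the observed rate is a few times slower than the idealized floor.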

Not sure if this is exactly the scenario you envision, but I run ComfyUI on a four-year-old Acer Helios 300 laptop. It has 16 GB of RAM and an NVIDIA GeForce RTX 2060 with 6144 MiB of VRAM, and I have generated a few images using the 10.6 GB "NetaYumev35_pretrained_all_in_one.safetensors" checkpoint (well beyond the 6 GB capacity of the RTX 2060). That said, it takes more than 10 minutes to complete the task. Of course, I have to close or hibernate all other apps and browser tabs. If I don't, the laptop's fans spin up like an airplane propeller. It's worth mentioning that I've tried this with other IDEs and they all seem to fail with some error or another, usually an out-of-VRAM issue. I've only gotten it to work with ComfyUI.

I use an Anaconda environment on Linux (though I would have preferred a "uv" environment) and automate the startup sequence with the following script (start_comfy.sh), run from the terminal, rather than starting the environment manually each time:

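#!/bin/bash
#
# temporary shell version

# Load conda's shell hook so "conda activate" works inside a script
eval "$(conda shell.bash hook)"
conda activate comfy-env

# Arguments after "--" go to ComfyUI: low-VRAM mode, VAE on the CPU
comfy launch -- --lowvram --cpu-vae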

Here are some of the images: https://imgbox.com/nqjYhdx3 https://imgbox.com/93vSWFic https://imgbox.com/qs1898dz

I'm hesitant to increase the sizes of the renders as that will surely stress my laptop's components.

I'm not running locally for exactly the same reason: to not stress my components. It seems we are in for a long haul due to this AI bubble (can't wait for it to pop), so I need to make sure I survive this madness, as I certainly can't afford to replace anything right now.
I don't know that any AI bubble will pop. AI can be used to accelerate therapies and cures and to make scientific advancements. Add to that quantum technology, which, if successful, should accelerate things, depending on who's at the wheel. The problem is the gap between now and then (e.g. an age of abundance). It's going to be a difficult road for a good number of the population until that day comes. I'm scouting potential locations of bridges to live under, so that I can find and claim one when homeless day arrives.

I can't help but feel that companies using AI while laying off employees are shooting themselves in the foot. The endgame for them will be zero profits, since displaced workers translate to no money to pay for goods and services :|

Both the bubble popping and its legitimate use cases can exist at the same time.

For example, the www bubble popped, but the Internet didn't go away.

True
I'm using a ROG Phantom laptop with a Strix Halo iGPU that has a whopping 128 GB of VRAM. Next year there will be the rumored Medusa Halo with 256 GB of VRAM, which is more than enough to run DeepSeek V4 Flash.
I don't think you're the odd one out. I would be very curious to try to run Opus 4.7 on a (high end) laptop. I'd also like to see how it runs on a high-end workstation rig built for it.
You burn your lap?
Nothing special?

I mean, the inference engine might need some tweaks to support whatever compute is available. But then, if you add a few terabytes of disk for swap, and replace the RAM with bigger sticks if possible, it should work? Slowly, of course, but there is no reason it should not.

The big difference will be measuring seconds per token instead of tokens per second.
Seconds per token is just fractional tokens per second ;)
> fractional

Reciprocal?
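To make the reciprocal point concrete, a tiny Python sketch: the two units are inverses of each other, so converting is a single division. The 1620 seconds per token figure is just 18 hours divided by the roughly 40 tokens reported earlier in the thread; the function names are made up for illustration.

```python
def tokens_per_second(seconds_per_token: float) -> float:
    """Convert seconds-per-token into tokens-per-second (its reciprocal)."""
    return 1.0 / seconds_per_token


def seconds_per_token(tokens_per_sec: float) -> float:
    """Convert tokens-per-second back into seconds-per-token."""
    return 1.0 / tokens_per_sec


# ~40 tokens in ~18 hours, per the USB-drive experiment above
slow = 18 * 3600 / 40                      # seconds per token
rate = tokens_per_second(slow)             # a small fraction of a token per second

print(f"{slow:.0f} s/token is {rate:.6f} tokens/s")
```

So "fractional tokens per second" and "seconds per token" describe the same speed; the second is simply easier to read once the rate drops below one.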

You can if you have enough ram slots?