| HN Mirror



	by LuxBennu 113 days ago
	I run whisper large-v3 on an m2 max 96gb and even with just inference the memory gets tight on longer audio, can only imagine what fine-tuning looks like. Does the 64gb vs 96gb make a meaningful difference for gemma 4 fine-tuning or does it just push the oom wall back a bit? Been wanting to try local fine-tuning on apple silicon but the tooling gap has kept me on inference only so far.

3 comments

weitendorf 112 days ago

Hey I was literally just working on this today (I was racing ahead on an audio FT myself but OP beat me by a few hours). For audio inference, definitely try running your input through VAD first to drop junk data, as one of several preprocessing steps if necessary, before sending the audio to the large model. You can check out how I did it here: https://github.com/accretional/vad/blob/main/pkg/vad/vad.go

I was using https://huggingface.co/onnx-community/pyannote-segmentation-... because with ONNX, I could run it on Intel servers with vectorized instructions, locally on my Mac, AND in-browser with transformers.js

VAD is absurdly time-effective (I think like O(10s) to segment 1hr of audio or something) and reduces the false positive rate/cost of transcription and multimodal inference since you can just pass small bits of segmented audio into another model specializing in that, then encode it as text before passing it to the expensive model.
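
To make the idea concrete, here is a minimal sketch of the preprocessing step. It is not the pyannote segmentation model linked above; it swaps in a much simpler energy-threshold VAD, and the function name, frame size, and threshold are all assumptions chosen for illustration.

```python
import math

def energy_vad(samples, sample_rate, frame_ms=30, threshold_db=-35.0, min_speech_ms=90):
    """Return (start_sec, end_sec) spans whose frame energy exceeds threshold_db.

    Hypothetical stand-in for a learned VAD: `samples` is a sequence of floats
    in [-1, 1]; the threshold and frame size are illustrative defaults.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame_ms is too small for this sample rate")
    n_frames = len(samples) // frame_len

    # Mark each frame active when its RMS level (in dBFS) is above the threshold.
    active = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        level_db = 20.0 * math.log10(max(rms, 1e-10))
        active.append(level_db > threshold_db)

    # Merge consecutive active frames into (start_frame, end_frame) runs.
    runs = []
    start = None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, n_frames))

    # Drop runs shorter than min_speech_ms and convert frame indices to seconds.
    min_frames = max(1, min_speech_ms // frame_ms)
    frame_sec = frame_len / sample_rate
    return [(s * frame_sec, e * frame_sec) for s, e in runs if e - s >= min_frames]
```

Only the returned spans would then be cut out and passed to the transcription model; a learned VAD should be far more robust to noise than this threshold.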

MediaSquirrel 112 days ago

Great minds think alike!

Also, I had a huge head start, as I spent a month or two working on this in September 2025, shelved it and dusted it back off this weekend.

weitendorf 112 days ago

Excellent work still, your repo is much more robust and fleshed out, and I am just beelining straight to audio LoRA not really knowing what I'm doing, as this is my first time attempting a ~real ML training project.

I think in https://github.com/mattmireles/gemma-tuner-multimodal/blob/m... and https://github.com/mattmireles/gemma-tuner-multimodal/blob/m... and https://github.com/mattmireles/gemma-tuner-multimodal/blob/m... you have a superset of the various kludges I have in my finetuning repo. I'm going to study this and do what I can to learn from it. Really appreciate you sharing it here!

Definitely interested in swapping notes if you are, though. Probably the biggest thing that came out of this exercise for us was realizing that Apple actually has some really powerful local inference/data processing tools; they are just marketed much more towards application developers, so a lot of them fly under the radar.

We just published https://github.com/accretional/macos-vision to make Apple's local OCR, image segmentation, foreground-masking, facial analysis, classification, and video tracking functionality accessible via CLI, and hopefully more commonly used in ML and data workloads. Hopefully you or someone else can get some use out of it. I definitely will from yours!

MediaSquirrel 112 days ago

Look inside here: https://github.com/mattmireles/gemma-tuner-multimodal/tree/m...

Here’s the trick: use Gemini Pro deep research to create an “Advanced Hacker’s Field Guide for X” where X is the problem that you are trying to solve. Ask for all the known issues, common bugs, unintuitive patterns, etc. Get very detailed if you want.

Then feed that to Claude / Codex / Cursor. Basically, create a cheat sheet for your AI agents.

This will unlock a whole new level of capability.

I’m @mattmireles on Twitter — feel free to DM me.

MediaSquirrel 113 days ago

Memory usage increases quadratically with sequence length. Therefore, using shorter sequences during fine-tuning can prevent memory explosions. On my 64GB RAM machine, I'm limited to input sequences of about 2,000 tokens, considering my average output for the fine-tuning task is around 1,000 tokens (~3k tokens total).
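
A rough sketch of where the quadratic term comes from, assuming a naive attention implementation that materializes the full score matrix for every head. The head count and element size below are placeholders, not Gemma's actual configuration, and the result covers a single layer only.

```python
def attention_scores_bytes(seq_len, num_heads, batch_size=1, bytes_per_element=2):
    """Bytes needed to hold one layer's (seq_len x seq_len) score matrices.

    Assumes the scores are fully materialized; num_heads and
    bytes_per_element (2 = fp16/bf16) are illustrative values.
    """
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element

# Doubling the sequence length quadruples this term.
for seq_len in (1000, 2000, 3000, 6000):
    gib = attention_scores_bytes(seq_len, num_heads=16) / 2**30
    print(f"{seq_len:>5} tokens: {gib:.3f} GiB per layer")
```

During training these activations are typically kept for the backward pass across every layer, which is why the total grows so quickly with sequence length.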

LuxBennu 113 days ago

Ah that makes sense, quadratic scaling is brutal. So with 96gb i'd probably get somewhere around 4-5k total sequence length before hitting the wall, which is still pretty limiting for anything multimodal. Do you do any gradient checkpointing or is that not worth the speed tradeoff at these sizes?
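
As a back-of-the-envelope check on that estimate, here is a small sketch that scales the ~3k-token figure from 64 GB to 96 GB. It assumes the sequence-dependent memory is purely quadratic and that a fixed amount (weights, optimizer state, OS) does not scale at all; the `fixed_overhead_gb` value is a made-up parameter, not a measurement.

```python
import math

def scaled_seq_len(base_seq_len, base_mem_gb, new_mem_gb, fixed_overhead_gb=0.0):
    """Estimate the max sequence length on a larger machine.

    Assumes memory = fixed_overhead + k * seq_len**2, so the sequence length
    grows with the square root of the remaining memory budget.
    """
    base_budget = base_mem_gb - fixed_overhead_gb
    new_budget = new_mem_gb - fixed_overhead_gb
    if base_budget <= 0 or new_budget <= 0:
        raise ValueError("fixed overhead must be smaller than both memory sizes")
    return int(base_seq_len * math.sqrt(new_budget / base_budget))

print(scaled_seq_len(3000, 64, 96))                        # no fixed overhead
print(scaled_seq_len(3000, 64, 96, fixed_overhead_gb=40))  # hypothetical 40 GB fixed
```

With no fixed overhead this gives roughly 3.7k tokens; the 4-5k range only appears under this model if a large share of the 64 GB is fixed cost.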

MediaSquirrel 112 days ago

Haven’t tried yet. That’s on the to-do list. But good suggestion.

zozbot234 112 days ago

Shouldn't FlashAttention address the quadratic increase in memory footprint wrt. fine-tuning/training? I'm also pretty sure that it does not apply to pure inference due to how KV-caching works.

MediaSquirrel 112 days ago

re: Whisper v3 -- how is this possible? Whisper has a 30s context window. You have to chunk it.

LuxBennu 112 days ago

Yeah sorry that was unclear on my part. I chunk at the endpoint level, whisper itself obviously processes 30s windows. The memory/latency thing I was referring to is more about processing longer files end to end through the pipeline, not a single whisper pass. My fastapi wrapper just splits the audio and runs chunks sequentially so total wall time scales linearly with file length, nothing fancy.
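
Roughly what that looks like, as a sketch rather than the actual wrapper: the audio is cut into fixed 30-second spans and each span is transcribed in turn. `transcribe_chunk` is a hypothetical callable standing in for the real Whisper call, and the FastAPI layer is left out.

```python
def chunk_spans(num_samples, sample_rate, chunk_seconds=30.0):
    """Return (start, end) sample indices covering the audio in fixed windows."""
    chunk_len = int(sample_rate * chunk_seconds)
    if chunk_len <= 0:
        raise ValueError("chunk_seconds is too small for this sample rate")
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]

def transcribe_sequentially(samples, sample_rate, transcribe_chunk, chunk_seconds=30.0):
    """Run transcribe_chunk on each window in order and join the text.

    Only one chunk is in flight at a time, so peak memory stays roughly
    constant while wall time grows linearly with file length.
    """
    texts = []
    for start, end in chunk_spans(len(samples), sample_rate, chunk_seconds):
        texts.append(transcribe_chunk(samples[start:end]))
    return " ".join(t.strip() for t in texts if t.strip())
```

Fixed windows can split a word at a boundary, so a real pipeline would probably cut at silences instead.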

sipjca 112 days ago

Wondering similar. It certainly can run beyond 30 seconds, but at some point I believe the output should degrade.

Plus you could do actual batch inference instead. Or if you must carry forward the context you could still do it linearly, but the mem usage shouldn’t just explode
