| I’ve been experimenting with running small LLMs directly on mobile hardware (low-end Android devices), without relying on cloud inference. This is a summary of what worked, what didn’t, and why.

Cloud-based LLM APIs are convenient, but come with:
- latency from network round-trips
- unpredictable API costs
- privacy concerns (content leaving the device)
- the need for connectivity

For simple tasks like news summarization, small models seem “good enough,” so I tested whether a ~270M-parameter model (gemma3-270m) could run entirely on-device.

Model - Gemma3-270M INT8 Quantized
Runtime - Cactus SDK (Android NPU/GPU acceleration)
App Framework - Flutter
Device - MediaTek 7300 with 8GB RAM

Architecture
- User shares a URL to the app (Android share sheet).
- App fetches article HTML → extracts readable text.
- Local model generates a summary.
- Device TTS reads the summary aloud.
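
The fetch → extract → summarize steps above can be sketched roughly as follows. This is only an illustration in Python (the actual app is written in Flutter), not the code from the repo. `ReadableTextExtractor`, `extract_readable_text`, and `summarize_article` are hypothetical names, the list of skipped tags is an assumption, and `generate` is a stand-in for whatever call runs the on-device model. It is not the Cactus SDK API.

```python
from html.parser import HTMLParser
from typing import Callable


class ReadableTextExtractor(HTMLParser):
    """Collects visible text while skipping non-content containers."""

    # Assumption: these tags rarely hold article body text.
    SKIP_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # > 0 while inside any skipped container
        self._chunks = []     # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped containers.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def extract_readable_text(html: str) -> str:
    """Reduce raw article HTML to plain text suitable for a prompt."""
    parser = ReadableTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def summarize_article(html: str, generate: Callable[[str], str]) -> str:
    """Build a summarization prompt and hand it to the local model.

    `generate` is a hypothetical callable wrapping on-device inference.
    """
    text = extract_readable_text(html)
    prompt = "Summarize the following article in a few sentences:\n\n" + text
    return generate(prompt)
```

Passing the model call in as a plain function keeps the extraction step testable without a device, and a tag-skipping extractor like this one is exactly the part that breaks on JS-heavy pages, since the article text is not in the fetched HTML at all.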
Everything runs offline except the initial page fetch.

Performance
- Latency: ~450–900 ms for a short summary (100–200 tokens).
- On devices without NPU acceleration, CPU-only inference takes 2–3× longer.
- Peak RAM: ~350–450MB

Limitations
- Quality is noticeably worse than GPT-5 for complex articles.
- Long-form summarization (>1k words) gets inconsistent.
- Web scraping is fragile for JS-heavy or paywalled sites.
- Some low-end phones throttle CPU/GPU aggressively.

| Metric | Local (Gemma 270M) | GPT-4o Cloud |
| ------- | -------------------- | -------------------- |
| Latency | 0.5–1.5s | 0.7–1.5s + network |
| Cost | 0 | API cost per request |
| Privacy | Text stays on device | Sent over network |
| Quality | Medium | High |

GitHub - https://github.com/ayusrjn/briefly

Running small LLMs on-device is viable for narrow tasks like summarization. For more complex reasoning tasks, cloud models still outperform by a large margin, but the “local-first” approach seems promising for privacy-sensitive or offline-first applications.
Cactus SDK does a pretty good job of handling the model and acceleration. Happy to answer questions :) |