| Hey Nico, Very cool to hear your perspective in how you are using the small LLMs! I’ve been experimenting extensively with local LLM stacks on: • M1 Max (MLX native) • LM Studio (GLM, MLX, GGUFs) • Llama.cp (GGUFs) • n8n for orchestration + automation (multi-stage LLM
workflows) My emerging use cases:
-Rapid narration scripting
-Roleplay agents with embedded prompt personas
-Reviewing image/video attachments + structuring copy for clarity
-Local RAG and eval pipelines My current lineup of small LLMs (this changes every month depending on what is updated): MLX-native models (mlx-community): -Qwen2.5-VL-7B-Instruct-bf16 → excellent VQA and instruction following -InternVL3-8B-3bit → fast, memory-light, solid for doc summarization -GLM-Z1-9B-bf16 → reliable multilingual output + inference density GGUF via LM Studio / llama.cpp: -Gemma-3-12B-it-qat → well-aligned, solid for RP dialogue -Qwen2.5-0.5B-MLX-4bit → blazing fast; chaining 2+ agents at once -GLM-4-32B-0414-8bit (Cobra4687) → great for iterative copy drafts Emerging / niche models tested: MedFound-7B-GGUF → early tests for narrative medicine tasks X-Ray_Alpha-mlx-8Bit → experimental story/dialogue hybrid llama-3.2-3B-storyteller-Q4_K_M → small, quick, capable of structured hooks PersonalityParty_saiga_fp32-i1 → RP grounding experiments (still rough) I test most new LLMs on release. QAT models in particular are showing promise, balancing speed + fidelity for chained inference.
The meta-trend: models are getting better, smaller, faster, especially for edge workflows. Happy to swap notes if others are mixing MLX, GGUF, and RAG in low-latency pipelines. |