by uberdavid 57 days ago
Author here. DeepSeek-V4 replaced multi-domain RL with a specialist-then-distill pipeline: train domain experts independently, then merge them through on-policy distillation. This post connects that production decision to three recent papers (Neural Thickets, Sparse but Critical, Apple's SSD) that together suggest pretrained LLMs already contain dense neighborhoods of task-specific experts, and that post-training is just navigating to the right one.

Curious whether anyone has tried specialist-then-merge pipelines at smaller scale, or whether the anti-correlation of experts that Neural Thickets observes holds up in practice when you're fine-tuning for production use cases.