Hacker News new | ask | show | jobs
by nicohayes 263 days ago
Could you clarify whether the 2B active parameter concept refers to per-token inference and how this scales with context length? Specifically how MoE affects activation during inference and any practical implications for latency.