by MichaelMoser123 394 days ago
DeepSeek-V2, V3, and R1 all use multi-head attention.