by olq_plo 397 days ago
Very cool idea. Can't wait for converted models on HF.

DeepSeek-V2, V3, and R1 all use multi-head latent attention (MLA).