by areddyyt 643 days ago
The non-linear layers, particularly softmax(QK^T), will be crucial to getting ultra-low latency and high throughput. We're considering some custom silicon just for that portion of every transformer block.
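
For anyone unfamiliar with the term, here is a rough NumPy sketch of what softmax(QK^T) computes. It is only an illustration, not our implementation: the function name, the toy shapes, and the 1/sqrt(d) scaling (the usual convention in scaled dot-product attention) are assumptions that are not part of the comment above.

```python
import numpy as np

def attention_weights(q, k, scale=None):
    """Row-wise softmax(Q K^T * scale). Hypothetical helper for illustration."""
    if scale is None:
        # Assumed default: the conventional 1/sqrt(d) attention scaling.
        scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ k.T) * scale                             # linear part: a matmul
    scores = scores - scores.max(axis=-1, keepdims=True)   # subtract row max for stability
    exp_scores = np.exp(scores)                            # non-linear part: exponentiation
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)  # row-wise normalization

# Toy sizes (assumed): 4 queries, 6 keys, head dimension 8.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((6, 8))
w = attention_weights(q, k)
```

The matmul maps well onto ordinary multiply-accumulate hardware. The exponentiation, the row-wise max and sum reductions, and the division do not, which is why that portion gets singled out.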