|
|
|
|
|
by zozbot234
121 days ago
|
|
I guess it depends what you mean by "perf". If you optimize everything for the absolutely lowest latency given your power budget, your throughput is going to suck - and vice versa. Throughput is ultimately what matters when everything about AI is so clearly power-constrained, latency is a distraction. So TPU-like custom chips are likely the better choice. |
|
I disagree. Yes it does matter, but because the popular interface is via chat, streaming the results of inference feels better to the squishy messy gross human operating the chat, even if it ends up taking longer. You can give all the benchmark results you want, humans aren't robots. They aren't data driven, they have feelings, and they're going to go with what feels better. That isn't true for all uses, but time to first byte is ridiculously important for human-computer interaction.