In both cases this is a prime opportunity for anyone to disrupt Nvidia. They are in this market position in large part because both video games and neural networks do a lot of highly parallel floating point math, especially matrix multiplication. This model architecture doesn't do any of that.
Of course it should be fairly simple for Nvidia to add special silicon and instructions for two-bit addition to a future generation of their cards. But it'll take a while because they already have a roadmap and preexisting commitments. And a competitor doesn't have to copy everything Nvidia does to make floating point numbers go fast; they can just focus on making two-bit data handling and addition go fast.
Yes, but with their current market cap, the more likely result is they acquire one of the several competitors poised to take advantage of this and throw massive resources behind them.
BF16 is a pretty big unit in an ASIC. You need at least 9 * 5 gates to calculate the exponent of the result, a 10-bit barrel shifter (10*10 + 10*ceil(log2(10)) gates), and a 10-bit multiplier (approximately 10 * 10 * 9 gates).
Total = 1085 gates. The reality is probably far more, because you're going to want to use carry-look-ahead and pipelining.
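For anyone who wants to check that total, here is the same back-of-envelope estimate as a few lines of Python. It only re-adds the per-block formulas quoted above, so it is a rough gate count and not a synthesis result; the function name and parameter names are made up for illustration.

```python
import math

def bf16_gate_estimate(exp_gates=9 * 5, width=10):
    """Rough gate count from the per-block formulas quoted above (hypothetical helper)."""
    # Barrel shifter: width*width plus width * ceil(log2(width)) gates
    shifter = width * width + width * math.ceil(math.log2(width))
    # Multiplier: approximately width * width * 9 gates
    multiplier = width * width * 9
    return exp_gates + shifter + multiplier

total = bf16_gate_estimate()
print(total)  # 45 + 140 + 900
```

Changing `width` shows how quickly the multiplier term dominates, since it grows with the square of the operand width.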
Whereas 1-bit multiplies and adds into, say, a 16-bit accumulator use... 16 gates! (And probably half that, since you can probably use scheduling tricks to skip past the zeros, at the expense of variable latency...)
So when 1-bit math uses only around 1/100th of the silicon area of 16-bit math, and according to this paper gets the same results, the future is clearly silicon that can do 1-bit math.
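To see why the 1-bit case is so cheap, it may help to write the multiply-accumulate out. The sketch below is a software model, not a hardware description: it assumes weights are 0 or 1 and models the 16-bit accumulator with a wrap-around mask. The function name is hypothetical.

```python
def one_bit_dot(weights, activations, acc_bits=16):
    """Dot product with 1-bit weights: each 'multiply' is just a conditional add."""
    mask = (1 << acc_bits) - 1  # model a fixed-width accumulator that wraps around
    acc = 0
    adds = 0
    for w, x in zip(weights, activations):
        if w == 0:
            continue  # zero weight: nothing to add, so the step can be skipped
        acc = (acc + x) & mask
        adds += 1
    return acc, adds

acc, adds = one_bit_dot([1, 0, 1, 1, 0], [3, 7, 2, 5, 9])
print(acc, adds)
```

Only the nonzero weights cost an addition, which is the "skip past the zeros" trick; the price is that the number of adds (and so the latency) depends on the data. With the figures above, 1085 / 16 is about 68, or about 136 if zero-skipping halves the cost, so the 1/100th figure is the right ballpark.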
- we have llama.cpp (which could be enough; or at least, as mentioned in the paper, a co-processor could be added to accelerate the calculation, so less need for large RAM / high-end hardware)
- as most work is inference, we might not need as many GPUs
- consumer cards (24 GB) could possibly run the big models
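A quick way to sanity-check the 24 GB point is to compute the memory taken by the weights alone. This is only a sketch: the parameter counts (7B and 70B) are illustrative assumptions, weights are assumed to be stored at 2 bits each, and activations, KV cache and runtime overhead are ignored.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (10**9 bytes); hypothetical helper."""
    return n_params * bits_per_weight / 8 / 1e9

# Illustrative model sizes, not taken from the thread
for n_params in (7e9, 70e9):
    fp16 = weight_memory_gb(n_params, 16)
    two_bit = weight_memory_gb(n_params, 2)
    print(f"{n_params / 1e9:.0f}B params: {fp16:.1f} GB at 16 bits, {two_bit:.2f} GB at 2 bits")
```

Under these assumptions a 70B-parameter model drops from 140 GB of weights at 16 bits to 17.5 GB at 2 bits, which is why a 24 GB card starts to look plausible.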