I don't think they're intended for rack usage like that. More like for people to put under their desks... there would be no reason to build the giant case with fancy silent-ish cooling if you're going to put them next to your other jet engines.
Fully agree, and I think the tinybox is great if you put only one of them somewhere in your local office.
I just don't think it makes sense to connect multiple of them into a "cluster" to work with bigger models, as the networking bandwidth isn't good enough and you'd have to fit multiple of these big boxes into your local space. Then I might as well put up a rack in a separate room.
there's an ocp 3.0 mezzanine, so no need to remove a card and you'd get 200gbps, unless I've missed something about needing to remove a card to access it. But yeah stacking these or racking them seems less than ideal.
I don't see why 6 is inherently worse than 4 or 8, not all of the layers are exactly equal or a power of 2 in count. 2^2, 2^3, vs 2^1*3^1 might give you more options.
The main issue I run into mainly is flops vs ram in any given card/model.
Usually you want to split each layer to run with tensor parallelism, which works optimally if you can assign each kv head to a specific GPU. All currently popular models have a power of 2 number of kv heads.