We forward a cluster of 2,560 TPU pod cores from our GCE project to other GCE projects in europe-west4-a. Originally this was because we had a separate GCE project with a bunch of credits, but that project had no access to TPUs. The question was, could we still take advantage of the credits? It turns out we could; the solution involved VPC Network Peering, which I later learned is how the TPUs themselves work. Some configuration details are here: https://www.shawwn.com/swarm#iptables
Nowadays we forward the TPU pods to pretty much anyone who wants to try them out, in hopes of getting more people involved in the TPU programming scene. The TPUs are managed via a website (https://www.tensorfork.com/tpus) and we coordinate TPU access via spreadsheet. Each researcher has their own GCE project, and we simply flip a switch to give them access.
If anyone reading this happens to be into ML and into programming for big hardware rigs, feel free to hop into the Tensorfork Discord server and we can show you the ropes. https://github.com/shawwn/tpunicorn#ml-community
I've done some dead-simple forwarding/load-balancing work, and if you can do it with NAT instead of a proxy application, it'll use a lot less memory as well as less CPU.
That means fewer load balancers, or smaller machines (or both). So any time you run out of capacity on your proxy machines is an opportunity to look at other techniques. HAProxy is probably easier to use, though, and tends to need less work to get the features you want. So there's an opex/capex vs. development time tradeoff.
Hyperscaling HAProxy is a lot of fun too, though. There's a huge difference in connections/second between a normal config and a fully tuned config with HAProxy tuning and kernel patching on the table.
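For anyone wondering what "do it with NAT" looks like in practice, here is a rough sketch in Python that only builds the usual iptables commands for a kernel-level port forward, without running them. The helper name `nat_forward_rules`, the backend address 10.0.0.2, and port 8470 are made up for illustration; this is not the configuration from the linked page, just the generic DNAT plus MASQUERADE pattern.

```python
import shlex


def nat_forward_rules(listen_port, dest_ip, dest_port, proto="tcp"):
    """Build (but do not run) the commands for a NAT-based port forward.

    Hypothetical helper: addresses and ports are supplied by the caller.
    """
    return [
        # Allow the kernel to route packets between interfaces.
        ["sysctl", "-w", "net.ipv4.ip_forward=1"],
        # Rewrite the destination of incoming packets to the backend.
        ["iptables", "-t", "nat", "-A", "PREROUTING", "-p", proto,
         "--dport", str(listen_port), "-j", "DNAT",
         "--to-destination", f"{dest_ip}:{dest_port}"],
        # Rewrite the source so replies come back through this machine.
        ["iptables", "-t", "nat", "-A", "POSTROUTING", "-p", proto,
         "-d", dest_ip, "--dport", str(dest_port), "-j", "MASQUERADE"],
    ]


if __name__ == "__main__":
    # Example values only: forward local port 8470 to 10.0.0.2:8470.
    for cmd in nat_forward_rules(8470, "10.0.0.2", 8470):
        print(shlex.join(cmd))
```

Since the kernel rewrites packets in place, no userspace process has to hold a socket pair and buffers per connection, which is where the memory and CPU savings over a proxy come from. What you give up is everything a proxy does at the application layer (health checks, retries, header rewriting), so you'd have to build or bolt on those yourself.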