by nickpsecurity 877 days ago
It's strange, because supercomputing centers have long built out compute and storage in parallel to address this problem. Older companies like SGI put the storage directly on the high-speed, low-latency interconnect. Others build clusters with separate nodes for compute and for storage.

Companies that can train models this big should hire people with HPC experience. They'd point out the need for storage clusters with high-speed interconnects. If they lack storage capabilities, I wonder why they're doing HPC like that. They clearly need the storage.

For example, the system that BLOOM was trained on lists 100+ GB of RAM per node and petabytes of storage:

http://www.idris.fr/eng/jean-zay/cpu/jean-zay-cpu-hw-eng.htm...

1 comment

And it's gotten easier. On AWS, where the work in the paper was done, you can easily get managed Lustre or use S3, which can achieve very high bandwidth.