It looks like Llama 2 7B took 184,320 A100-80GB GPU-hours to train[1]. This one says it used a 96×H100 GPU cluster for 2 weeks, for 32,256 hours. That's 17.5% of the number of hours, but H100s are faster than A100s [2] and FP16/bfloat16 performance is ~3x better.
If they had tried to replicate Llama 2 identically with their hardware setup, it'd cost a little bit less than twice their MoE model.