I'm not an ops guy, but I know it was a constant source of trouble for our team and a big challenge to keep running smoothly. For us it took 2-3 experienced engineers something like 2 years to get a stable, smoothly running production environment at scale.
If you start from scratch today, things may well be better. You might want to look at https://github.com/overhangio/tutor; I know Régis has been hard at work making it easier to run Open edX.