|
|
|
|
|
by mohctp
3479 days ago
|
|
The RMA process is pretty much a moot point at this scale. It simply doesn't make sense to buy servers with large warranties. Save the money, stock spares, and when a component dies replace it. In the long run it will come out to cost a lot less. I'd much rather pay much less knowing a component may fail here and there and you buy a new one. As far as moving from cloud to bare metal, another thing to take into consideration (that this was replied to), if you don't architect your AWS (or other cloud solution) to take advantage of multiple geographic regions, the cloud won't benefit you. I 100% agree that there should be more than one region deployed for this service. As others have said, all it takes is 1 event and the site will be down for days to weeks to months (It may not happen often, but when it does, you go out of business). The size and complexity of this infrastructure will make it nearly impossible to reproduce on short order in a new facility. If I were the lead on this I would have either active / active sites, or an active / replica site. I would also have both local (fast restores), and off-site backups of all data. A replica site protects against site failure not data loss and point-in-time recovery. |
|
Yep, this is why scaling starts with scalable distributed design. We were moving a fairly large logging stack from NFS to S3 once, for the same reason Gitlab is trying to move to bare metal now. Moving off cloud was not an option, moving to a TCO efficient service was. NFS did not scale and there was the latency problem. I think moving to bare metal cannot help with scale as much as a good architecture can. We will see how deep the datacenter hole goes. :)