The setup took a couple of months to research, build, and make "perfect". Ongoing maintenance, however, takes little time (less than an hour a week, averaged out). Every three years, I build new servers and place them into service (2 to 3 weeks of work). I also do periodic hardware maintenance roughly every three months (typically a quarter of a day).
Thanks to the cost savings, we can also afford quite a bit of redundancy: dual PSUs, SSDs in RAID 10 on non-SAN servers, RAID-Z2/3 on SAN servers, offsite backups, complete server redundancy, spare servers ready to be slotted in (I live an hour from the colo), spare parts on hand, even multiple physical colos.
If components are selected carefully (i.e. sharing components between server roles), regular maintenance is performed, and redundancy is ensured at the component, server, and datacenter level, it's not very time-intensive or costly.
I am a software engineer by trade, but love the ins and outs of hardware/ops. As such, everything that can be automated and scripted is. I can raise/move instances in minutes, just like EC2 (I currently use XCP).
Even counting the research time, it still saves roughly $100k every 3 years.
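As a rough back-of-envelope check on those time figures, here is a small Python sketch that totals the hours over one three-year hardware cycle. The 40-hour work week and 8-hour work day are assumptions used to convert "weeks" and "1/4 of a day" into hours; they are not stated in the post. The weekly figure is treated as an upper bound of one hour, and the initial couple of months of setup is left out.

```python
# Assumed conversion factors (not from the post).
HOURS_PER_WORK_WEEK = 40
HOURS_PER_WORK_DAY = 8

CYCLE_YEARS = 3  # servers are rebuilt every three years


def cycle_hours(rebuild_weeks: float) -> float:
    """Estimated ops hours over one three-year cycle."""
    # "Less than an hour a week": use 1 hour as an upper bound.
    routine = 1.0 * 52 * CYCLE_YEARS
    # "2 to 3 weeks" to build and deploy new servers, once per cycle.
    rebuild = rebuild_weeks * HOURS_PER_WORK_WEEK
    # "1/4 of a day" of hardware maintenance, four times a year.
    quarterly = 0.25 * HOURS_PER_WORK_DAY * 4 * CYCLE_YEARS
    return routine + rebuild + quarterly


if __name__ == "__main__":
    low, high = cycle_hours(2), cycle_hours(3)
    print(f"{low:.0f} to {high:.0f} hours per {CYCLE_YEARS}-year cycle")
    print(f"{low / CYCLE_YEARS:.0f} to {high / CYCLE_YEARS:.0f} hours per year")
```

Under those assumptions the total comes to a few hundred hours per cycle, which is the number to weigh against the savings.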
These kinds of numbers scare the shit out of me. Here I am thinking a couple of Linodes may cover what I want (though I am intrigued by Uptano (https://uptano.com/), linked above).
How do I go about estimating my real needs? I mean, I hope you are running some major money-making stuff if you save $60K a year. Holy shit!
It speaks more to the outlandish expense of EC2 (for us) than to the true actual expense.
A few things:
Eliminate the middlemen. Who do the small/medium datacenters use to build their custom hosting hardware? It's likely someone like Ma Labs, where you save quite a bit over Amazon/Newegg, particularly when you buy components for many servers at a time.
Pay in advance, when it makes sense and is possible. Talk to colo operators; you can likely get a better deal if you pay for 1 or 3 years up front.
When you build the hardware yourself, you can do things no operator can do for you: tailor it exactly to your own domain's needs.
Analyze your domain's needs to define the server roles you might need (e.g. load balancer, app server, relational database, Hadoop cluster, NoSQL database, key/value stores like Redis, etc). This will lead you to commonalities in hardware/component needs. Now you can develop a few physical server types, order in bulk, and not have to keep so many spare components on hand.
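To make the "few physical server types" idea concrete, here is a minimal Python sketch that groups roles by a shared hardware profile. Every number in `ROLE_NEEDS` is a made-up placeholder for illustration, not a spec from this thread; in practice you would fill it in from your own measurements.

```python
from collections import defaultdict

# Hypothetical (cores, ram_gb, disk) needs per role. Illustrative only.
ROLE_NEEDS = {
    "load balancer": (4, 16, "ssd"),
    "app server": (8, 32, "ssd"),
    "hadoop node": (8, 32, "ssd"),
    "relational database": (8, 64, "ssd"),
    "nosql database": (8, 64, "ssd"),
    "redis": (8, 64, "ssd"),
}


def group_roles_by_profile(role_needs):
    """Map each distinct hardware profile to the roles that can share it."""
    server_types = defaultdict(list)
    for role, profile in role_needs.items():
        server_types[profile].append(role)
    return dict(server_types)


if __name__ == "__main__":
    for profile, roles in group_roles_by_profile(ROLE_NEEDS).items():
        cores, ram_gb, disk = profile
        print(f"{cores} cores / {ram_gb}GB / {disk}: {', '.join(roles)}")
```

With these placeholder numbers, six roles collapse into three builds, so spares only need to cover three parts lists instead of six.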
For us, we are able to split out our datacenters by "critical"/"non-critical" for huge cost savings. Our "critical" datacenters host traditional production-level servers: things that MUST have uptimes of four or five nines. We can get 50Mbps 95th percentile, quarter rack, 10 amp for roughly $400 a month. These are great, but you have to make the most of each U.
We do a lot of machine learning, map/reduce, and general processing. The app needs this, but because I coded for it... if the uptime is, say, 99% and not 99.9999%, it has VERY little impact on our end users (think of these as worker dynos). Now, I can have a whole rack here at the office able to handle 90% of outages without issue. The nice thing about this is, I no longer have to make the most of each U. I can now build servers completely differently than I would in a traditional datacenter. It also comes with little added expense to our normal operations (add redundant internet, networking, UPSes, and insurance). I can build a 1U, 25 ECU-equiv, 32GB, SSD-based server for ~1.1k. Fill the rack! :)
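For anyone unfamiliar with "95th percentile" bandwidth billing: the usual scheme samples throughput at regular intervals over the month, throws away the top 5% of samples, and bills on the highest one left. The Python sketch below uses the nearest-rank method; the exact sampling interval and rounding rules vary by provider, so treat this as an assumption rather than any particular colo's formula.

```python
def percentile_95(samples_mbps):
    """Nearest-rank 95th percentile of a list of throughput samples."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical month: mostly 20 Mbps, with short bursts to 200 Mbps.
    mostly_quiet = [20] * 95 + [200] * 5
    print(percentile_95(mostly_quiet), "Mbps billable")
```

The practical upshot is that short bursts above the commit are free as long as they stay within the top 5% of samples; once they spill past that, the burst rate becomes the billable rate.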
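To put the "nines" in perspective, this Python sketch converts an availability percentage into allowed downtime per year. It assumes a 365-day year and counts all downtime equally, which is a simplification.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours of downtime per year permitted at a given availability."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% uptime allows {hours:.4f} hours "
              f"({hours * 60:.2f} minutes) of downtime per year")
```

At 99% that works out to roughly 87.6 hours a year, versus under an hour at four nines, which is why work that tolerates 99% can live on much cheaper infrastructure.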
These sound like excellent pointers. Do you have such large needs because of the type of pages/apps you are serving (streaming, for instance, or heavy analytical processes in the ML), or simply because you have a helluva lot of users?