by mschoebel
3752 days ago
Thanks for the answer. I suggest not using AWS if you know that you'll need a server 24/7. Old-school hosters that offer dedicated servers are much cheaper for that use case. There are several offers here in Europe where you can get an i7-6700, 64 GB RAM, and a 1 TB SSD for less than €60/month. AWS would cost you at least 3-4x as much. You'll lose the flexibility of AWS, but save a ton of cash.
|
Isn't there more to the analysis than just comparing CPU before we can conclude it will save a lot of money?
It looks like their servers[1] use ~150TB of source data that's already hosted on AWS disks. The source x.gz archives of the Common Crawl on AWS S3 are then imported onto Elasticsearch disks that are also hosted on AWS.
Pulling ~150TB of data at network speeds of 30 megabytes/sec[2] would take about 58 days (call it ~60) to transfer from AWS to another USA datacenter like Rackspace.
(Copying data from AWS to AWS isn't instantaneous either but it won't take ~60 days. At 60 days, the next crawl archive would have been released before you finished importing the previous one!)
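For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 TB = 1,000,000 MB) and a sustained, uninterrupted 30 MB/s, which is optimistic; the function name `transfer_days` is just something I made up for illustration.

```python
def transfer_days(size_tb: float, speed_mb_per_s: float) -> float:
    """Days needed to move size_tb terabytes at a sustained speed_mb_per_s.

    Assumes decimal units (1 TB = 1,000,000 MB) and no retries or downtime.
    """
    size_mb = size_tb * 1_000_000
    seconds = size_mb / speed_mb_per_s
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # ~150TB at 30 MB/s, the figures used above
    print(round(transfer_days(150, 30), 1))  # 57.9
```

At ten times the throughput the same transfer drops to under a week, so the conclusion is very sensitive to the speed you can actually sustain.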
Questions would be:
1) What are current 2016 network speeds between cloud providers?
2) What's the cost of ~150TB of network bandwidth?
3) From those datapoints, can we derive a rough rule-of-thumb where a certain amount of data exceeds the current capabilities (speed or economics) of the internet backbone available to projects like Common Search?
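One way to frame questions 2 and 3 as a sketch: given a sustained speed and the time you have before the next crawl is released, how much data can you move at all, and what would the bandwidth bill be? The 30-day window and the per-GB price below are placeholders I picked for illustration, not real release schedules or quoted prices, and the function names are hypothetical.

```python
def max_dataset_tb(speed_mb_per_s: float, window_days: float) -> float:
    """Largest dataset (decimal TB) movable within window_days at a sustained speed."""
    return speed_mb_per_s * window_days * 86_400 / 1_000_000


def egress_cost(size_tb: float, price_per_gb: float) -> float:
    """Flat-rate bandwidth cost for size_tb terabytes (ignores tiered pricing)."""
    return size_tb * 1_000 * price_per_gb


if __name__ == "__main__":
    WINDOW_DAYS = 30      # assumption: time available before the next crawl
    PRICE_PER_GB = 0.05   # placeholder, not a real provider quote

    limit_tb = max_dataset_tb(30, WINDOW_DAYS)
    print(f"Movable in {WINDOW_DAYS} days at 30 MB/s: {limit_tb:.1f} TB")
    print(f"Placeholder cost for 150 TB: {egress_cost(150, PRICE_PER_GB):.0f}")
```

If the dataset is bigger than what `max_dataset_tb` returns for your window, you can't keep up no matter how cheap the destination server is; plug in real speeds and prices to see where that line falls.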
[1] https://about.commonsearch.org/developer/operations
[2] http://www.networkworld.com/article/2187021/cloud-computing/...