Hacker News
by panghy 2987 days ago
We (Wavefront) have been operating petabyte-scale clusters for the last 5 years with FoundationDB (we got the source code via escrow) and we are super excited to be involved in the open-sourcing of FDB. We have operated over 50 clusters on all kinds of AWS instances and I can talk about all the amazing things we have done with it.

https://www.wavefront.com/wavefront-foundationdb-open-source...

We basically replaced MySQL, ZooKeeper and HBase with a single KV store that supports transactions, watches, and scales. It's not a trivial point that you can just develop code against a single API (finally Java 8 CompletableFutures) and not have to set up a ton of dependencies when you are building on top of FDB. We are (obviously) experts at monitoring FoundationDB with Wavefront and we hope to release the metric harvesting libraries and template dashboards that we use to do so.

Almost 5 years in and we have not lost any data (but we have lost machines, connectivity, seen kernel panics, EBS failures, SSD failures, etc., your usual day in AWS =p).

"but we have lost machines, connectivity, seen kernel panics, EBS failures, SSD failures, etc., your usual day in AWS" <=== This. I wish more people realized that this is the day-to-day reality if you are in AWS at scale.
The best way I've heard it described is "complex systems run in degraded mode".

https://cdn.chrisshort.net/How-Complex-Systems-Fail.pdf

Basically, once a system is complex enough, some part of it is always broken. The software must be designed on the assumption that the system is never running flawlessly.
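
To make "designed on the assumption that something is always broken" a bit more concrete, here is a minimal sketch of one common tactic: retrying a transient failure with exponential backoff. This is only an illustration and not anything from the thread or from FoundationDB; `call_with_retries` and its parameters are hypothetical names, and treating `ConnectionError` as the transient failure is an assumption.

```python
import time


def call_with_retries(operation, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Run `operation`, retrying on ConnectionError with exponential backoff.

    Hypothetical helper for illustration only. `sleep` is injectable so the
    backoff schedule can be observed without actually waiting.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            # Out of attempts: surface the failure to the caller.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
```

The point is only that the caller treats failure as the normal case and decides up front how long to keep trying; a real system would probably also add jitter and distinguish retryable from non-retryable errors.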

No doubt, but that's pretty high overhead for many projects. Colo is actually a decent choice, but I guess that's not a popular opinion.
As I understand it, it's like that everywhere at scale, not just on AWS; it's a property of operating at scale.

Or are you saying that AWS is particularly unreliable at scale?

I seem to think that cloud providers are particularly opaque about small glitches (i.e., they aren't going to tell you that a router or switch was rebooted for maintenance if it comes back right away, and if you email support it's always the same response: "it's working right now") :)
On the network side no, it's much more crappy on AWS.
Which provider is the best, network wise?
I only have experience with AWS, on-prem, and high-quality colo like Equinix. I saw significantly fewer issues there than on AWS, possibly due to reduced complexity and having full control over the networking setup.
And FoundationDB has held our data durable through all of this.
Sounds like whoever bolts a PG-compatible SQL layer on top will have a killer product on their hands :)
Have a look at CockroachDB
Already playing with it, but FoundationDB is used for production petabyte-scale deployments, and the whole deterministic simulation thing for testing is really reassuring as far as bugs/stability go. I am guessing that with Apple's resources that approach was taken to a whole new level after the acquisition?
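
For anyone unfamiliar with the idea, here is a toy sketch of what deterministic simulation testing means in general: all randomness, including injected faults, comes from one seeded generator, so any failing run can be replayed exactly from its seed. This is not FoundationDB's actual simulator; `FlakyStore`, `run_simulation` and every parameter below are hypothetical and exist only for illustration.

```python
import random


class FlakyStore:
    """Toy in-memory store whose writes fail pseudo-randomly (hypothetical)."""

    def __init__(self, rng, failure_rate):
        self.rng = rng
        self.failure_rate = failure_rate
        self.data = {}

    def put(self, key, value):
        # Fault injection is driven by the shared seeded RNG, not real I/O.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        self.data[key] = value


def run_simulation(seed, steps=200, failure_rate=0.3):
    """Run a randomized workload against FlakyStore and return its trace."""
    rng = random.Random(seed)  # single source of randomness for the whole run
    store = FlakyStore(rng, failure_rate)
    acked = {}  # writes the workload believes succeeded
    trace = []
    for step in range(steps):
        key = rng.randrange(10)
        try:
            store.put(key, step)
            acked[key] = step
            trace.append(("ok", key, step))
        except ConnectionError:
            trace.append(("fail", key, step))
    # Invariant under test: every acknowledged write is still readable.
    assert store.data == acked, f"invariant violated, replay with seed={seed}"
    return trace
```

Running `run_simulation(42)` twice yields the same trace, which is what makes a bug found under one seed reproducible on demand.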