by bosdev 3787 days ago
There's no mention of why they don't have redundant systems in more than one datacenter. As they say, it is unavoidable to have power or connectivity disruptions in a datacenter. This is why reliable configurations have redundancy in another datacenter elsewhere in the world.

Given the dependency in question is Redis, such a solution is probably exacerbated by the fact Redis hasn't really had a decent HA solution.

This is also hidden by the fact that Redis is really reliable (in my experience at least). In my experience it usually takes an ops event (like adding more RAM to the Redis machine) to reveal where critical paths have come to lean on Redis as a crutch.

> Given the dependency in question is Redis, such a solution is probably exasperated by the fact Redis hasn't really had a decent HA solution.

Redis Sentinel[0] has been the HA solution for Redis for quite some time.

[0]http://redis.io/topics/sentinel

That page says this:

Sentinel + Redis distributed system does not guarantee that acknowledged writes are retained during failures,

I haven't heard good things about Redis Sentinel, nor am I sure of its failure modes, which is why I wouldn't describe it as decent.

I haven't kept up to date on "Sentinel 2" that was launched with 3.0 so the situation might have changed.

A lot of tools and services people use either don't have HA at all or don't have native support for true distributed HA. But that doesn't stop people from building some HA or HA-like solution. I am not sure what they use Redis for, but for caching and key-value storage they must have figured out how to invalidate data; otherwise they'd be running only a single instance of Redis. That is, they are already running "HA" within a single data center, so logically speaking it shouldn't be difficult to port that over to another data center.
I'm not familiar enough with Redis's clustering features to speak to the exact issues with what you're proposing, but generally speaking, HA is almost a completely different problem than disaster recovery (DR). Sure, the protocol is the protocol, but you wouldn't want to cluster local and remote nodes together for several reasons, primarily latency, security, and resiliency. Performance will suffer if they're clustered together and a single issue could take down nodes in both data centers, which kind of defeats the purpose.

What you really want is a completely separate cluster running in a different data center (site). It should be isolated on its own network and ideally it should have different admin rights/credentials and a different software maintenance (patching) schedule. A completely empty site isn't much use so you'll need some kind of replication scheme. Naturally, these isolating steps make site replication difficult. You might patch one site and now the replication stream is incompatible with the other site. (You can't patch both sites at the same time because the patch might take down the cluster.) Or whatever you're using to replicate the sites, which has credentials to both sites, breaks and blows everything up. You need a way to demote and promote sites and a constraint on only one site being the "master" at a time. What happens if network connectivity is lost between sites? What happens if one site is down for an extended period of time? Maybe you need a third, tie-breaking site?
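
The tie-breaker question can be made concrete with a small Python sketch. It assumes each site casts one vote for the site it believes should be master, and that promotion requires a strict majority of all sites. The function name `elect_master` and the site labels are hypothetical, not taken from any real product.

```python
from collections import Counter
from typing import Dict, Optional


def elect_master(votes: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the site to promote, or None if no strict majority agrees.

    `votes` maps each voting site to the site it wants as master
    (None means that voter is unreachable or abstains).
    """
    total_voters = len(votes)
    tally = Counter(v for v in votes.values() if v is not None)
    if not tally:
        return None
    candidate, count = tally.most_common(1)[0]
    # Require a strict majority of ALL voters, not just those that
    # responded, so an isolated site can never promote itself.
    return candidate if count > total_voters // 2 else None


# Two sites only: a network split leaves each side voting for itself.
two_sites = {"east": "east", "west": "west"}
# Assumed third, tie-breaking site ("witness") that can still reach "west".
three_sites = {"east": "east", "west": "west", "witness": "west"}

print(elect_master(two_sites))    # None
print(elect_master(three_sites))  # west
```

With only two sites a split can never produce a majority, which is why the third site matters. The sketch ignores everything else raised above (credentials, patch schedules, replication compatibility).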

Once you work through these issues, you are still exposed to user error. Your replication scheme might be perfect... perfect enough that an inadvertently dropped table (or whatever) is instantly replicated to the other site and is now unrecoverable without going to tape. Maybe you introduce a delay in the replication to catch these oopsies, but now your RPO is affected. Anyway, it's a bit of a shell game of compromises and margins of error.

Source: 10 years designing and building HA/DR solutions for Discover Card.

I was also wondering what they are using Redis for; found this article [1] from a while ago discussing Redis at Github; presumably the architecture has moved on a bit since then, but this may shed a bit of light on the subject.

[1]: https://github.com/blog/530-how-we-made-github-fast

Hello, Sentinel 2 is working quite well for many users; here, for example, is Flickr's report: http://code.flickr.net/2014/07/31/redis-sentinel-at-flickr/

Of course Sentinel does not make Redis conceptually different from what it is from the point of view of consistency guarantees during failures. It makes a best-effort attempt to select the best slave to retain writes, but under certain failure modes it's possible to lose writes during a failover.

This is common with many failover solutions of *SQL systems as well btw. It depends on your use case if this is an affordable risk or not. For most Redis use cases, usually the risk of losing some writes after certain failovers is not a big issue. For other use cases it is, and a store that retains the writes during all the failure scenarios should be used.
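
As a rough illustration of how an acknowledged write can be lost in any asynchronously replicated primary/replica setup, here is a toy Python model. It is not Redis or Sentinel code; the class names and the explicit `replicate()` step are assumptions made purely for the sketch.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}


class AsyncReplicatedStore:
    """Toy primary/replica pair with asynchronous replication."""

    def __init__(self):
        self.primary = Node("primary")
        self.replica = Node("replica")
        self.pending = []  # writes acknowledged but not yet replicated

    def write(self, key, value):
        self.primary.data[key] = value
        self.pending.append((key, value))
        return "OK"  # acknowledged before the replica has the write

    def replicate(self):
        for key, value in self.pending:
            self.replica.data[key] = value
        self.pending.clear()

    def failover(self):
        """Primary dies; promote the replica. Pending writes are lost."""
        lost = [key for key, _ in self.pending]
        self.pending.clear()
        self.primary, self.replica = self.replica, Node("new-replica")
        return lost


store = AsyncReplicatedStore()
store.write("a", 1)
store.replicate()          # "a" reaches the replica
store.write("b", 2)        # acknowledged, but not replicated yet
lost = store.failover()    # primary fails before the next replicate()

print(lost)                        # ['b']
print(sorted(store.primary.data))  # ['a']
```

The client was told "OK" for both writes, yet only "a" survives the promotion. Whether that window is acceptable is the use-case question raised above.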

> Given the dependency in question is Redis, such a solution is probably exasperated by the fact Redis hasn't really had a decent HA solution.

You can replicate to read-only instances in a secondary DC and failover. It hurts but it is better than an outage imo.

>Redis is really reliable (in my experience at least)

Redis has been demonstrated[0][1] to lose data under network partitions. This is particularly concerning when discussing the type of partial failure that GitHub reported.

0: https://aphyr.com/posts/283-jepsen-redis

1: https://aphyr.com/posts/307-jepsen-redis-redux

I meant Redis is really reliable as a single instance. If you reread my post I mentioned that Redis doesn't have a decent HA solution.
Not sure how your comment refutes the contention of reliability. Seems to me to be more a condemnation of failures that do happen (which is of course worthy of concern, but irrelevant in a conversation about stability).
I am using "reliability" in the sense of RAS[0]. An HA datastore which erroneously ACKs writes has lowered reliability, as there are known cases where it gives incorrect outputs.

0: https://en.wikipedia.org/wiki/Reliability,_availability_and_...

I see. Yeah, it's more concerning that there are errors not tied to an "event". Thanks for clarifying; sorry for any confusion.
> There's no mention of why they don't have redundant systems in more than one datacenter

sometimes reading comments on hn makes me laugh out loud.

there's only one reason to not do this, and that's cost. what do you expect them to say about that? i mean really, you think they're going to put that in a blog post:

"Well, the reason we don't have an entire replica of our entire installation is because it costs way too much. In fact, more than double! And so far our uptime is actually 99.99% so there's no way it's worth it! You can forget about that spend! Sorry bros."

This is not only obviously true, I think it is also a completely reasonable calculus. They just proved that if the entire Redis cluster goes down they can get it back in 2.5 hours. It's almost certainly a caching layer, so there is no permanent data loss. If they fix the application bootstrap dependency on a Redis connection, and they add monitoring to more easily see in the future when the Redis cluster is the problem, next time that time period will probably be way shorter.
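
One way to picture removing a bootstrap dependency on a cache connection is a client that connects lazily and degrades to a default on failure. The Python below is a minimal sketch under that assumption; `LazyCache`, `unreachable_cache`, and the key name are hypothetical and say nothing about how GitHub's application is actually built.

```python
class LazyCache:
    """Wraps a cache client that is connected on first use, not at startup."""

    def __init__(self, connect):
        self._connect = connect  # callable returning a dict-like client
        self._client = None

    def _get_client(self):
        if self._client is None:
            self._client = self._connect()  # may raise ConnectionError
        return self._client

    def get(self, key, default=None):
        try:
            return self._get_client().get(key, default)
        except ConnectionError:
            self._client = None  # retry the connection on the next call
            return default


def unreachable_cache():
    raise ConnectionError("cache cluster is down")


# Application start-up succeeds even though the cache is unreachable,
# because nothing connects until the first lookup.
cache = LazyCache(unreachable_cache)
value = cache.get("repo:42:stars", default="cache-miss")
print(value)  # cache-miss
```

A real service would also want timeouts and some signal to monitoring when the fallback path is taken, which ties in with the monitoring point above.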

So, a very small risk of an hour or so of downtime sometime in the future which will not cause data loss, or tens of thousands of dollars a month for a failover cluster? I wouldn't replicate it either.
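
For a sense of scale, the figures in this thread can be turned into yearly numbers. This Python sketch takes them at face value: the 99.99% is the figure from the hypothetical quote above, not a reported one, and 150 minutes is the 2.5 hours mentioned for this outage.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(availability_pct):
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def availability_pct(downtime_minutes):
    """Yearly availability if total downtime is `downtime_minutes`."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


print(round(allowed_downtime_minutes(99.99), 1))  # 52.6
print(round(availability_pct(150), 3))            # 99.971
```

So a single 2.5-hour outage in a year works out to roughly 99.97% for that year, which is the number to weigh against the monthly cost of a failover cluster.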

>It's almost certainly a caching layer, so there is no permanent data loss.

People who use Redis rarely end up using it solely as a caching layer. It often also takes on the role of an RPC facilitator and pseudo-database. GitHub's post also mentions that their engineering team had to replicate Redis' dataset before they could get the alternative hardware running, which implies that they do need some data in there before the site is operational.

Personally, one of my pet peeves is people throwing mission-critical data in Redis and acting like it's hunky-dory. It happens all the time, and it seems really difficult to get people to stop. There's a reason we have a real ACID-compliant database storing non-disposable data; it's ridiculous to ignore that just because it's easier to stuff it in Redis.

I think it's reasonable to have a dependency on a Redis server, but I don't think it's reasonable to depend on any data in particular being stored in that server. It should be used as a caching/acceleration layer for data that can be easily and automatically regenerated.
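
The "cache that can be regenerated" idea is usually called cache-aside. Here is a minimal Python sketch of it, assuming a plain dict stands in for the Redis client and another dict stands in for the authoritative database; `CacheAside` and the key names are hypothetical.

```python
class CacheAside:
    """Read-through helper: the cache is disposable, the loader is the truth."""

    def __init__(self, cache, loader):
        self.cache = cache      # dict-like; stands in for a Redis client
        self.loader = loader    # callable that rebuilds a value from the database

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.loader(key)   # regenerate from the source of truth
        self.cache[key] = value
        return value


database = {"user:1": "alice", "user:2": "bob"}   # authoritative store
users = CacheAside(cache={}, loader=database.__getitem__)

print(users.get("user:1"))  # alice (loaded from the database, then cached)
users.cache.clear()         # simulate losing the whole cache
print(users.get("user:1"))  # alice (regenerated automatically)
```

Nothing is lost when the cache is wiped because every value has a loader behind it. Data that has no such loader is exactly the kind the comment above argues should not live only in the cache.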

Just a thought on something I've learned over a few years. Sometimes, the most correct way isn't necessarily the best. An example here might be that the Redis DB is being used to store data which is constantly being read. While keeping it in a MySQL instance might be the most correct method, the end result might actually be slower. This is just my naive guess, but the point is that sometimes, given a particular context, the value of taking a hacky/less correct solution becomes great enough to use it.
It's solely about the effort; it's a lot easier to just say redis.set('some_random_name', value) than it is to figure out where something should go in the schema of an RDBMS. If the data needs to persist, it needs to be written to a database that provides good guarantees about data integrity. If someone wants to load the results of a query into Redis, more power to them, but I've come across a lot of people who just stuff things in memory-backed K-V stores with the apparent expectation that nothing could ever happen to that data. Developers have told me "Well, Redis writes to disk on shutdown, right?" and acted like that was good enough for permanent storage of mission-critical data.

I have no fundamental opposition to K-V stores or NoSQL databases, but I do think most developers favor them because it's easier to stuff them with data up front. There are big tradeoffs down the road, though, which companies don't seem to understand well, and which they aren't really equipped to handle.

Unfortunately, I don't know enough about how people use/abuse Redis-like storage mechanisms. But that bit about NoSQL being chosen for upfront convenience is bam, spot on. The biggest reason people have given me when I ask them why they want Mongo is "easier to add columns".
Maybe they do feel it's a reasonable business decision. In that case they shouldn't be surprised if a lot of their users make the equally reasonable business decision to reduce their exposure to Github.

A lot of people have started depending on github for more than just stashing source code some place centrally accessible as they're working on it. If github takes a lax attitude toward uptime then I suspect people will start looking for alternatives.

It's shocking that they don't at least have a read replica of their system in another 'AZ'. That's cloud hosting 101, and being self-hosted isn't an excuse to skimp on this.

If an outage caused 2 hours of read-only access to repos it would still be moderately impactful, but at least we could still build our Go code.

For people reading this, AZ in this context would be Availability Zone
For people reading this, 'Availability Zone' in this context would be (AWS speak for) 'datacentre'. :)
Right, and not the Grand Canyon State.

The space of acronyms/abbreviations is quite cluttered.

Heh, not from the US initially so that overloading did not occur to me :)

That's an interesting alternate reading though...

"To ensure the integrity of our data, we need to locate another Arizona, since this one is serving us so well."

So your building process depends on the availability of an external company?
Seriously. I'm kind of surprised about this.
Yeah, they gloss over it but at its heart, keeping mission-critical servers in a single datacenter with no redundancy is among the most common and amateur infrastructure failures. Many would expect a company like GitHub to have anticipated and prevented it. GitHub should have a process to ensure that all services are redundant before they get pushed to production.