by nemothekid 3787 days ago
Given the dependency in question is Redis, such a solution is probably exacerbated by the fact Redis hasn't really had a decent HA solution.

This is also hidden by the fact that Redis is really reliable (in my experience at least). It usually takes an ops event (like adding more RAM to the Redis machine) to realize where a crutch on Redis has developed in critical paths.
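To illustrate what removing such a crutch can look like, here is a minimal sketch in which a cache outage is treated as a cache miss, so the critical path slows down instead of failing. Everything here is hypothetical: `CacheUnavailable`, `get_with_fallback`, `broken_cache`, and `database` are stand-ins, not part of any Redis client.

```python
from typing import Callable, Optional


class CacheUnavailable(Exception):
    """Raised by the hypothetical cache client when Redis cannot be reached."""


def get_with_fallback(
    key: str,
    cache_get: Callable[[str], Optional[str]],
    load_from_source: Callable[[str], str],
) -> str:
    """Return the cached value, or fall back to the source of truth.

    A cache outage is handled like a cache miss, so the request path
    degrades instead of failing outright.
    """
    try:
        cached = cache_get(key)
    except CacheUnavailable:
        cached = None  # degrade: behave as if the key were not cached
    if cached is not None:
        return cached
    return load_from_source(key)


def broken_cache(key: str) -> Optional[str]:
    # Simulates the cache host being down, e.g. during a RAM upgrade.
    raise CacheUnavailable("cache host is down")


def database(key: str) -> str:
    # Stand-in for the slower, authoritative store.
    return f"value-for-{key}"


result = get_with_fallback("repo:42", broken_cache, database)
```

Whether falling back is safe depends on whether the source of truth can absorb the extra load while the cache is away.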

5 comments

> Given the dependency in question is Redis, such a solution is probably exacerbated by the fact Redis hasn't really had a decent HA solution.

Redis Sentinel[0] has been the HA solution for Redis for quite some time.

[0]http://redis.io/topics/sentinel

That page says this:

Sentinel + Redis distributed system does not guarantee that acknowledged writes are retained during failures,

I haven't heard good things about Redis Sentinel, nor am I sure of its failure modes, which is why I wouldn't describe it as decent.

I haven't kept up to date on "Sentinel 2" that was launched with 3.0 so the situation might have changed.

A lot of the tools and services people use either don't have HA at all or don't have native support for true distributed HA. But that can't stop people from building some HA or HA-like solution. I am not sure what they use Redis for, but if it is along the lines of caching and a key-value store, they must have figured out how to invalidate data; otherwise they'd be running only a single instance of Redis. That is, they are running "HA" within a single data center, so logically speaking it shouldn't be difficult to port over to another data center.

I'm not familiar enough with Redis's clustering features to speak to the exact issues with what you're proposing, but generally speaking, HA is almost a completely different problem from disaster recovery (DR). Sure, the protocol is the protocol, but you wouldn't want to cluster local and remote nodes together for several reasons, primarily latency, security, and resiliency. Performance will suffer if they're clustered together, and a single issue could take down nodes in both data centers, which kind of defeats the purpose.

What you really want is a completely separate cluster running in a different data center (site). It should be isolated on its own network and ideally it should have different admin rights/credentials and a different software maintenance (patching) schedule. A completely empty site isn't much use so you'll need some kind of replication scheme. Naturally, these isolating steps make site replication difficult. You might patch one site and now the replication stream is incompatible with the other site. (You can't patch both sites at the same time because the patch might take down the cluster.) Or whatever you're using to replicate the sites, which has credentials to both sites, breaks and blows everything up. You need a way to demote and promote sites and a constraint on only one site being the "master" at a time. What happens if network connectivity is lost between sites? What happens if one site is down for an extended period of time? Maybe you need a third, tie-breaking site?
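The "only one master site, with a tie-breaker" constraint can be sketched as a simple majority vote. This is a toy illustration under my own assumptions: the site names and the `elect_active_site` function are hypothetical, and a real system also needs fencing, timeouts, and a way to demote the old master.

```python
from typing import Dict, Optional


def elect_active_site(votes: Dict[str, Optional[str]]) -> Optional[str]:
    """Pick the active site from witness votes, or None without a majority.

    `votes` maps each witness (one per site) to the site it believes should
    be active, or None if that witness is unreachable.
    """
    majority = len(votes) // 2 + 1
    tally: Dict[str, int] = {}
    for choice in votes.values():
        if choice is not None:
            tally[choice] = tally.get(choice, 0) + 1
    for site, count in tally.items():
        if count >= majority:
            return site
    return None  # no quorum: nobody is promoted, avoiding split brain


# Two sites that lost their link each vote for themselves: no winner.
two_sites = elect_active_site({"east": "east", "west": "west"})

# A third, tie-breaking site that can still reach "east" settles it.
three_sites = elect_active_site(
    {"east": "east", "west": "west", "tiebreak": "east"}
)
```

With two sites the vote deadlocks whenever the link drops, which is the argument for the third site.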

Once you work through these issues, you are still exposed to user error. Your replication scheme might be perfect... perfect enough that an inadvertently dropped table (or whatever) is instantly replicated to the other site and is now unrecoverable without going to tape. Maybe you introduce a delay in the replication to catch these oopsies, but now your RPO is affected. Anyway, it's a bit of a shell game of compromises and margins of error.
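The delay-versus-RPO trade-off can be shown with a toy model. The timestamps, operations, and `replica_state` helper below are made up for illustration, and the replica is assumed to apply everything older than the configured delay.

```python
from typing import List, Tuple

Op = Tuple[float, str]  # (timestamp in seconds, operation)


def replica_state(log: List[Op], now: float, delay: float) -> List[str]:
    """Operations a delayed replica has applied at time `now`."""
    return [op for ts, op in log if ts <= now - delay]


log: List[Op] = [
    (0.0, "INSERT a"),
    (10.0, "INSERT b"),
    (50.0, "INSERT c"),
    (60.0, "DROP TABLE"),  # the accidental, destructive operation
]

# No delay: the mistake is already on the replica when noticed at t=65.
instant = replica_state(log, now=65.0, delay=0.0)

# 30-second delay: the replica is still clean at t=65, but it is also
# missing "INSERT c", a legitimate write from inside the delay window.
delayed = replica_state(log, now=65.0, delay=30.0)
```

The same window that protects against the dropped table is the amount of good data lost if the primary site disappears.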

Source: 10 years designing and building HA/DR solutions for Discover Card.

I was also wondering what they are using Redis for; found this article [1] from a while ago discussing Redis at Github; presumably the architecture has moved on a bit since then, but this may shed a bit of light on the subject.

[1]: https://github.com/blog/530-how-we-made-github-fast

Hello, Sentinel 2 is working quite well for many users; here, for example, is Flickr's report: http://code.flickr.net/2014/07/31/redis-sentinel-at-flickr/

Of course, Sentinel does not make Redis conceptually different from what it is from the point of view of consistency guarantees during failures. It makes a best-effort attempt to select the best slave to retain writes, but under certain failure modes it's possible to lose writes during a failover.

This is common with many failover solutions of *SQL systems as well btw. It depends on your use case if this is an affordable risk or not. For most Redis use cases, usually the risk of losing some writes after certain failovers is not a big issue. For other use cases it is, and a store that retains the writes during all the failure scenarios should be used.
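A toy model of how an acknowledged write can be lost with asynchronous replication and best-effort promotion. This is not Sentinel's actual algorithm: the `Node` class and the helper functions are hypothetical, and the real selection logic is more involved than picking the longest log.

```python
from typing import List


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.log: List[str] = []


def write(master: Node, value: str) -> str:
    """Append to the master and acknowledge before any replica has the write."""
    master.log.append(value)
    return "OK"


def replicate(master: Node, replica: Node) -> None:
    """Asynchronous catch-up: copy whatever the replica is missing."""
    replica.log.extend(master.log[len(replica.log):])


def failover(replicas: List[Node]) -> Node:
    """Best-effort promotion: pick the replica with the most data."""
    return max(replicas, key=lambda r: len(r.log))


master = Node("master")
r1, r2 = Node("r1"), Node("r2")

acked: List[str] = []
for value in ("a", "b"):
    if write(master, value) == "OK":
        acked.append(value)
replicate(master, r1)  # r1 now holds a and b; r2 is lagging with nothing

if write(master, "c") == "OK":
    acked.append("c")  # acknowledged to the client, not yet replicated

# The master becomes unreachable here, before "c" reaches any replica.
new_master = failover([r1, r2])
lost = [v for v in acked if v not in new_master.log]
```

Promotion picks the most up-to-date replica, yet "c" is still gone, because no replica ever received it.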

> Given the dependency in question is Redis, such a solution is probably exacerbated by the fact Redis hasn't really had a decent HA solution.

You can replicate to read-only instances in a secondary DC and fail over. It hurts, but it is better than an outage imo.

>Redis is really reliable (in my experience at least)

Redis has been demonstrated[0][1] to lose data under network partitions. This is particularly concerning when discussing the type of partial failure that GitHub reported.

0: https://aphyr.com/posts/283-jepsen-redis

1: https://aphyr.com/posts/307-jepsen-redis-redux

I meant Redis is really reliable as a single instance. If you reread my post, I mentioned that Redis doesn't have a decent HA solution.

Not sure how your comment refutes the contention of reliability. It seems to me to be more a condemnation of the failures that do happen (which is of course worthy of concern, but irrelevant in a conversation about stability).

I am using "reliability" in the sense of RAS[0]. An HA datastore which erroneously ACKs writes has lowered reliability, as there are known cases where it gives incorrect outputs.

0: https://en.wikipedia.org/wiki/Reliability,_availability_and_...

I see. Yeah, it's more concerning that there are errors not triggered by an "event". Thanks for clarifying; sorry for any confusion.