by bofaGuy 387 days ago
I run Redis across dozens of applications. So when Valkey became available for a discounted price on AWS I was excited. We finally got around to trying it out about 2 months ago and all was going well. No noticeable difference in performance. Until Valkey just died. It died in such a way that AWS still thought it was running happily but it was completely offline. It took 12+ hours for it to come up again and then it happened again... AWS researched the issue for 2 weeks and couldn't figure it out. It will be a long time before we attempt to use Valkey for anything critical in the future. We have since replaced that Valkey with Redis under the same workload and have had no issues.
4 comments

Probably an AWS issue. Our production RDS Postgres cluster did that a few months back. Just stopped responding on the network. Health checks were fine. AWS support was mostly useless and couldn't work it out in an hour, despite us having their top-tier enterprise support, so with customers down we had to create a whole new cluster and do a restore from backup, which took 4 hours.

RDS is now gone. It's on a couple of EC2 instances with replication, hourly EBS snapshots and daily shipping to S3.

I've been loath to use AWS's "encapsulated" services for anything since.

I think these are isolated incidents though. We’ve run several tens of RDS clusters for 6 years running, and nothing has ever gone wrong. Maybe the ap-northeast-1 region is well maintained?
Could that be an AWS operational issue, and not related to Valkey?

I only run redis myself but wouldn't immediately place blame on Valkey if that happened.

Yeah I don't understand how something could be "completely offline" and still have health checks passing.
"completely offline" also doesn't sound like a problem with a software project. At best it's a particular managed service experiencing downtime. Would Linux be to blame if my power supply goes up in smoke?
It’s a bit confusing to me exactly what went wrong. I think that when you have a redis/valkey cluster with multiple nodes and you use the cluster uri, there must be some kind of load balancer or custom routing. When we would attempt to connect to valkey the connection would look good, but when we would submit commands to it they would never execute. We had written our application so that it would operate with no issue (just slower) if the cache goes down. In this case, connections looked good but no work was actually being done. AWS support suggested we restart the nodes but because they were not responding they never shut down … or at least it took a really long time. They were never able to tell us what actually happened. My guess is that valkey command execution got stuck somehow but was still able to create new connections.
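For anyone wanting to catch this failure mode (connections succeed but commands never execute), here is a minimal sketch of a probe that only reports healthy if the server actually answers a command within a deadline. It is not what AWS or the poster ran; the function name `command_probe` and the timeout value are made up for illustration, and it assumes an unauthenticated, non-TLS endpoint that speaks the Redis protocol (RESP).

```python
import socket


def command_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True only if the server answers PING with PONG in time.

    Hypothetical helper: a plain TCP connect can succeed while command
    execution is stuck, so this waits for an actual reply.
    """
    try:
        # The connect and the reply each get up to `timeout` seconds.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(b"*1\r\n$4\r\nPING\r\n")  # RESP-encoded PING
            reply = sock.recv(64)
            # An auth-protected server answers with an error instead of
            # +PONG, so it is reported as unhealthy by this sketch.
            return reply.startswith(b"+PONG")
    except OSError:
        # Covers refused connections, resets, and timeouts.
        return False
```

An application that already degrades gracefully when the cache is down could treat a `False` result the same way as a refused connection and skip the cache, rather than waiting on commands that never return.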
“Completely offline” and passing health checks don’t typically go together…
Can’t be reached outside the network that the instance and health check are running on? Maybe available in one AZ, but not on the one that’s trying to connect.
Why don't you just run a new instance with your own Valkey?
Because when you’re in production with many users, it’s not worth the risk when you’ve already been burned, especially when the downside is a small discount.
The AWS managed cache offerings are not just a small premium, they're like 10x more expensive than the EC2 instance types they represent.
It's more, but I doubt it's 10x or even close to that.
It's not even 2x. I spot checked 2 instance types and they were 36% and 69% more.
What instance types were you using, just for reference?