by photonios 1159 days ago
If there's anyone reading this that is planning on deploying Keycloak in a high availability environment, I would highly recommend that you persist all sessions in the database as offline sessions.

At work, I ran 9 Keycloak clusters in production, handling tens of millions of sessions where the cost of losing sessions was high. The amount of time we wasted on getting it to work reliably with its default configuration of storing the sessions in its distributed, in-memory cache (Infinispan) is insane. It just isn't designed to handle such a workload reliably. Unless you're willing to spend months tuning it for every possible scenario, you WILL lose sessions.

If you are in this situation, shoot me an email. I have been through this pain and it took a lot of painstaking work to get to a highly reliable setup at scale.

4 comments

Hi, you might want to take a look at the new storage in Keycloak [1].

Newer Keycloak versions (19 and up) have configurable storage for the auth sessions (see storage-area-auth-session and storage-area-user-session). I haven't checked them, but the documentation is promising.

For older versions (the last one I checked was Keycloak 15) you might want to use offline sessions, but they don't allow SSO after the auth session has been evicted from Infinispan.

1 - https://www.keycloak.org/2022/07/storage-map.html

> I would highly recommend that you persist all sessions in the database as offline sessions.

Please! Post it, thanks

Was it something about Infinispan, or Keycloak?

We were wondering about Redis in a similar IAM use-case (PingFederate) but it wasn't officially supported, so we decided to just go with persistent Postgres. I wonder if we saved ourselves a bunch of heartache.

We often experienced cascading failures, especially during rolling restarts. A node would start shutting down and Infinispan would start trying to rebalance. Due to the large volume of sessions, other nodes would become unresponsive and stop replying. Eventually, you'd end up in a situation where the node would give up trying to shut down cleanly and just kill itself. That wouldn't be a big deal if you weren't doing a rolling restart. When the first node doesn't shut down cleanly, the data should be "safe" since it is replicated to at least N owners. In practice, the other nodes also get restarted, also shut down uncleanly, and sessions are lost. Secondly, as the cluster became unresponsive, requests to refresh sessions would start to time out, which would also cause those sessions to be "lost" since they would eventually hit the maximum idle time.
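
To make the failure mode above concrete, here is a toy simulation in Python. It is only a sketch under stated assumptions and not how Infinispan actually distributes or rebalances data: the names `ring_owners` and `rolling_restart` are hypothetical, owners are picked as consecutive nodes on a ring, every shutdown is unclean, and `rebalance_completes` says whether surviving copies get re-replicated before the next node goes down.

```python
def ring_owners(session_id: int, nodes: list[str], num_owners: int) -> set[str]:
    """Assumed placement: num_owners consecutive nodes on a ring."""
    start = session_id % len(nodes)
    return {nodes[(start + i) % len(nodes)] for i in range(num_owners)}


def rolling_restart(
    nodes: list[str],
    restart_order: list[str],
    num_sessions: int,
    num_owners: int,
    rebalance_completes: bool,
) -> int:
    """Return how many sessions have no in-memory copy left after the restarts."""
    copies = {
        sid: ring_owners(sid, nodes, num_owners) for sid in range(num_sessions)
    }
    for node in restart_order:
        # Unclean shutdown: everything the node held in memory is gone.
        for holders in copies.values():
            holders.discard(node)
        if rebalance_completes:
            # The node has rejoined (empty) and surviving copies are
            # re-replicated until each session has num_owners holders again.
            for holders in copies.values():
                if not holders:
                    continue  # nothing left to replicate from
                for candidate in nodes:
                    if len(holders) >= num_owners:
                        break
                    holders.add(candidate)
    return sum(1 for holders in copies.values() if not holders)


nodes = ["n0", "n1", "n2", "n3"]
print(rolling_restart(nodes, nodes, 8, 2, rebalance_completes=False))  # 8
print(rolling_restart(nodes, nodes, 8, 2, rebalance_completes=True))   # 0
```

In this model, a full rolling restart with 2 owners loses every session when rebalancing never finishes between restarts, and none when it does. Restarting only `n0` and `n1` without rebalancing loses just the sessions whose two owners were exactly those nodes.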

As long as we didn't do any restarts, it would sort of work. Problems would pop up when, due to high load, one or more nodes became unresponsive and liveness probes restarted them. That would often cause the kind of cascading failure described above.
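
The idle-timeout effect mentioned earlier (refreshes timing out on an unresponsive cluster until the session hits its maximum idle time) can be sketched in a few lines of Python. This is an illustration only: `session_survives` is a hypothetical helper, and the timestamps and the 1800-second idle limit are illustrative values, not taken from the setup described here.

```python
def session_survives(
    refresh_attempts: list[tuple[float, bool]],
    max_idle_seconds: float,
    end_time: float,
) -> bool:
    """Check whether a session created at t=0 is still alive at end_time.

    refresh_attempts holds (timestamp, succeeded) pairs. Only a successful
    refresh resets the idle clock; a timed-out refresh changes nothing.
    """
    last_seen = 0.0
    for timestamp, succeeded in sorted(refresh_attempts):
        if timestamp - last_seen > max_idle_seconds:
            return False  # session already idled out before this attempt
        if succeeded:
            last_seen = timestamp
    return end_time - last_seen <= max_idle_seconds


# The client refreshes every 600 seconds in both cases.
healthy = [(600, True), (1200, True), (1800, True), (2400, True), (3000, True)]
# Same client, but the cluster is unresponsive from t=1200 to t=2400.
degraded = [(600, True), (1200, False), (1800, False), (2400, False), (3000, True)]

print(session_survives(healthy, max_idle_seconds=1800, end_time=3000))
print(session_survives(degraded, max_idle_seconds=1800, end_time=3000))
```

The client behaves identically in both runs. The only difference is that the cluster failed to answer for a stretch longer than the idle limit allows.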

Most of these problems are also the result of running it in Kubernetes. We very quickly learned to remove the liveness probes and to massively increase the grace period. This helped, but only so much. We still had rather frequent failures similar to the one I just described.

Maybe if we hadn't run it in Kubernetes and had been more knowledgeable about Infinispan, we could've gotten a stable setup. As a small team without that specialized knowledge, we struggled to get there.

Ah, the infinite fun of managing distributed systems. I've seen similar failure modes in pretty much anything distributed. While in a single-node system a spike of traffic just makes it run slowly, cascading failures caused by latency plague most distributed ones.

Whether it's process management or just, say, a node having too little memory and spending too much time in GC.

Mixing app and DB (which I guess is happening here) can also be fun, as now an overloaded app can cause an overloaded DB. You'd probably be just fine if Infinispan was used as a remote database instead of an embedded one.

Do you have a blog post or something detailing what you did and how you did it?

I found this: https://www.janua.fr/offline-sessions-and-offline-tokens-wit.... janua.fr is a very solid Keycloak resource. The write-up is for a pretty aged Keycloak version, but there are probably some decent pointers in there.

This article gets pretty close, but it misses a very critical piece. If you're running Keycloak 16 or older, you'll explicitly want to enable lazy offline session loading [0]. Otherwise, Keycloak will attempt to load ALL offline sessions into memory during startup.

Keycloak 17 made offline sessions lazily loaded by default.

[0] https://www.keycloak.org/docs/16.1/server_admin/#offline-ses...
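
The difference between preloading and lazy loading can be illustrated with a small Python sketch. None of this is Keycloak code: `SessionDatabase`, `EagerSessionCache` and `LazySessionCache` are hypothetical stand-ins, and the "database" is just a dict that counts reads, assuming one read per session row.

```python
class SessionDatabase:
    """Stand-in for the persistent store; counts how many rows were read."""

    def __init__(self, rows: dict):
        self._rows = dict(rows)
        self.reads = 0

    def load(self, session_id):
        self.reads += 1
        return self._rows.get(session_id)

    def all_ids(self) -> list:
        return list(self._rows)


class EagerSessionCache:
    """Preloading: reads every stored session at startup."""

    def __init__(self, db: SessionDatabase):
        self._cache = {sid: db.load(sid) for sid in db.all_ids()}

    def get(self, session_id):
        return self._cache.get(session_id)


class LazySessionCache:
    """Lazy loading: reads a session the first time it is requested."""

    def __init__(self, db: SessionDatabase):
        self._db = db
        self._cache = {}

    def get(self, session_id):
        if session_id not in self._cache:
            session = self._db.load(session_id)
            if session is None:
                return None  # unknown session, nothing to cache
            self._cache[session_id] = session
        return self._cache[session_id]


rows = {i: {"user": f"user-{i}"} for i in range(1000)}

eager_db = SessionDatabase(rows)
eager = EagerSessionCache(eager_db)
print("reads at startup (eager):", eager_db.reads)

lazy_db = SessionDatabase(rows)
lazy = LazySessionCache(lazy_db)
print("reads at startup (lazy):", lazy_db.reads)
lazy.get(42)
print("reads after one lookup (lazy):", lazy_db.reads)
```

With tens of millions of sessions, the eager variant's startup cost grows with the size of the table, while the lazy variant pays only for the sessions that are actually requested.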