Have you had any issues with reliability or downtime in running it yourself? Or any use cases it hasn't supported? Interesting to hear you're running it with so many users!
We did some load testing and found that the database was the first thing to become a bottleneck, at around 4k concurrent logins. I suspect we'd have to introduce proper Postgres connection pooling to overcome that (we were running a single Postgres instance).
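To illustrate what pooling would buy here, a rough Python sketch: a fixed set of connections is shared across many concurrent logins, so the database never sees more than `pool_size` sessions at once. `FakeConnection`, `ConnectionPool`, `login`, and `simulate_logins` are made-up names for illustration only, and the numbers are not from our load test. In practice you'd more likely use an external pooler such as PgBouncer or the datasource pool settings than roll your own.

```python
import queue
import threading
import time


class FakeConnection:
    """Stand-in for a real Postgres connection (hypothetical, for illustration)."""

    def query(self, sql):
        time.sleep(0.001)  # pretend the database does some work
        return "ok"


class ConnectionPool:
    """Hands out at most `size` connections; extra callers wait their turn."""

    def __init__(self, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(FakeConnection())
        self._lock = threading.Lock()
        self.in_use = 0
        self.peak_in_use = 0

    def acquire(self):
        conn = self._idle.get()  # blocks while every connection is busy
        with self._lock:
            self.in_use += 1
            self.peak_in_use = max(self.peak_in_use, self.in_use)
        return conn

    def release(self, conn):
        with self._lock:
            self.in_use -= 1
        self._idle.put(conn)


def login(pool, results):
    """One simulated login: borrow a connection, run a query, give it back."""
    conn = pool.acquire()
    try:
        results.append(conn.query("SELECT 1"))
    finally:
        pool.release(conn)


def simulate_logins(concurrent_logins, pool_size):
    """Run the logins in parallel threads; return (completed, peak connections)."""
    pool = ConnectionPool(pool_size)
    results = []
    threads = [
        threading.Thread(target=login, args=(pool, results))
        for _ in range(concurrent_logins)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(results), pool.peak_in_use
```

Calling `simulate_logins(200, 10)` completes all 200 logins while the peak number of connections in use stays at or below 10. The trade-off is that logins queue for a connection instead of each opening their own.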
One thing to keep in mind as well is that if you create an SPI extension to cover a specific use case, you'll have to add your own metrics collection. It was a bit of overhead to configure in Prometheus, since you end up with a separate metrics endpoint to scrape for each SPI.
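To give an idea of that overhead, here is a small Python sketch that renders a Prometheus `scrape_configs` section with one job per SPI endpoint. The function name, SPI names, host, and metrics paths are all hypothetical placeholders, not what our setup or Keycloak actually exposes. Only the config keys (`job_name`, `scrape_interval`, `metrics_path`, `static_configs`, `targets`) are standard Prometheus settings.

```python
def scrape_config(spi_endpoints, scrape_interval="15s"):
    """Render a Prometheus `scrape_configs` section, one job per SPI endpoint.

    `spi_endpoints` maps a job name to a ("host:port", "/metrics/path") pair.
    """
    lines = ["scrape_configs:"]
    for job_name, (target, path) in sorted(spi_endpoints.items()):
        lines += [
            f"  - job_name: {job_name}",
            f"    scrape_interval: {scrape_interval}",
            f"    metrics_path: {path}",
            "    static_configs:",
            f"      - targets: ['{target}']",
        ]
    return "\n".join(lines) + "\n"


# Hypothetical SPI names, host, and paths; substitute whatever your extensions expose.
endpoints = {
    "custom-authenticator": ("keycloak.example.internal:8080", "/custom-authenticator/metrics"),
    "event-listener": ("keycloak.example.internal:8080", "/event-listener/metrics"),
}

print(scrape_config(endpoints))
```

Every new SPI means another entry in `endpoints`, which is where the configuration overhead comes from.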
You do need to watch out for the way it caches things, though. I suggest reading the relevant documentation, as caching works a bit differently in cluster mode: https://www.keycloak.org/docs/latest/server_installation/#_o...