by motakuk 1461 days ago
I agree that multi-component architecture is harder to deploy. We did our best and prepared tooling to make deployment an easy thing.

There is a Helm chart (https://github.com/grafana/oncall/tree/dev/helm/oncall), plus docker-compose files for hobby and dev environments.

Besides deployment, there are two main priorities for OnCall architecture: 1) It should be as "default" as possible: no fancy tech, no hacking around. 2) It should deliver notifications no matter what.

We chose the most "boring" (no offense Django community, that's a great quality for a framework) stack we know well: Django, Rabbit, Celery, MySQL, Redis. It's mature, reliable, and allows us to build a message bus-based pipeline with reliable and predictable migrations.

It's important for such a tool to be based on a message bus because it should have no single point of failure. If a worker dies, another will pick up the task and deliver the alert. If Slack goes down, you won't lose your data. It will continue delivering to other destinations and will deliver to Slack once it's back up.
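To make the "another worker picks it up" point concrete, here is a toy sketch in plain Python. It is not RabbitMQ's or Celery's API: `Broker`, `make_flaky_sender`, and `max_attempts` are hypothetical names for illustration, and the only idea it shows is that a task which fails goes back on the queue instead of being dropped.

```python
from collections import deque


class Broker:
    """Toy in-memory stand-in for a message bus (hypothetical, not a real broker API)."""

    def __init__(self):
        self.queue = deque()

    def publish(self, task):
        self.queue.append(task)

    def consume(self, handler, max_attempts=5):
        """Run handler on each task; requeue the task if the handler raises."""
        delivered = []
        attempts = {}
        while self.queue:
            task = self.queue.popleft()
            attempts[task] = attempts.get(task, 0) + 1
            try:
                handler(task)
            except Exception:
                # Not acknowledged: put it back so it can be retried later.
                if attempts[task] < max_attempts:
                    self.queue.append(task)
                continue
            delivered.append(task)
        return delivered


def make_flaky_sender(outage_calls):
    """Return a sender whose first `outage_calls` Slack deliveries fail."""
    state = {"slack_calls": 0}

    def send(task):
        channel, _alert = task
        if channel == "slack":
            state["slack_calls"] += 1
            if state["slack_calls"] <= outage_calls:
                raise ConnectionError("Slack is down")

    return send


broker = Broker()
broker.publish(("slack", "disk full"))
broker.publish(("email", "disk full"))
delivered = broker.consume(make_flaky_sender(outage_calls=2))
print(delivered)
```

The email alert goes out while Slack is failing, and the Slack alert is delivered once the simulated outage ends. A real broker adds persistence and acknowledgements across processes, which this sketch leaves out.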

The architecture you see in the repo has been live for 3+ years now. We were able to perform a few hundred data migrations without downtime, and had no major outages or data loss. So I'm pretty happy with this choice.

4 comments

I think your decisions were reasonable, as is the opinion of the person you're responding to.

To be fair, even in its current form, it should be possible to operate this system with SQLite (i.e. no db server) and in-process Celery workers (i.e. no RabbitMQ) if configured correctly, assuming they're not using MySQL-specific features in the app.

Using a message bus, a persistent data store behind a SQL interface, and a caching layer are all good design choices. I think the OP's concern is less with your particular implementations, and more with the principle of preventing operators from bringing their own preferred implementation of those interfaces to the table.

They mentioned that it makes sense because you were a standalone product, so stack portability was less of a concern. But as FOSS, you're opening yourself up to different standards on portability.

It requires some work from the maintainer to make the application tolerant of different fulfillments of the same interfaces. But it's good work. It usually results in cleaner separation of concerns between application logic and caching/message bus/persistence logic, for one. It also allows your app to serve a wider audience: for example, those who are locked in to using Postgres/Kafka/Memcached.

Nothing wrong with that. I managed 7+ Sensu "clusters" at a previous job, and its stack was a Ruby server, Redis and RabbitMQ. But I completely ditched RabbitMQ and used Redis for the queue and data. Simpler, more performant and more reliable (even if the feature was marked experimental). Our alerts were really spammy, and we had ~8k servers (each running a bunch of containers) per cluster, so these things were busy. Each cluster was 3x small nodes (6 GB memory, 2 CPUs). Memory usage was minuscule, typically <300 MB. Any box could be restarted without any impact because Redis just operated in (failover) mode and Sensu was horizontally scalable.

I get why you would add a relational DB to the mix. Personally, I'd like a Rabbit-free option.

Your message bus assumption sounds like one of the most ridiculous claims I've heard.

Sorry but why is rabbitmq really necessary?

You don't need Rabbit, Celery, or Redis. You should be able to replace MySQL with SQLite. Then it would be radically easier to deploy.

A MySQL database cluster, and a local copy of a SQL database on a single file on a single filesystem, are not close to the same thing. Except they both have "SQL" in the name.

One of them allows a thousand different nodes on different networks to share a single dataset with high availability. The other can't share data with any other application, doesn't have high availability, is constrained by the resources of the executing application node, has obvious performance limits, limited functionality, no commercial support, etc etc.

And we're talking about a product that's intended for dealing with on-call alerts. The entire point is to alert when things are crashing, so you would want it to be highly available. As in, running on more than one node.

I know the HN hipsters are all gung-ho for SQLite, but let's try to rein in the hype train.

This discussion is in the context of a self-contained app called Grafana OnCall, which is built on Django, which does not particularly care which RDBMS you are using.

At the very least, SQLite should be the default database for this product, and users can swap it out with their MySQL database cluster if they really are Google-scale.
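For what that swap could look like: a minimal sketch of a Django settings helper that defaults to SQLite and switches to MySQL via environment variables. The `ENGINE` strings are Django's built-in backends; the function name `database_settings`, the `DB_*` variable names, and the default file name are made up for illustration and are not OnCall's actual configuration.

```python
import os


def database_settings(env=None):
    """Build a Django DATABASES dict; SQLite unless DB_ENGINE says otherwise.

    All DB_* variable names here are hypothetical.
    """
    env = os.environ if env is None else env
    engine = env.get("DB_ENGINE", "sqlite")

    if engine == "sqlite":
        # Single file on local disk, no database server to run.
        return {
            "default": {
                "ENGINE": "django.db.backends.sqlite3",
                "NAME": env.get("DB_NAME", "oncall.sqlite3"),
            }
        }

    if engine == "mysql":
        # External server or cluster managed by the operator.
        return {
            "default": {
                "ENGINE": "django.db.backends.mysql",
                "NAME": env.get("DB_NAME", "oncall"),
                "USER": env.get("DB_USER", ""),
                "PASSWORD": env.get("DB_PASSWORD", ""),
                "HOST": env.get("DB_HOST", "127.0.0.1"),
                "PORT": env.get("DB_PORT", "3306"),
            }
        }

    raise ValueError(f"unsupported DB_ENGINE: {engine!r}")


# In settings.py this would be: DATABASES = database_settings()
DATABASES = database_settings({})
```

This only covers connection settings. Whether the app's queries and migrations actually run unchanged on both backends is a separate question that the sketch does not answer.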

> The entire point is to alert when things are crashing, so you would want it to be highly available. As in, running on more than one node.

An important question to ask is how much availability you are actually gaining from the setup. It wouldn't be the first time I've seen a system move from single-node to multi-node and end up less available than before due to the extra complexity and moving pieces.

I don't need any of that stuff, and neither does anyone who would use this. People who need clustered high-availability stuff are paying for PagerDuty or VictorOps.

This is for tiny shops with 4 servers. And tiny shops with 4 servers don't have time to spin up a horrendous stack like this. I was excited to see this announcement until I saw all the moving pieces. No thanks!

If you only have 4 servers, make a GitHub Action (or, hell, since we're assuming one node with SQLite, a cron job on one of your 4 servers) that curls your servers every 5 minutes and sends you a text when they're down. You don't need a Lamborghini to get groceries.
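A rough sketch of that cron-job approach, using only the Python standard library. The function names are hypothetical, and `print_alert` is a placeholder: actually sending a text would need some SMS gateway, which is not shown here.

```python
import urllib.error
import urllib.request


def is_up(url, timeout=5):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, DNS failures, timeouts.
        return False


def find_down(urls, probe=is_up):
    """Return the URLs for which the probe reports failure."""
    return [url for url in urls if not probe(url)]


def print_alert(down):
    # Placeholder notifier: swap in a call to an SMS or chat service.
    print("DOWN: " + ", ".join(down))


def run_once(urls, probe=is_up, notify=print_alert):
    """Check every URL once and notify only if something is down."""
    down = find_down(urls, probe)
    if down:
        notify(down)
    return down
```

A cron entry running every five minutes would call `run_once` with the list of server URLs. Note that the checker itself is a single point of failure: if the box running cron goes down, nothing alerts.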

And this is the on-prem version of those tools. Just because it isn't the tool you wanted doesn't mean it's not good.

It’s curious to see people questioning the stack choices of apps they haven’t built yet and problems they haven’t faced either.

They chose this stack, it works for them. They’ve put it through its paces in production.

It’s as boring as it gets.