| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by motakuk 1461 days ago

I agree that multi-component architecture is harder to deploy. We did our best and prepared tooling to make deployment an easy thing.

Helm (https://github.com/grafana/oncall/tree/dev/helm/oncall), docker-composes for hobby and dev environments.

Besides deployment, there are two main priorities for OnCall architecture: 1) It should be as "default" as possible. No fancy tech, no hacking around 2) It should deliver notifications no matter what.

We chose the most "boring" (no offense Django community, that's a great quality for a framework) stack we know well: Django, Rabbit, Celery, MySQL, Redis. It's mature, reliable, and allows us to build a message bus-based pipeline with reliable and predictable migrations.

It's important for such a tool to be based on message bus because it should have no single point of failure. If worker will die, the other will pick up the task and deliver alert. If Slack will go down, you won't loose your data. It will continue delivering to other destinations and will deliver to Slack once it's up.

The architecture you see in the repo was live for 3+ years now. We were able to perform a few hundreds of data migrations without downtimes, had no major downtimes or data loss. So I'm pretty happy with this choice.

4 comments

gen220 1461 days ago

I think your decisions were reasonable, as is the opinion of the person you're responding to.

To be fair, even in its current form, it should be possible to operate this system with sqlite (i.e. no db server) and in-process celery workers (i.e. no rabbit MQ) if configured correctly, assuming they're not using MySQL-specific features in the app.

Using a message bus, a persistent data store behind a SQL interface, and a caching layer are all good design choices. I think the OP's concern is less with your particular implementations, and more with the principle of preventing operators from bringing their own preferred implementation of those interfaces to the table.

They mentioned that it makes sense because you were a standalone product, so stack portability was less of a concern. But as FOSS, you're opening yourself up to different standards on portability.

It requires some work on the maintainer to make the application tolerant to different fulfillments of the same interfaces. But it's good work. It usually results in cleaner separation of concerns between application logic and caching/message bus/persistence logic, for one. It also allows your app to serve a wider audience: for example, those who are locked-in to using Postgres/Kafka/Memcached.

link

raffraffraff 1461 days ago

Nothing wrong with that. I managed 7+ Sensu "clusters" at a previous job, and it's stack was a ruby server, Redis and RabbitMQ. But I completely ditched RabbitMQ and used Redis for the queue and data. Simpler, more performant and more reliable (even if the feature was marked experimental). Our alerts were really spammy, and we had ~8k servers (each running a bunch of containers) per cluster, so these things were busy. Each cluster was 3x small nodes (6gb memory, 2CPU) Memory usage was miniscule, typically <300mb. Any box could be restarted without any impact because Redis just operated in (failover) mode and Sensu was horizontally scalable.

I get why you would add a relational DB to the mix. Personally, I'd like a Rabbit-free option.

link

Deritio 1461 days ago

Hearing your message bus assumption sounds like one of the most ridiculous claims I heard.

Sorry but why is rabbitmq really necessary?

link

slotrans 1461 days ago

You don't need Rabbit, Celery, or Redis. You should be able to replace MySQL with SQLite. Then it would be radically easier to deploy.

link

throwaway892238 1461 days ago

A MySQL database cluster, and a local copy of a SQL database on a single file on a single filesystem, are not close to the same thing. Except they both have "SQL" in the name.

One of them allows a thousand different nodes on different networks to share a single dataset with high availability. The other can't share data with any other application, doesn't have high availability, is constrained by the resources of the executing application node, has obvious performance limits, limited functionality, no commercial support, etc etc.

And we're talking about a product that's intended for dealing with on-call alerts. The entire point is to alert when things are crashing, so you would want it to be highly available. As in, running on more than one node.

I know the HN hipsters are all gung-ho for SQLite, but let's try to reign in the hype train.

link

pphysch 1461 days ago

This discussion is in the context of a self-contained app called Grafana OnCall, which is built on Django, which does not particularly care which RDBMS you are using.

At the very least, SQLite should be the default database for this product, and users can swap it out with their MySQL database cluster if they really are Google-scale.

link

gjulianm 1461 days ago

> The entire point is to alert when things are crashing, so you would want it to be highly available. As in, running on more than one node.

An important question to ask is how much availability are you actually gaining from the setup. It wouldn't be the first time I see a system moving from single-node to multinode and being less available than before due to the extra complexity and moving pieces.

link

slotrans 1461 days ago

I don't need any of that stuff, and nor does anyone who would use this. People who need clustered high-availability stuff are paying for PagerDuty or VictorOps.

This is for tiny shops with 4 servers. And tiny shops with 4 servers don't have time to spin up a horrendous stack like this. I was excited to see this announcement until I saw all the moving pieces. No thanks!

link

throwaway892238 1461 days ago

If you only have 4 servers, make a GitHub Action (or, hell, since we're assuming one node with SQLite, a cron job on one of your 4 servers) that curls your servers every 5 minutes and sends you a text when they're down. You don't need a Lamborghini to get groceries.

link

Spivak 1461 days ago

And this is the on-prem version of those tools. Just because it isn't the tool you wanted doesn't mean it's not good.

link

sergiomattei 1461 days ago

It’s curious to see people questioning the stack choices of apps they haven’t built yet and problems they haven’t faced either.

They chose this stack, it works for them. They’ve put it through its paces in production.

It’s as boring as it gets.

link