by dominikz 65 days ago

I was helping a small e-commerce shop move from Hetzner bare metal to Hetzner Cloud. Initially I thought the difficulty would be in moving the data, but I was surprised: the real difficulty was that the application had absolutely zero observability.

If I could turn the clock back a year and evaluate the tool you are proposing against what happened, here's how it would look.

The first thing we had problems with was performance. We moved to smaller Hetzner Cloud machines and split a multi-tenant bare metal system into fine-grained virtual machines. Having no metrics meant we were absolutely blind time-wise. We could log into the console and run 'top', but we couldn't do that after the fact. Decision: self-hosted graphite. I see you have metrics in tracewayapp -> +1.
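For anyone wondering what "getting metrics into graphite" involves, here is a minimal sketch of Graphite's plaintext protocol (one `path value timestamp` line per data point, sent over TCP, by default to port 2003). The metric name and the host passed to `send_to_graphite` are made-up placeholders, not what we actually ran.

```python
import socket
import time


def format_graphite_line(path: str, value: float, timestamp: int) -> str:
    """Build one data point in Graphite's plaintext format, newline-terminated."""
    return f"{path} {value} {timestamp}\n"


def send_to_graphite(lines: list[str], host: str, port: int = 2003) -> None:
    """Send pre-formatted lines to a Carbon plaintext listener over TCP."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Hypothetical metric path; a real setup would read the value from the system.
line = format_graphite_line("shop.vm042.cpu.load1", 0.73, int(time.time()))
print(line, end="")
# To actually ship it (assumed host name):
# send_to_graphite([line], host="graphite.internal.example")
```

The point is that the after-the-fact view comes from timestamped points being stored somewhere, rather than from someone watching 'top' live.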

Fast forward 2 months. The second problem we had was stability. Because we had moved from a single stable machine to around 200 unstable cloud machines, we had no idea which system was up and which wasn't. We researched how to outsource uptime monitoring. We had online meetings with the sales teams of uptime.com and uptimerobot. The initial cost was doing two of those 30-minute sales/fit calls, but that's marginal. The real cost would be their pricing, which, if I remember correctly, was 1 USD/probe/mo. We'd need 200 probes by their definition. The pricing is what killed the deals. Decision: self-host uptimekuma. There is the initial cost of the in-house setup, and then just the cost of the smallest Hetzner machine, which is 2.99 EUR/mo. We rely heavily on the uptimekuma->slack integration for notifications. I see no uptime tool in tracewayapp -> -1.
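To put rough numbers on that pricing gap, here is the arithmetic as a small sketch. The 1 USD/probe/mo figure is from memory, so treat it as an assumption; the two totals are in different currencies and are deliberately not converted, and the one-off setup effort is not included.

```python
PROBES = 200                # roughly one probe per cloud machine
SAAS_USD_PER_PROBE = 1.00   # recalled from memory, not a quoted price
SELF_HOSTED_EUR = 2.99      # smallest Hetzner machine, per month

saas_monthly_usd = PROBES * SAAS_USD_PER_PROBE
saas_yearly_usd = saas_monthly_usd * 12
self_hosted_yearly_eur = SELF_HOSTED_EUR * 12

print(f"SaaS probes: {saas_monthly_usd:.2f} USD/mo, {saas_yearly_usd:.2f} USD/yr")
print(f"Self-hosted: {SELF_HOSTED_EUR:.2f} EUR/mo, {self_hosted_yearly_eur:.2f} EUR/yr")
```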

Another 3 months passed. We stopped looking at graphite dashboards on a day-to-day basis. Natural human optimization. Systems started going down because they ran out of disk space, or because of bugs that exhausted database connection pools (twice a month for one of the biggest customers). We quickly realized we needed threshold notifications based on metrics. Decision: self-host moira. We rely heavily on its slack integration. It's hard to tell whether there's something similar in tracewayapp -> -1 (correct me if I'm wrong).
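The idea behind threshold notifications can be sketched in a few lines. This is not moira's API or configuration, just a generic illustration: classify a metric against warn/error thresholds and notify only when the state changes, so nobody has to stare at a dashboard. The threshold values and function names are hypothetical.

```python
import shutil


def classify(value: float, warn: float, error: float) -> str:
    """Map a metric value to a state using two ascending thresholds."""
    if value >= error:
        return "ERROR"
    if value >= warn:
        return "WARN"
    return "OK"


def disk_used_percent(path: str = "/") -> float:
    """Percentage of disk space in use on the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def should_notify(previous_state: str, current_state: str) -> bool:
    """Notify only on a state transition, not on every evaluation."""
    return previous_state != current_state


# Example thresholds (assumed): warn at 80% used, error at 90% used.
state = classify(disk_used_percent("."), warn=80.0, error=90.0)
print(state, should_notify("OK", state))
```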

Fast forward a few more months. Some of the deployments had a bug that resulted in a flood of exceptions. Even though the system was up (uptimekuma green), some critical services were not working. We did a tricky hibernate upgrade and only found out 4 hours after the deployment that the system was not working. Decision: integrate logging with metrics (graphite) and moira to trigger slack notifications when, say, #errors > threshold. It's hard to say whether this workflow is easily configurable in tracewayapp -> -1.
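A rough sketch of the "#errors > threshold" check, to make the workflow concrete. Note that this skips graphite and moira entirely and posts straight to a Slack incoming webhook (a JSON body with a `text` field), so it is a simplification of what we run, not a copy of it. The `ERROR` marker, the service name, and the function names are assumptions.

```python
import json
import urllib.request


def count_errors(log_lines, marker="ERROR"):
    """Count log lines in the current window that contain the error marker."""
    return sum(1 for line in log_lines if marker in line)


def build_slack_payload(error_count, threshold, service):
    """Build the JSON body for a Slack incoming webhook."""
    text = f"{service}: {error_count} errors in the last window (threshold {threshold})"
    return json.dumps({"text": text}).encode("utf-8")


def notify_slack(webhook_url, payload):
    """POST the payload to the webhook URL (your own, supplied by Slack)."""
    request = urllib.request.Request(
        webhook_url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        response.read()


def check_window(log_lines, threshold, service, webhook_url=None):
    """Return True if the window breached the threshold; notify if a URL is given."""
    errors = count_errors(log_lines)
    if errors > threshold:
        if webhook_url:
            notify_slack(webhook_url, build_slack_payload(errors, threshold, service))
        return True
    return False
```

With a check like this running every few minutes, a flood of exceptions after a deployment would show up in Slack within one window instead of 4 hours later.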

Can you elaborate on the points where I might have rated your app negatively, but you actually support these scenarios and the information is buried somewhere deep in your docs? (A few pointers would be helpful if this is the case.)

What I like about what looks like a very complex stack is that it works much the same way as UNIX pipes: I can swap out pretty much any one piece in this flow, i.e. I am avoiding vendor lock-in.