by cyco130 32 days ago
And this is Railway, a big enough name to top the HN main page and presumably find someone from Google to intervene at some point. I would have zero recourse if it was some little product that I built.

Their account was restored within 19 minutes of the first alert (10 minutes after the root cause was identified)! It just took 4-6 hours to get everything fully healthy. Hopefully we'll see a response from Google on this.

May 19, 22:10 UTC - Our automated monitoring detected API health check failures and paged our on-calls, who started investigating the issue.
May 19, 22:11 UTC - Dashboard returning 503 errors. Users unable to log in.
May 19, 22:19 UTC - Root cause identified: Google Cloud Platform has suspended Railway's production account.
May 19, 22:22 UTC - P0 ticket filed with Google Cloud. Railway's GCP account manager engaged directly.
May 19, 22:29 UTC - Incident declared.
May 19, 22:29 UTC - GCP account access restored. All compute instances remained stopped and persistent disks inaccessible.
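
For anyone checking the intervals in the timeline above, here is a minimal sketch that computes minutes elapsed since the first alert. It uses only the times quoted above; the shortened event labels and the function name are mine, not Railway's.

```python
from datetime import datetime

# Times copied from the timeline above (all May 19, UTC).
# Labels are shortened paraphrases, not the original wording.
FMT = "%H:%M"
events = [
    ("health check failures detected", "22:10"),
    ("root cause identified", "22:19"),
    ("P0 ticket filed", "22:22"),
    ("account access restored", "22:29"),
]

def minutes_since_detection(events):
    """Return {label: whole minutes elapsed since the first event}."""
    start = datetime.strptime(events[0][1], FMT)
    return {
        label: int((datetime.strptime(ts, FMT) - start).total_seconds() // 60)
        for label, ts in events
    }

elapsed = minutes_since_detection(events)
for label, mins in elapsed.items():
    print(f"+{mins:>2} min  {label}")
```

On these numbers, access came back 19 minutes after detection and 10 minutes after the root cause was identified.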

The timestamp inconsistency teraflop points out is interesting, but the bigger takeaway for me is that Railway's own automated API health checks caught the failure at 22:10, nine minutes before the root cause was identified at 22:19.

That's external dependency monitoring working exactly as it should. Most teams only monitor their own infrastructure. When a cloud provider, payment gateway, or third-party API fails, your own dashboards can show green while users see failures.

The lesson isn't specific to GCP — it's that monitoring what you depend on but don't control is just as important as monitoring what you own.
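
To make that concrete, here is a rough sketch of what probing external dependencies could look like, using only the Python standard library. The endpoint URLs and names are hypothetical placeholders, and this is not how Railway's monitoring is known to work; it just illustrates the idea of checking things you depend on but don't control.

```python
import urllib.error
import urllib.request

def http_probe(url, timeout=5.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts, and HTTP 4xx/5xx
        # (urlopen raises HTTPError, a URLError subclass, for those).
        return False

def check_dependencies(deps, probe=http_probe):
    """Probe each named dependency and return the names that failed."""
    return [name for name, url in deps.items() if not probe(url)]

# Hypothetical endpoints, for illustration only.
DEPENDENCIES = {
    "cloud-provider-api": "https://cloud.example.com/healthz",
    "payment-gateway": "https://payments.example.com/status",
}
```

You would call `check_dependencies(DEPENDENCIES)` from a scheduler and page someone when the returned list is non-empty. The `probe` parameter is injectable so the check can be tested without touching the network.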

100% agree. I've seen small players on Twitter and HN facing similar issues with no recourse and no response from Google. I don't know what kind of place they are trying to build there.

They got TK to woo the enterprise customers who were forced to be hostages to OCI. But it seems they are still doing the opposite of holding customers hostage here.

This is the bigger point of all of this. Scary.