And this is Railway, a big enough name to top the HN main page and presumably find someone from Google to intervene at some point. I would have zero recourse if it was some little product that I built.
Their account was restored in 10 / 19 minutes! It just took 4-6 hours to get everything fully healthy. I look forward to seeing Google's response to this, if there is one.
May 19, 22:10 UTC - Our automated monitoring detected API health check failures and paged our on-calls, who started investigating the issue.
May 19, 22:11 UTC - Dashboard returning 503 errors. Users unable to log in.
May 19, 22:19 UTC - Root cause identified: Google Cloud Platform has suspended Railway's production account.
May 19, 22:22 UTC - P0 ticket filed with Google Cloud. Railway's GCP account manager engaged directly.
May 19, 22:29 UTC - Incident declared.
May 19, 22:29 UTC - GCP account access restored. All compute instances remained stopped and persistent disks inaccessible.
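For anyone checking the arithmetic on those timestamps, here is a rough Python sketch that computes the gaps between the timeline entries. It assumes all entries fall on the same UTC day, as the timeline states; the `EVENTS` labels and the `minutes_between` helper are names made up for this illustration.

```python
from datetime import datetime

# Times copied from the timeline above (all May 19, UTC).
# The dictionary keys are hypothetical labels, not Railway's terminology.
EVENTS = {
    "detected": "22:10",
    "root_cause": "22:19",
    "ticket_filed": "22:22",
    "access_restored": "22:29",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    # Detection to root cause
    print(minutes_between(EVENTS["detected"], EVENTS["root_cause"]))         # 9
    # Root cause to access restored
    print(minutes_between(EVENTS["root_cause"], EVENTS["access_restored"]))  # 10
    # Detection to access restored
    print(minutes_between(EVENTS["detected"], EVENTS["access_restored"]))    # 19
```

So "10 minutes" counts from when the suspension was identified, and "19 minutes" counts from the first failed health check.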
The timestamp inconsistency teraflop points out is interesting, but the bigger takeaway for me is that Railway's own automated API health checks caught the failure at 22:10, about nine minutes before the root cause was identified at 22:19.
That's external dependency monitoring working exactly as it should. Most teams only monitor their own infrastructure. When a cloud provider, payment gateway, or third-party API fails, your own dashboards show green while users see failures.
The lesson isn't specific to GCP: monitoring what you depend on but don't control is just as important as monitoring what you own.
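To make that concrete, here is a minimal sketch of probing third-party dependencies alongside your own services. It is not how Railway does it (the postmortem doesn't say). The URLs in the usage example, the `CheckResult` type, and the function names are all hypothetical, and a real setup would add retries, alerting, and scheduling.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List
import urllib.error
import urllib.request


@dataclass
class CheckResult:
    name: str
    healthy: bool
    detail: str


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request; raises on network failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code worth reporting.
        return err.code


def check_dependency(
    name: str, url: str, fetch: Callable[[str], int] = http_status
) -> CheckResult:
    """Probe one dependency; any 2xx counts as healthy."""
    try:
        status = fetch(url)
    except Exception as exc:  # DNS failure, timeout, refused connection, ...
        return CheckResult(name, False, f"unreachable: {exc}")
    return CheckResult(name, 200 <= status < 300, f"HTTP {status}")


def failing(
    dependencies: Dict[str, str], fetch: Callable[[str], int] = http_status
) -> List[CheckResult]:
    """Return only the dependencies that failed their probe."""
    results = (check_dependency(n, u, fetch) for n, u in dependencies.items())
    return [r for r in results if not r.healthy]
```

Usage would look something like `failing({"payments": "https://payments.example.com/health"})`, run on a schedule, with a page sent whenever the returned list is non-empty. The `fetch` parameter is there so the probe can be swapped out in tests without touching the network.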
100% agree. On Twitter and HN I've seen small players face similar issues with no recourse and no response from Google. I don't know what kind of place they are trying to build there.
They got TK to woo the enterprise customers who were forced to be hostage to OCI. But it seems they are still doing the opposite of holding customers hostage here.