Hacker News new | ask | show | jobs
by terom 2033 days ago
Unsurprising to see such outages also tickling bugs/issues in the fallback behavior of dependent services that were intended to tolerate outages. There must be some classic law of cascading failures caused by error handling code :)

> Amazon Cognito uses Kinesis Data Streams [...] this information streaming is designed to be best effort. Data is buffered locally, allowing the service to cope with latency or short periods of unavailability of the Kinesis Data Stream service. Unfortunately, the prolonged issue with Kinesis Data Streams triggered a latent bug in this buffering code that caused the Cognito webservers to begin to block on the backlogged Kinesis Data Stream buffers.

> And second, Lambda saw impact. Lambda function invocations currently require publishing metric data to CloudWatch as part of invocation. Lambda metric agents are designed to buffer metric data locally for a period of time if CloudWatch is unavailable. Starting at 6:15 AM PST, this buffering of metric data grew to the point that it caused memory contention on the underlying service hosts used for Lambda function invocations, resulting in increased error rates.