by slrainka 1815 days ago
While working at one of the top 3 global airlines (around 2015), I deployed an experimental feature that streamed real-time indoor airport location (activated upon entering a geo-fence) into the airline's iOS mobile app used by hundreds of thousands of customers daily.

The setup was: mobile app -> detect beacon & ping web endpoint with customer-id+beacon-uuid -> WAF -> web application -> internal firewall -> Kafka cluster -> downstream applications/use cases

It was an experiment, so I didn't have high expectations for the number of customers who'd opt in to sharing their location. The 3-node Kafka cluster was running in a non-production environment. The location feed was primarily used to determine flow rates through the airport, which could then be used to predict TSA wait times, provide turn-by-turn indoor navigation, and provide walk times to gates and other POIs.

About a week in, the number of customers who had enabled location sharing ballooned, and pretty soon we were getting a very high volume of chatty traffic. This was not an issue, as resource utilization on the application servers and especially the Kafka cluster was very low. As we learned more about the behavior of the users, their movements and the application, the mobile team worked on a patch to reduce the number of location pings and only transmit deltas.
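
To illustrate the "only transmit deltas" idea, here is a minimal sketch. It is not the actual patch (the real app was iOS, and I am using Python only for illustration). It assumes a "delta" simply means the nearest beacon changed; `DeltaPinger` and `filter_pings` are hypothetical names.

```python
from typing import List, Optional


class DeltaPinger:
    """Decides whether a beacon sighting is worth sending (hypothetical logic)."""

    def __init__(self) -> None:
        self.last_beacon: Optional[str] = None

    def should_send(self, beacon_uuid: str) -> bool:
        # Assumption: a repeat sighting of the same beacon carries no new
        # information, so only a change of beacon counts as a delta.
        if beacon_uuid == self.last_beacon:
            return False
        self.last_beacon = beacon_uuid
        return True


def filter_pings(sightings: List[str]) -> List[str]:
    """Return only the sightings that would actually be transmitted."""
    pinger = DeltaPinger()
    return [b for b in sightings if pinger.should_send(b)]
```

A customer standing near one beacon would then produce a single ping instead of a steady stream of identical ones.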

One afternoon, I was upgrading one of the Kafka nodes and, before I could complete the process, had to run to a meeting. When I came back about an hour later and started checking email, there were Sev-2/P-2 notifications being sent out due to a global slowdown of communications to airports and flight operations. For context, on a typical day the airline scheduled 5,000 flights. As time went on it became apparent that it was a Sev-1/P-1 that had caused a near ground stop of the airline, but the operations teams were unable to communicate or correctly classify the extent of the outage because their internal communications had also slowed to a crawl. I don't usually look into network issues, but I logged into the incident call to see what was happening. From the call I gathered that a critical firewall was failing because its connections were maxed out, and restarting the firewall didn't seem to help. I had a weird feeling, so I logged into the Kafka node that I had been working on and started the services on it. Not even 10 seconds in, someone on the call announced that the connections on the firewall were coming down, and another 60 seconds later the firewall went back to humming as if nothing had happened.

I couldn't fathom what had happened. It was still too early to determine if there was a relationship between the downed Kafka node and the firewall failure. The incident call ended without identifying a root cause, but teams were going to start on that soon. I spent the next 2 hours investigating, and the following is what I discovered. The ES/Kibana dashboard showed that there had been no location events in the hour before I started the node. Then I checked the other 2 nodes in the Kafka cluster and discovered that, this being a non-prod env, they had been patched during the previous couple of days by the IT-infra team, and the Zookeeper and Kafka services hadn't started correctly afterwards. That meant the cluster had been running on a single node. When I took it offline, the entire cluster was offline. I talked to the web application team who owned the location service endpoint and learned that their servers were communicating with the Kafka cluster via the firewall that experienced the issue. Furthermore, we discovered that the Kafka producer library was set up to retry 3x in the event of a connection issue to Kafka. It became evident to us that the Kafka cluster being offline had caused the web application cluster to multiply its traffic through retries, which effectively DDoS'd the firewall.
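
A rough back-of-the-envelope model of that amplification, as a sketch only. It assumes every failed send triggers a fresh connection attempt through the firewall and that the firewall holds each attempt in its connection table until a timeout expires. All numbers are illustrative assumptions, not figures from the incident, and the function names are hypothetical.

```python
def attempts_per_second(pings_per_second: float, retries: int) -> float:
    """Connection attempts while the cluster is down: the first try plus each retry."""
    return pings_per_second * (1 + retries)


def open_connections(attempts_per_sec: float, timeout_seconds: float) -> float:
    """Steady-state connection-table entries (Little's law: arrival rate x time held)."""
    return attempts_per_sec * timeout_seconds


if __name__ == "__main__":
    # Illustrative assumptions: 500 pings/s, 3 retries, 30 s firewall timeout.
    rate = attempts_per_second(pings_per_second=500, retries=3)
    print(rate, open_connections(rate, timeout_seconds=30))
```

Under these assumptions the retries multiply the attempt rate by 4, and it is the time each failed attempt lingers on the firewall that turns a modest rate into a large number of held connections.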

Looking back, there were many lessons learned from this incident beyond the obvious things like better isolation of non-prod and production envs. The affected firewall was replaced immediately and some of the connections were re-routed. Infra teams started doing better risk/dependency modeling of the critical infrastructure. On a side note, I was quite impressed by how well a single Kafka node performed and the amount of traffic it was able to handle. I owned up to my error and promptly moved the IoT infrastructure to the cloud. In many projects that followed, these lessons were invaluable. Traffic modeling, dependency analysis, failure scenario simulation and blast radius isolation are etched into my DNA as a result of this incident.