| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ismarc 4983 days ago

When I saw the title, I thought it was going to be coverage of how under-the-hood EventMachine's model only allows it to scale up to a certain throughput per-process (which is reasonably high for most uses of it) or how there is some fundamental complexity in how it's written that pretty much guarantees there will be livelock conditions (and deadlock conditions). So I share your bewilderment about why they chose the things to highlight that they did.

The lock conditions it has are for particular workloads at particular throughput/utilization, so 99% of the people using it won't hit those conditions initially (and a lot of people aren't writing systems that will ever reach those limits). However, we have a particular service that acts as a TCP multiplexer/router/load balancer to provide high availability for some of our mission critical applications. A little over a year ago, I initially wrote it using EventMachine in a couple of weeks, then spent almost a month trying to find what I thought was a bug in my code where a full deadlock would occur if 3 or more connections were established in a small enough time frame. Turned out to be a bug in EventMachine. After fixing that one, I found another where if 5 or more open connections fired the same event within a small enough time frame, all 5 would hit livelock and given enough connections hitting the condition, the whole process would deadlock. Once I hit the second bug in EventMachine itself directly related to concurrency handling, I switched to Netty and rewrote the whole thing in Scala in about a week and it's been rock solid since.

I've been doing some development on the side in Go because its particular flavor of types and concurrency model is fascinating. It's not that Go is anti-event, it's that it's overwhelmingly stream oriented (not low-level streams, but data streams). Once I finally hit that moment of clarity that goroutines/channels was all about connecting streams of data and not connecting raw streams, they became a much more natural solution. However, don't get me started on the difference between a non-blocking read on a channel and a blocking read on a channel.

1 comments

tptacek 4982 days ago

You're threading, I presume? We do high speed high connection rate work in EventMachine (for instance, to work with order entry for trading exchanges) and have never seen anything like this --- but we religiously avoid Ruby threads.

ismarc 4982 days ago

Yeah we are, as it was the least-convoluted method for handling the multiplexing (but wanting to maintain the same handler for all of them). Burned entirely too much time assuming that documented ways to handle things were actually rock solid, not rarely used. The first bug where 3 connections could deadlock everything was unrelated to threading. It was related to how it created identifiers for each connection and would create identical identifiers for two different connections so it would think there was data pending for a socket, but the data had already been read, so it tried to do a blocking read when there wasn't data to read. I never fully tracked down the second bug, but it was more likely related to using threading.

One thing we did have was a large majority of the connections were coming from the same IP (but different source port) and a peak connection rate of ~200 connections per second per server and would last a few seconds (so at any given moment, roughly 1000 connections being established or with data in flight). I'd be really interested in hearing what your traffic patterns are roughly like (and hand-wavy what you have event machine doing, like proxying requests, building/returning its own responses, etc.) if you're willing to share at all.