Hacker News new | ask | show | jobs
by xigency 3726 days ago
As others have answered, it doesn't help fix this problem in any way. I'm assuming the team doing this investigation wasn't willing to rewrite `__inet_lookup_listener`. I guess the quick fix became the concern at that point. (Compared to rewriting kernel modules.)

But yes, it seems like an unnecessary change.

3 comments

I'm not sure inet_lookup_listener requires a rewrite for a general purpose kernel. They were hitting this problem because they were doing something unusual (starting thousands of listeners on the same port) and they found a pretty reasonable workaround.

In particular, a rewrite would have to make sure not to make the general case worse in an attempt to avoid this pathological situation.

Please read my other comment in this thread. https://news.ycombinator.com/item?id=11448987

Increasing the hash size doesn't fix the unhappy bucket, but it does reduce chance that packets will ever hit it. So yes, traversal of this bucket will be slow, but it will be hit less often.

__inet_lookup_listener could be modified to use a two level hash that only kicked in for buckets which exceeded a certain threshold. The second level hash tables would xor the parts of the destination IP address modulo the hash table size. Special care would have to be taken for listeners on 0.0.0.0. This would solve the problem with negligible cost to the general case but at the expense of added complexity.
In fact wouldn't it be simpler to just use an array with all 64k destination ports for the 1st level, and a hash based on destination IP for the 2nd level? This would only need 64k*sizeof(pointer) memory globally for the 1st level, which sounds reasonable for everything except low-memory embedded devices. Or does cache-locality matter here so much that its worth taking the hash-collision penalty on the 1st level?

For the 2nd level you could size the hash table appropriately since you always know the maximum number of IP addresses a host has.

Would be nice if this 2level array/hash was tunable from /proc or /sys.

Increasing the hash table size would be bad for embedded systems and would not reduce the complexity for problems like the one described here. In general, hash table tuning is challenging but you shouldn't need to make the table significantly larger than some fraction of the total number of entries as long, as your hash function is effective. The problem here is that listeners are differentiated by both destination IP and port but the hash table only uses the port and then does a linear lookup by IP. As was demonstrated the hash function is not always effective and the list of listeners in a bucket can get quite long. Presumably the existing scheme was chosen because it is simple and makes it really easy to handle 0.0.0.0 listeners.

Using a binary tree at each bucket could also work well but you would have to rebalance the tree periodically if listeners were inserted in sorted order. A self balancing tree could be used instead but then again this adds complexity.