Hacker News new | ask | show | jobs
by charliedevolve 3723 days ago
Why did increasing the size of the hash help? Wouldn't the %64 (or whatever the new value was) just send all the port 53 sockets into the same bucket again? It seems you'd need a different hash function that provides more uniformity.
4 comments

For TCP connections our DNS server now binds to ANY_IP address (aka: 0.0.0.0:53, :53). We call this "bind to star". While binding to specific IP addresses is still necessary for UDP, there is little benefit in doing that for the TCP traffic. For TCP we can bind to star safely, without compromising our DDoS defenses.

I suspect that's the real fix. Now all those (16k) bound addresses aren't creating hash table entries, so other connections that happen to use a port that hashes to 21 (or 53 after enlarging the table) aren't being shoved into a hash bucket that starts with 16k entries already in it.

The enlarging of the hash table I think is less a fix for this problem (although it would halve the number of later connections being put in the bucket), and more just a good fix they happened to do at the same time.

> The enlarging of the hash table I think is less a fix for this problem (although it would halve the number of later connections being put in the bucket), and more just a good fix they happened to do at the same time.

Yes. It just reduces the risk that they run into this problem again with a different port constellation.

A bit pedantic, but it doesn't necessarily reduce the risk, but rather the impact: using 64 buckets would only have half as many connections go into a bad bucket. This, however, does not in any way decrease the chance of the problem occuring again.
Using twice as many buckets, there will be half as many destination ports in the same bucket (65535 / 32 ≈ 2048, 65535 / 64 ≈ 1024), but since the "bad" connections described in the blogpost all use the same destination port, it won't change anything wrt that.

It does, however, reduce the overall impact when all connections are considered.

There is only a reduction in hash collisions if the destination ports are fairly evenly distributed. I just don't see how this helps at all in the case described.
@kbenson Your explanation makes a lot more sense. Increasing the hash table size probably didn't affect the perf. significantly, but binding to the ANY ip did.
So are things 'bound to star' checked somewhere else in the path? I don't understand why the hash table wouldn't just have entries for each of the receiving connections still.
There is only one listener bound to star, instead of 16k listeners for every IP. Thus, the hash bucket mapping to port 53 has only one entry instead of 16k.

It works out the same for the application: 1 fd or 16k fds doesn't really matter if you're using epoll, and that single fd can accept connections to any of those 16k IP addresses.

It would reduce the probability of collision. Sure, port 53 will always go to unlucky bucket. But port 16275 may or may not collide depending on thehash size.
Reduce the collisions how much, though? Are you using a port number == N*hashSize+53 for something else a lot? A collision is guaranteed for every connection to port 53, so isn't that a much bigger source of collisions? I think the hash size of 32 isn't the problem. It's the function.
I think I understand the confusion. The packets are hashed based on DESTINATION PORT.

Now, from a server point of view there are two types of connections: inbound and outbound. Our servers accept connections but they also establish connections, for example to your http origin hosts.

So from the point of view of our server the "colliding" packets will fit two categories: A) incoming packets to port 53 B) incoming packets to outbound connections which source port % 32 == 21.

For A) this is not that a big deal. DNS usually works over UDP, there are not _that_ many DNS queries done using TCP.

For B), since Linux choses source port incrementally, that means every 32'nd connection will possibly have some packets hitting the unhappy bucket.

Therefore increasing the hash size twice, reduces the chance of collision twice: now every 64th outbound connection will have some packets hitting the unhappy bucket.

Oh yes, I forgot that tcp/53 is mostly (if not entirely) for zone transfers.
:)

The full answer is: depending on which RFC you read :)

Initially the RFC's specified that you could only use TCP if you got UDP truncation _first_.

Nowadays that's relaxed but it's very vague when you should use TCP except for after UDP TR. For example Bind will try to connect over TCP if UDP fails.

Generally speaking most of the traffic goes over UDP, and sometimes, in undefined circumstances, some stuff may be requested over TCP. No hard rule.

Recalling that CloudFlare's business is dealing with malicious actors, you should assume that the caller has read all the RFCs and then deliberately disobeyed them.
As others have answered, it doesn't help fix this problem in any way. I'm assuming the team doing this investigation wasn't willing to rewrite `__inet_lookup_listener`. I guess the quick fix became the concern at that point. (Compared to rewriting kernel modules.)

But yes, it seems like an unnecessary change.

I'm not sure inet_lookup_listener requires a rewrite for a general purpose kernel. They were hitting this problem because they were doing something unusual (starting thousands of listeners on the same port) and they found a pretty reasonable workaround.

In particular, a rewrite would have to make sure not to make the general case worse in an attempt to avoid this pathological situation.

Please read my other comment in this thread. https://news.ycombinator.com/item?id=11448987

Increasing the hash size doesn't fix the unhappy bucket, but it does reduce chance that packets will ever hit it. So yes, traversal of this bucket will be slow, but it will be hit less often.

__inet_lookup_listener could be modified to use a two level hash that only kicked in for buckets which exceeded a certain threshold. The second level hash tables would xor the parts of the destination IP address modulo the hash table size. Special care would have to be taken for listeners on 0.0.0.0. This would solve the problem with negligible cost to the general case but at the expense of added complexity.
In fact wouldn't it be simpler to just use an array with all 64k destination ports for the 1st level, and a hash based on destination IP for the 2nd level? This would only need 64k*sizeof(pointer) memory globally for the 1st level, which sounds reasonable for everything except low-memory embedded devices. Or does cache-locality matter here so much that its worth taking the hash-collision penalty on the 1st level?

For the 2nd level you could size the hash table appropriately since you always know the maximum number of IP addresses a host has.

Would be nice if this 2level array/hash was tunable from /proc or /sys.

Increasing the hash table size would be bad for embedded systems and would not reduce the complexity for problems like the one described here. In general, hash table tuning is challenging but you shouldn't need to make the table significantly larger than some fraction of the total number of entries as long, as your hash function is effective. The problem here is that listeners are differentiated by both destination IP and port but the hash table only uses the port and then does a linear lookup by IP. As was demonstrated the hash function is not always effective and the list of listeners in a bucket can get quite long. Presumably the existing scheme was chosen because it is simple and makes it really easy to handle 0.0.0.0 listeners.

Using a binary tree at each bucket could also work well but you would have to rebalance the tree periodically if listeners were inserted in sorted order. A self balancing tree could be used instead but then again this adds complexity.

They might hash the source port also. Maybe they didn't mention this.
Mixing in the source port might be a good way to fix that hash function.
The hash is used to find listening sockets. Source port doesn't help you there.

Destination IP is a reasonable addition to the hash function, however.