| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by charliedevolve 3723 days ago
	Why did increasing the size of the hash help? Wouldn't the %64 (or whatever the new value was) just send all the port 53 sockets into the same bucket again? It seems you'd need a different hash function that provides more uniformity.

4 comments

kbenson 3722 days ago

For TCP connections our DNS server now binds to ANY_IP address (aka: 0.0.0.0:53, :53). We call this "bind to star". While binding to specific IP addresses is still necessary for UDP, there is little benefit in doing that for the TCP traffic. For TCP we can bind to star safely, without compromising our DDoS defenses.

I suspect that's the real fix. Now all those (16k) bound addresses aren't creating hash table entries, so other connections that happen to use a port that hashes to 21 (or 53 after enlarging the table) aren't being shoved into a hash bucket that starts with 16k entries already in it.

The enlarging of the hash table I think is less a fix for this problem (although it would halve the number of later connections being put in the bucket), and more just a good fix they happened to do at the same time.

creshal 3722 days ago

> The enlarging of the hash table I think is less a fix for this problem (although it would halve the number of later connections being put in the bucket), and more just a good fix they happened to do at the same time.

Yes. It just reduces the risk that they run into this problem again with a different port constellation.

stingraycharles 3722 days ago

A bit pedantic, but it doesn't necessarily reduce the risk, but rather the impact: using 64 buckets would only have half as many connections go into a bad bucket. This, however, does not in any way decrease the chance of the problem occuring again.

sinxoveretothex 3722 days ago

Using twice as many buckets, there will be half as many destination ports in the same bucket (65535 / 32 ≈ 2048, 65535 / 64 ≈ 1024), but since the "bad" connections described in the blogpost all use the same destination port, it won't change anything wrt that.

It does, however, reduce the overall impact when all connections are considered.

jcoffland 3722 days ago

There is only a reduction in hash collisions if the destination ports are fairly evenly distributed. I just don't see how this helps at all in the case described.

dhruvbird 3722 days ago

@kbenson Your explanation makes a lot more sense. Increasing the hash table size probably didn't affect the perf. significantly, but binding to the ANY ip did.

MBCook 3722 days ago

So are things 'bound to star' checked somewhere else in the path? I don't understand why the hash table wouldn't just have entries for each of the receiving connections still.

seabee 3722 days ago

There is only one listener bound to star, instead of 16k listeners for every IP. Thus, the hash bucket mapping to port 53 has only one entry instead of 16k.

It works out the same for the application: 1 fd or 16k fds doesn't really matter if you're using epoll, and that single fd can accept connections to any of those 16k IP addresses.

majke 3722 days ago

It would reduce the probability of collision. Sure, port 53 will always go to unlucky bucket. But port 16275 may or may not collide depending on thehash size.

charliedevolve 3722 days ago

Reduce the collisions how much, though? Are you using a port number == N*hashSize+53 for something else a lot? A collision is guaranteed for every connection to port 53, so isn't that a much bigger source of collisions? I think the hash size of 32 isn't the problem. It's the function.

majke 3722 days ago

I think I understand the confusion. The packets are hashed based on DESTINATION PORT.

Now, from a server point of view there are two types of connections: inbound and outbound. Our servers accept connections but they also establish connections, for example to your http origin hosts.

So from the point of view of our server the "colliding" packets will fit two categories: A) incoming packets to port 53 B) incoming packets to outbound connections which source port % 32 == 21.

For A) this is not that a big deal. DNS usually works over UDP, there are not _that_ many DNS queries done using TCP.

For B), since Linux choses source port incrementally, that means every 32'nd connection will possibly have some packets hitting the unhappy bucket.

Therefore increasing the hash size twice, reduces the chance of collision twice: now every 64th outbound connection will have some packets hitting the unhappy bucket.

charliedevolve 3722 days ago

Oh yes, I forgot that tcp/53 is mostly (if not entirely) for zone transfers.

majke 3722 days ago

:)

The full answer is: depending on which RFC you read :)

Initially the RFC's specified that you could only use TCP if you got UDP truncation _first_.

Nowadays that's relaxed but it's very vague when you should use TCP except for after UDP TR. For example Bind will try to connect over TCP if UDP fails.

Generally speaking most of the traffic goes over UDP, and sometimes, in undefined circumstances, some stuff may be requested over TCP. No hard rule.

bboreham 3722 days ago

Recalling that CloudFlare's business is dealing with malicious actors, you should assume that the caller has read all the RFCs and then deliberately disobeyed them.

xigency 3722 days ago

As others have answered, it doesn't help fix this problem in any way. I'm assuming the team doing this investigation wasn't willing to rewrite `__inet_lookup_listener`. I guess the quick fix became the concern at that point. (Compared to rewriting kernel modules.)

But yes, it seems like an unnecessary change.

jandrese 3722 days ago

I'm not sure inet_lookup_listener requires a rewrite for a general purpose kernel. They were hitting this problem because they were doing something unusual (starting thousands of listeners on the same port) and they found a pretty reasonable workaround.

In particular, a rewrite would have to make sure not to make the general case worse in an attempt to avoid this pathological situation.

majke 3722 days ago

Please read my other comment in this thread. https://news.ycombinator.com/item?id=11448987

Increasing the hash size doesn't fix the unhappy bucket, but it does reduce chance that packets will ever hit it. So yes, traversal of this bucket will be slow, but it will be hit less often.

jcoffland 3722 days ago

__inet_lookup_listener could be modified to use a two level hash that only kicked in for buckets which exceeded a certain threshold. The second level hash tables would xor the parts of the destination IP address modulo the hash table size. Special care would have to be taken for listeners on 0.0.0.0. This would solve the problem with negligible cost to the general case but at the expense of added complexity.

edwintorok 3722 days ago

In fact wouldn't it be simpler to just use an array with all 64k destination ports for the 1st level, and a hash based on destination IP for the 2nd level? This would only need 64k*sizeof(pointer) memory globally for the 1st level, which sounds reasonable for everything except low-memory embedded devices. Or does cache-locality matter here so much that its worth taking the hash-collision penalty on the 1st level?

For the 2nd level you could size the hash table appropriately since you always know the maximum number of IP addresses a host has.

Would be nice if this 2level array/hash was tunable from /proc or /sys.

jcoffland 3722 days ago

Increasing the hash table size would be bad for embedded systems and would not reduce the complexity for problems like the one described here. In general, hash table tuning is challenging but you shouldn't need to make the table significantly larger than some fraction of the total number of entries as long, as your hash function is effective. The problem here is that listeners are differentiated by both destination IP and port but the hash table only uses the port and then does a linear lookup by IP. As was demonstrated the hash function is not always effective and the list of listeners in a bucket can get quite long. Presumably the existing scheme was chosen because it is simple and makes it really easy to handle 0.0.0.0 listeners.

Using a binary tree at each bucket could also work well but you would have to rebalance the tree periodically if listeners were inserted in sorted order. A self balancing tree could be used instead but then again this adds complexity.

readams 3722 days ago

They might hash the source port also. Maybe they didn't mention this.

charliedevolve 3722 days ago

Mixing in the source port might be a good way to fix that hash function.

seabee 3722 days ago

The hash is used to find listening sockets. Source port doesn't help you there.

Destination IP is a reasonable addition to the hash function, however.