Hacker News new | ask | show | jobs
by JPLeRouzic 1083 days ago
The algorithm found seems so unintuitive that I wonder if it was not found by the AI.

"Allow a connection to continue if the first TCP payload (pkt) sent by the client satisfies any of the following exemptions:

Ex1: popcount(pkt) len(pkt) ≤ 3.4 or popcount(pkt) len(pkt) ≥ 4.6.

Ex2: The first six (or more) bytes of pkt are [0x20,0x7e].

Ex3: More than 50% of pkt’s bytes are [0x20,0x7e].

Ex4: More than 20 contiguous bytes of pkt are [0x20,0x7e].

Ex5: It matches the protocol fingerprint for TLS or HTTP.

Block if none of the above hold."

6 comments

It's extremely intuitive. You're trying to filter unusual, encrypted traffic.

First rule exploits the IND-CPA property of most encryption. You want to kill traffic that has about 4 bits set to 1 per byte, i.e. traffic that "looks random".

The following rules are exemptions for permissible encrypted or compressed traffic (note that compression, while not IND-CPA, results in high entropy and thus will trigger the first rule).

This could work very well, which is confirmed by the researchers in this paper.

Cool, I'll just base32 my encrypted traffic and sail through!
Yep, or simply add "GET " in front of it. Best add it in front of all packets.
Thanks for the information, it is very interesting.
> I wonder if it was not found by the AI.

Do you mean "found" by the CCP, or "found" by the researchers? In the case of the CCP it was likely generated through basic statistical analysis, and tuned to minimize side effects and collateral damage below some threshold of acceptability (~0.6% of global traffic unintentionally blocked). In the case of the researchers, the paper details the basic statistical analysis used to discover these rules.

Ex1 is just excepting low-entropy packets (distribution of 1s and 0s tends toward the mean for high-entropy data). Encrypted data presents as high-entropy. This is a crude method (errs on the side of not excepting) but is very efficient for embedded hardware to compute.

Ex2-4 are just excepting ASCII text, which is used by many unencrypted protocols (e.g. IMAP), but which are high enough entropy that they statistically will fail the first test often.

Ex5 is necessary because TLS is high-entropy (by nature of being encrypted). HTTP is also excepted presumably so e.g. compressed uploads (e.g. images/video) aren't flagged.

That "low entropy" is the key to bypassing the GFW isn't surprising at all -- high entropy is all but a necessary feature of most cryptography schemes. (I say "all but" because -- encryption isn't adding information, so unless you compress before you encrypt, it's possible for a (hypothetical) encryption scheme to preserve entropy, according to several objective metrics. I don't know of any that do this, beside the meta scheme of compression before encrypting, followed by steganographically padding the encrypted data afterward. This of course leaks some information through the encryption -- equal to the negentropy of the message -- but it would typically be information that can't be gleaned from context, e.g. that the message is HTML+text.)

So... base64-encode your TLS?

Looks more like it was found using random forests.
Lol you guys never worked with real data :D

There's at least 1'000 such algorithms at each google-like company.

That looks hilariously easy to defeat though it will require introducing "0x20,0x7e" padding to protocols heh.
This is just some experimentation results, it's not algorithm.
An algorithm is just a bunch of rules to follow to perform an operation, so this looks like an algorithm to me.
You misunderstood your parent comment. What he/she meant is that the "algorithm" is only a guess from reverse engineering. The actual algorithm deployed at GFW can look significantly different.
Oh yeah, that's definitely not how I read it. Thanks!