| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by JPLeRouzic 1083 days ago

The algorithm found seems so unintuitive that I wonder if it was not found by the AI.

"Allow a connection to continue if the first TCP payload (pkt) sent by the client satisfies any of the following exemptions:

Ex1: popcount(pkt) len(pkt) ≤ 3.4 or popcount(pkt) len(pkt) ≥ 4.6.

Ex2: The first six (or more) bytes of pkt are [0x20,0x7e].

Ex3: More than 50% of pkt’s bytes are [0x20,0x7e].

Ex4: More than 20 contiguous bytes of pkt are [0x20,0x7e].

Ex5: It matches the protocol fingerprint for TLS or HTTP.

Block if none of the above hold."

6 comments

H8crilA 1083 days ago

It's extremely intuitive. You're trying to filter unusual, encrypted traffic.

First rule exploits the IND-CPA property of most encryption. You want to kill traffic that has about 4 bits set to 1 per byte, i.e. traffic that "looks random".

The following rules are exemptions for permissible encrypted or compressed traffic (note that compression, while not IND-CPA, results in high entropy and thus will trigger the first rule).

This could work very well, which is confirmed by the researchers in this paper.

link

nerdbert 1083 days ago

Cool, I'll just base32 my encrypted traffic and sail through!

link

H8crilA 1082 days ago

Yep, or simply add "GET " in front of it. Best add it in front of all packets.

link

JPLeRouzic 1083 days ago

Thanks for the information, it is very interesting.

link

reaperman 1083 days ago

> I wonder if it was not found by the AI.

Do you mean "found" by the CCP, or "found" by the researchers? In the case of the CCP it was likely generated through basic statistical analysis, and tuned to minimize side effects and collateral damage below some threshold of acceptability (~0.6% of global traffic unintentionally blocked). In the case of the researchers, the paper details the basic statistical analysis used to discover these rules.

link

colanderman 1083 days ago

Ex1 is just excepting low-entropy packets (distribution of 1s and 0s tends toward the mean for high-entropy data). Encrypted data presents as high-entropy. This is a crude method (errs on the side of not excepting) but is very efficient for embedded hardware to compute.

Ex2-4 are just excepting ASCII text, which is used by many unencrypted protocols (e.g. IMAP), but which are high enough entropy that they statistically will fail the first test often.

Ex5 is necessary because TLS is high-entropy (by nature of being encrypted). HTTP is also excepted presumably so e.g. compressed uploads (e.g. images/video) aren't flagged.

That "low entropy" is the key to bypassing the GFW isn't surprising at all -- high entropy is all but a necessary feature of most cryptography schemes. (I say "all but" because -- encryption isn't adding information, so unless you compress before you encrypt, it's possible for a (hypothetical) encryption scheme to preserve entropy, according to several objective metrics. I don't know of any that do this, beside the meta scheme of compression before encrypting, followed by steganographically padding the encrypted data afterward. This of course leaks some information through the encryption -- equal to the negentropy of the message -- but it would typically be information that can't be gleaned from context, e.g. that the message is HTML+text.)

So... base64-encode your TLS?

link

tgv 1083 days ago

Looks more like it was found using random forests.

link

H8crilA 1083 days ago

Lol you guys never worked with real data :D

There's at least 1'000 such algorithms at each google-like company.

link

delfinom 1083 days ago

That looks hilariously easy to defeat though it will require introducing "0x20,0x7e" padding to protocols heh.

link

vbezhenar 1083 days ago

This is just some experimentation results, it's not algorithm.

link

amrocha 1083 days ago

An algorithm is just a bunch of rules to follow to perform an operation, so this looks like an algorithm to me.

link

netheril96 1083 days ago

You misunderstood your parent comment. What he/she meant is that the "algorithm" is only a guess from reverse engineering. The actual algorithm deployed at GFW can look significantly different.

link

amrocha 1082 days ago

Oh yeah, that's definitely not how I read it. Thanks!

link