Hacker News
by perplexes 3640 days ago
This is wonderful. I read the whole series in one sitting. It actually made video codecs feel way more approachable, rather than some patented black box magic I'll never understand.

It also reminded me of a recent article talking about how you can break audio codecs by guessing which quantizer was used by the packet, then using it in reverse to produce speech! I suppose that's obvious in retrospect: lossy codecs compress data by keeping the result perceptually similar, whatever the domain.

I also appreciated the ties to video game networking. Gaffer on Games has had a long-running series on designing multiplayer networking protocols with UDP and you two approach bit-shaving very similarly (unsurprisingly I suppose - it's a very specific process with its own tools).

Anyway, thank you! I learned a lot.

2 comments

It was a blast to write. Glenn is a smart guy with some great content around game networking. There are good ways to do networking for games and other real-time applications and TCP isn't really one of them.
Yup, if your data is time-sensitive and newer data preempts older data, TCP is almost always the wrong protocol to use.
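To make "newer data preempts older" concrete, here's a rough sketch of a receiver that keeps only the freshest snapshot and throws away anything stale or duplicated, which is what you'd typically layer on top of UDP. The `Snapshot` and `LatestStateReceiver` names and the integer sequence number are made up for illustration, and it ignores sequence-number wraparound; it isn't taken from ENet, Gaffer on Games, or any real library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    seq: int        # hypothetical, monotonically increasing per sender
    payload: bytes  # e.g. the latest position of a game object


class LatestStateReceiver:
    """Keeps only the newest snapshot; late or duplicate ones are dropped."""

    def __init__(self) -> None:
        self.latest: Optional[Snapshot] = None
        self.dropped = 0

    def on_datagram(self, snap: Snapshot) -> bool:
        # A snapshot that is not newer than what we already hold is useless:
        # the state it describes has already been superseded.
        if self.latest is not None and snap.seq <= self.latest.seq:
            self.dropped += 1
            return False
        self.latest = snap
        return True


def demo():
    rx = LatestStateReceiver()
    # Datagram 2 arrives after datagram 3, as can happen on a real network.
    arrivals = [
        Snapshot(1, b"pos=1"),
        Snapshot(3, b"pos=3"),
        Snapshot(2, b"pos=2"),
        Snapshot(4, b"pos=4"),
    ]
    for snap in arrivals:
        rx.on_datagram(snap)
    return rx.latest.seq, rx.dropped
```

With TCP, a lost segment would hold up everything behind it until it was retransmitted, even though the receiver no longer cares about it. Here the late snapshot is simply discarded.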
What are your thoughts on ENet?
Good, not great.
Anything better than ENet I could investigate?
Gaffer on Games and OpenTNL are good places to look.
I've had satisfying results with ENet. Would love to hear about similar lightweight C alternatives and/or why ENet is not great.
> you can break audio codecs by guessing which quantizer was used by the packet, then using it in reverse to produce speech!

Could you explain that using different words?

Here's the paper - https://www.cs.jhu.edu/~cwright/oakland08.pdf

This is a variant of "should you compress or encrypt first?"

Compression relies on pattern matching, and the compressed size will leak details about what you compressed, even if the result is then encrypted. (Unless you pad the encrypted output, but then what was the point of compressing? I can see some more or less secure ways to do this, like establishing a compression ratio/bandwidth/entropy limit and then padding to meet that constraint so each encrypted payload looks more or less the same, but latency sensitivity makes this difficult.)
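A quick way to see the size leak, as a sketch only: compress two same-length inputs with `zlib` and compare the output lengths. I'm assuming the encryption step preserves length (as a stream cipher roughly would), so the compressed length is what an eavesdropper sees. The `pad_to_bucket` framing (4-byte length prefix plus zero padding) and the bucket size are made up for illustration, not a recommended scheme.

```python
import hashlib
import zlib


def pseudo_random_bytes(n: int, seed: bytes = b"demo") -> bytes:
    """Deterministic, incompressible-looking bytes built from SHA-256 blocks."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]


def compressed_len(data: bytes) -> int:
    # Stand-in for "size on the wire" under a length-preserving cipher.
    return len(zlib.compress(data))


def pad_to_bucket(data: bytes, bucket: int) -> bytes:
    """Prefix the true length, then zero-pad up to a multiple of `bucket`."""
    framed = len(data).to_bytes(4, "big") + data
    padded_len = -(-len(framed) // bucket) * bucket  # ceiling division
    return framed + b"\x00" * (padded_len - len(framed))


def unpad(padded: bytes) -> bytes:
    n = int.from_bytes(padded[:4], "big")
    return padded[4:4 + n]


repetitive = b"yes " * 250            # 1000 highly patterned bytes
noisy = pseudo_random_bytes(1000)     # 1000 bytes with no useful patterns

# Same plaintext length, very different compressed lengths: that's the leak.
leak = (compressed_len(repetitive), compressed_len(noisy))

# After padding to a common bucket, both payloads are the same size.
padded_sizes = (
    len(pad_to_bucket(zlib.compress(repetitive), 2048)),
    len(pad_to_bucket(zlib.compress(noisy), 2048)),
)
```

The padded version hides the difference, but it also spends the bytes that compression saved, which is the tradeoff described above.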

In the case of VoIP, the codec uses a lookup table for distinct speech sounds (tch, sp, buh, etc.). Then "all it has to send" is table cell numbers (certainly not literally all). On the receiving side, you just look up each number in your speech table and reconstruct.

These values have distinct output patterns, particularly when compressed. If you can guess better than 70% of the time (I forget the exact number they achieved) what table value was used, then reconstruct it, you can listen in on what they're saying, without having to break the underlying encryption.
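Here's a toy version of the lookup-table idea, just to show the shape of it. The `CODEBOOK` entries are made-up two-number "frames", not anything from a real speech codec, and real codecs do far more than this. The sender transmits the index of the nearest table entry; the receiver looks the index up.

```python
from typing import List, Sequence, Tuple

Frame = Tuple[float, float]

# Hypothetical shared table: both sender and receiver hold the same copy.
CODEBOOK: List[Frame] = [
    (0.0, 0.0),
    (1.0, 0.0),
    (0.0, 1.0),
    (1.0, 1.0),
]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(frames: Sequence[Frame]) -> List[int]:
    """Replace each frame with the index of its nearest codebook entry."""
    return [
        min(range(len(CODEBOOK)), key=lambda i: sq_dist(frame, CODEBOOK[i]))
        for frame in frames
    ]


def decode(indices: Sequence[int]) -> List[Frame]:
    """Receiver side: look each index up in the shared table."""
    return [CODEBOOK[i] for i in indices]


indices = encode([(0.1, 0.2), (0.9, 0.1), (0.2, 0.8)])
approx = decode(indices)
```

The point for the attack is that anyone who can guess the indices, for example from packet sizes, can run `decode` themselves without ever touching the encryption.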

Voice codecs are also awful at encoding music, which may explain why, when you're on hold, the hold music may just be dropped and replaced with white noise because it's reached some bandwidth cap. Cf. video encoding and falling snow.

Hey, cool. Always nice to see this work show up on HN. But I don't think this is the paper you're looking for. In '08, we could only spot phrases that we knew in advance, and they had to be at least a certain length.

The most impressive results -- going from encrypted VoIP to text -- were done by Andy White and others, a couple years after the paper you linked above. It's this one:

A.M. White, A.R. Matthews, K.Z. Snow, and F. Monrose. "Phonotactic Reconstruction of Encrypted VoIP Conversations: Hookt on fon-iks." In Proceedings of IEEE S&P, 2011. http://www.cs.unc.edu/~fabian/papers/foniks-oak11.pdf

https://news.ycombinator.com/item?id=11995441 describes the premise. There is more detail around that comment in the thread that will address your question.