by mikepavone 180 days ago
> When the network is bad, you get... fewer JPEGs. That’s it. The ones that arrive are perfect.

This would make sense... if they were using UDP, but they are using TCP. All the JPEGs they send will get there eventually (unless the connection drops). JPEG does not fix your buffering and congestion control problems. What presumably happened here is that the way they implemented their JPEG screenshots includes some mechanism that minimizes the number of frames in flight. This is not some inherent property of JPEG though.

> And the size! A 70% quality JPEG of a 1080p desktop is like 100-150KB. A single H.264 keyframe is 200-500KB. We’re sending LESS data per frame AND getting better reliability.

h.264 has better coding efficiency than JPEG. For a given target size, you should be able to get better quality from an h.264 IDR frame than a JPEG. There is no fixed size to an IDR frame.

Ultimately, the problem here is a lack of bandwidth estimation (apart from the sort of binary "good network"/"cafe mode" thing they ultimately implemented). To be fair, this is difficult to do and being stuck with TCP makes it a bit more difficult. Still, you can do an initial bandwidth probe and then look for increasing transmission latency as a sign that the network is congested. Back off your bitrate (and if needed reduce frame rate to maintain sufficient quality) until transmission latency starts to decrease again.
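
A minimal Python sketch of the latency-based back-off described above. The class name, thresholds, and step sizes are all made up for illustration, and nothing here is what the article's authors run; a production controller would also need to filter measurement noise.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LatencyRateController:
    """Toy delay-based rate control: back off while send latency keeps growing."""
    bitrate_kbps: float              # starting point, e.g. from an initial probe
    min_kbps: float = 200.0          # hypothetical floor
    max_kbps: float = 8000.0         # hypothetical ceiling
    backoff: float = 0.8             # multiplicative decrease when latency rises
    probe_step_kbps: float = 50.0    # additive increase when latency is not rising
    tolerance_ms: float = 5.0        # ignore jitter smaller than this
    _prev_latency_ms: Optional[float] = None

    def update(self, latency_ms: float) -> float:
        """Feed one transmission-latency sample, get the new target bitrate."""
        prev = self._prev_latency_ms
        self._prev_latency_ms = latency_ms
        if prev is None:
            return self.bitrate_kbps             # first sample: nothing to compare
        if latency_ms > prev + self.tolerance_ms:
            # Latency is growing: queues are filling somewhere, so back off.
            self.bitrate_kbps = max(self.min_kbps, self.bitrate_kbps * self.backoff)
        elif latency_ms <= prev:
            # Latency is flat or falling: carefully probe for more bandwidth.
            self.bitrate_kbps = min(self.max_kbps, self.bitrate_kbps + self.probe_step_kbps)
        return self.bitrate_kbps
```

The encoder would be reconfigured with the returned value after each sample; if quality gets too low at that bitrate, the frame rate is the next thing to reduce.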

WebRTC will do this for you if you can use it, which actually suggests a different solution to this problem: use websockets where dumb corporate network firewall rules require it and just use WebRTC for everything else.


They shared the polling code in the article. It doesn't request another jpeg until the previous one finishes downloading. UDP is not necessary to write a loop.
> They shared the polling code in the article. It doesn't request another jpeg until the previous one finishes downloading.

You're right, I don't know how I managed to skip over that.

> UDP is not necessary to write a loop.

True, but this doesn't really have anything to do with using JPEG either. They basically implemented a primitive form of rate control by only allowing a single frame to be in flight at once. It was easier for them to do that using JPEG because they (by their own admission) seem to have limited control over their encode pipeline.
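
To make the "single frame in flight" point concrete, here is a rough Python sketch of that kind of loop. It is not the article's code; `fetch_frame` and `show_frame` are hypothetical callables standing in for the HTTP request and the display step.

```python
import time
from typing import Callable


def poll_frames(fetch_frame: Callable[[], bytes],
                show_frame: Callable[[bytes], None],
                max_frames: int) -> float:
    """Request frames strictly one at a time; return the achieved frames per second."""
    start = time.monotonic()
    for _ in range(max_frames):
        frame = fetch_frame()   # blocks until the whole frame has arrived
        show_frame(frame)       # only after this is the next request issued
    elapsed = time.monotonic() - start
    return max_frames / elapsed if elapsed > 0 else float("inf")
```

Nothing in the loop depends on the payload being a JPEG. The rate control comes from never having more than one request outstanding, so a slow network simply produces fewer iterations per second.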

> have limited control over their encode pipeline.

Frustratingly this seems common in many video encoding technologies. The code is opaque, often has special kernel, GPU and hardware interfaces which are often closed source, and by the time you get to the user API (native or browser) it seems all knobs have been abstracted away and simple things like choosing which frame to use as a keyframe are impossible to do.

I had what I thought was a simple usecase for a video codec - I needed to encode two 30 frame videos as small as possible, and I knew the first 15 frames were common between the videos so I wouldn't need to encode that twice.

I couldn't find a single video codec which could do that without extensive internal surgery to save all internal state after the 15th frame.

A 15 frame min and max GOP size would do the trick, then you'd get two 15 frame GOPs. Each GOP can be concatenated with another GOP with the same properties (resolution, format, etc) as if they were independent streams. So there is actually a way to do this. This is how video splitting and joining without re-encoding works, at GOP boundaries.
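
For illustration, one way to ask for a fixed 15 frame GOP is through the ffmpeg command line with libx264. The helper below only builds the argument list; the function name and file names are made up, and it assumes an ffmpeg build with libx264 available.

```python
from typing import List


def fixed_gop_args(src: str, dst: str, gop: int = 15) -> List[str]:
    """Build an ffmpeg command line that asks for a keyframe every `gop` frames."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264",
        "-g", str(gop),            # maximum keyframe interval
        "-keyint_min", str(gop),   # minimum keyframe interval
        "-sc_threshold", "0",      # no extra keyframes on scene cuts
        dst,
    ]


print(" ".join(fixed_gop_args("in.mp4", "out.mp4")))
```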
In my case, bandwidth really mattered, so I wanted all one GOP.

Ended up making a bunch of patches to libx264 to do it, but the compute cost of all the encoding on CPU is crazy high. On the decode side (which runs on consumer devices), we just make the user decode the prefix many times.

> I couldn't find a single video codec which could do that without extensive internal surgery to save all internal state after the 15th frame.

fork()? :-)

But most software, video codec or not, simply isn't written to serialize its state at arbitrary points. Why would it?

A word processor can save its state at an arbitrary point... That's what the save button is for, and it's functional at any point in the document writing process!

In fact, nearly everything in computing is serializable - or if it isn't, there is some other project with a similar purpose which is.

However this is not the case with video codecs - but this is just one of many examples of where the video codec landscape is limiting.

Another example is that on the internet lots of videos have a 'poster frame' - often the first frame of the video. That frame for nearly all usecases ends up downloaded twice - once as a jpeg, and again inside the video content. There is no reasonable way to avoid that - but doing so would reduce the latency to play videos by quite a lot!

> A word processor can save its state at an arbitrary point... That's what the save button is for, and it's functional at any point in the document writing process!

No, they generally can't save their whole internal state to be resumed later, and definitely not in the document you were editing. For example, when you save a document in vim it doesn't store the mode you were in, or the keyboard macro step that was executing, or the search buffer, or anything like that.

> In fact, nearly everything in computing is serializable - or if it isn't, there is some other project with a similar purpose which is.

Serializable in principle, maybe. Actually serializable in the sense that the code contains a way to dump to a file and back, absolutely not. It's extremely rare for programs to expose a way to save and restore from a mid-state in the algorithm they're implementing.

> Another example is that on the internet lots of videos have a 'poster frame' - often the first frame of the video. That frame for nearly all usecases ends up downloaded twice - once as a jpeg, and again inside the video content.

Actually, it's extremely common for a video thumbnail to contain extra edits such as overlaid text and other graphics that don't end up in the video itself. It's also very common for the thumbnail to not be the first frame in the video.

> A word processor can save its state at an arbitrary point...

As its ENTIRE STATE. Video codecs operate on essentially a full frame plus a stream of differences. You might say it's similar to git and you'd be incorrect again, because while with git you can take the current state and "go back" using diffs, that is not the case for video: it always goes forward from the keyframe and resets on the next keyframe.

It's fundamentally an order of magnitude more complex problem to handle.

I'm on a media engineering team and agree that applying the tech to a new use case often involves people with deep expertise spending a lot of time in the code.

I'd guess there are fewer media/codec engineers around today than there were web developers in 2006. In 2006, Gmail existed, but today's client- and server-side frameworks did not. It was a major bespoke lift to do many things which are "hello world" demos with a modern framework in 2025.

It'd be nice to have more flexible, orthogonal and adaptable interfaces to a lot of this tech, but I don't think the demand for it reaches critical mass.

> It was a major bespoke lift to do many things which are "hello world" demos with a modern framework in 2025.

This brings back a lot of memories -- I remember teaching myself how to use plain XMLHttpRequest and PHP/MySQL to implement "AJAX" chat. Boy was that ugly JavaScript code. But on the other hand, it was so fast and cool and I could hardly believe that I had written that.

I started doing media/codec work around 2007 and finding experienced media engineers at the time was difficult and had been for quite some time. It's always been hard - super specialized knowledge that you can only really pick up working at a company that does it often enough to invest in folks learning it. In my case we were at a company that did desktop video editing software so it made sense, but that's obviously uncommon.
I wonder if we could scan / test / dig out these hidden features somehow, in a scraping / fuzzing fashion.
So US->Australia/Asia, wouldn't that limit you to 6fps or so due to half-RTT? Each time a frame finishes arriving, your new request needs 150ms or so to reach the server.
That sounds fine for most screen sharing use cases.
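
The arithmetic behind that estimate, as a small Python sketch. The 150 ms figure is the one quoted above. The 100 ms transfer time is a made-up example (roughly a 125 KB frame over a 10 Mbit/s link), and whether the dead time per frame is a half or a full round trip depends on how the requests are issued.

```python
def max_fps_one_in_flight(dead_time_s: float, transfer_s: float = 0.0) -> float:
    """Upper bound on frame rate when each request waits for the previous frame.

    dead_time_s: time per frame during which no frame data is flowing
    transfer_s:  time to actually move one frame's bytes over the link
    """
    return 1.0 / (dead_time_s + transfer_s)


print(round(max_fps_one_in_flight(0.150), 1))         # 6.7
print(round(max_fps_one_in_flight(0.150, 0.100), 1))  # 4.0
```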
Probably either (1) they don't request another jpeg until they have the previous one on-screen (so everything is completely serialized and there are no frames "in-flight" ever) (2) they're doing a fresh GET for each and getting a new connection anyway (unless that kind of thing is pipelined these days? in which case it still falls back to (1) above.)
You can still get this backpressure properly even if you're doing it push-style. The TCP socket will eventually fill up its buffer and start blocking your writes. When that happens, you stop encoding new frames until the socket is able to send again.

The trick is to not buffer frames on the sender.
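
A rough Python sketch of that push-style approach, assuming a connected TCP socket and a hypothetical `encode_frame` callable that captures and encodes the newest frame on demand. The point is that a frame is only produced when the kernel is ready to take it, so nothing queues up in the application.

```python
import select
import socket
from typing import Callable


def send_latest_frame(sock: socket.socket,
                      encode_frame: Callable[[], bytes]) -> bool:
    """Encode and send one frame only if the socket can accept data right now.

    Returns False, without encoding anything, when the kernel send buffer
    is full. The caller just tries again on its next tick.
    """
    _, writable, _ = select.select([], [sock], [], 0)   # zero timeout: poll only
    if not writable:
        return False                 # backpressure: skip this frame, keep no queue
    sock.sendall(encode_frame())     # encode as late as possible, then send
    return True
```

A skipped tick costs a frame, not latency: the next frame that does go out is a fresh one rather than one that sat in a sender-side queue.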

You probably won't get acceptable latency this way since you have no control over buffer sizes on all the boxes between you and the receiver. Buffer bloat is a real problem. That said, yeah if you're getting 30-45 seconds behind at 40 Mbps you've probably got a fair bit of sender-side buffering happening.
> you have no control over buffer sizes on all the boxes between you and the receiver

You certainly do; the amount of data buffered can never be larger than the actual number of bytes you've sent out. Bufferbloat happens when you send too much stuff at once and nothing (typically the candidate to do so would be either the congestion window or some intermediate buffer) stops it from piling up in an intermediate buffer. If you just send less from userspace in the first place (which isn't a good thing to do for e.g. a typical web server, but _can_ be for this kind of video conference-like application), it can't pile up anywhere.

(You could argue that strictly speaking, you have no control over the _buffer_ sizes, but that doesn't matter in practice if you're bounding the _buffered data_ sizes.)
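
One way to bound the buffered data from userspace, sketched in Python. It assumes the receiver sends some application-level acknowledgement per frame, which is my assumption and not something from the article; the class and method names are made up.

```python
class InFlightBudget:
    """Cap the number of bytes sent but not yet acknowledged by the receiver."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_send(self, size: int) -> bool:
        """Reserve room for `size` bytes; False means skip this frame."""
        # Always allow one frame when nothing is outstanding, so a single
        # frame larger than the budget cannot stall the stream forever.
        if self.in_flight and self.in_flight + size > self.max_in_flight:
            return False
        self.in_flight += size
        return True

    def on_ack(self, size: int) -> None:
        """The receiver confirmed `size` bytes, so release that much budget."""
        self.in_flight = max(0, self.in_flight - size)
```

With a budget of about one or two frames, whatever is queued in intermediate buffers can never exceed that amount, regardless of how large those buffers are.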

Related tangent: it's remarkable to me how a given jpeg can be literally visually indistinguishable from another (by a human on a decent monitor) yet consist of 10-15% as many bytes. I got pretty deep into web performance and image optimization in the late 2000s and it was gratifying to have so much low-hanging fruit.
> Still, you can do an initial bandwidth probe and then look for increasing transmission latency as a sign that the network is congested. Back off your bitrate (and if needed reduce frame rate to maintain sufficient quality) until transmission latency starts to decrease again.

They said playing around with bitrate didn't reduce the latency; all that happened was they got blocky videos with the latency remaining the same.

I am almost sure that the most perfect solution would involve using a video codec protocol but the issue is implementation complexity and having to implement a production encoder yourself if your usecase is unusual.

This is exactly the point of the article: they tried keyframes only, but their library had a bug that broke it.

Regarding the encoding efficiency, I imagine the problem is that the compromise in quality shows in the space dimension (aka fewer or blurry pixels) rather than in time. Users need to read text clearly, so the compromise in the time dimension (fewer frames) sounds just fine.
Nothing is stopping you from encoding h264 at a low frame rate like 5 or 10 fps. In WebRTC, you can actually specify how you want to handle low bitrate situations with degradationPreference. If set to maintain-resolution, it will prefer sacrificing frame rate.