| HN Mirror

by manca 715 days ago

When I read "lossless", I immediately thought of editing genuinely lossless (or visually lossless intermediate) formats like ProRes, Motion JPEG 2000, HuffYUV, etc. But what this ultimately does is remux the original container into a new one without touching the elementary stream (no re-encoding).

It's no wonder that it uses FFmpeg to do the heavy lifting, but I think it's worthwhile for the community to understand how this process ultimately works.

In a nutshell, every modern video format you know about - mp4, mov, avi, ts, etc. - is ultimately just the file extension of a container that can hold multiple video and audio tracks. The tracks are called Elementary Streams (ES), and each is encoded separately with an appropriate codec such as H.264/AVC, H.265/HEVC, AAC, etc. Then, in a process called "muxing", they are put together in a container and each sample/frame is timestamped so the ESes stay in sync.
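To make the remux idea concrete, here is a minimal sketch of driving FFmpeg from Python, assuming an `ffmpeg` binary is on the PATH. `-c copy` asks FFmpeg to copy the packets of each stream without decoding them, and `-map 0` keeps every track from the input. The helper names and file names are made up for illustration, and this is not necessarily the exact command the tool runs.

```python
import subprocess


def remux_command(src, dst):
    """Build an ffmpeg command that rewraps src into dst without re-encoding."""
    return [
        "ffmpeg",
        "-i", src,      # input container
        "-map", "0",    # keep every stream from the input, not only the default picks
        "-c", "copy",   # stream copy: packets are copied, not decoded or re-encoded
        dst,            # output container format is inferred from the extension
    ]


def remux(src, dst):
    """Run the remux; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(remux_command(src, dst), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(remux_command("input.mp4", "output.mkv")))
```

Keep in mind that not every container accepts every codec, so a stream copy into a different container can still fail even though nothing is re-encoded.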

Now, since the ES is compressed, you don't get frame-level accuracy when seeking, for example, because the only frame that can be decoded on its own is an I-frame. Every subsequent frame (P or B) is decoded using information from other frames, ultimately going back to the I-frame. A sequence like IPPBPPB... is called a GOP (Group of Pictures).
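As a toy illustration of the GOP idea, the sketch below splits a sequence of frame types into groups that each start at an I-frame. The function name `split_into_gops` is made up, and it deliberately ignores real-world details such as decode order versus display order, or open GOPs where B-frames reference across the boundary.

```python
def split_into_gops(frame_types):
    """Split a sequence of frame types ('I', 'P', 'B') into GOPs.

    Each GOP starts at an I-frame. Frames that appear before the first
    I-frame (if any) are returned as a leading partial group.
    """
    gops = []
    for ftype in frame_types:
        # Start a new group on every I-frame, or for the very first frame.
        if ftype == "I" or not gops:
            gops.append([])
        gops[-1].append(ftype)
    return gops


if __name__ == "__main__":
    for gop in split_into_gops("IPPBPPBIPP"):
        print("".join(gop))
```

A cut that lands on the first frame of one of these groups needs no decoding at all; a cut anywhere else depends on frames earlier in the same group.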

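The cool part is that you can glean the frame type without decoding anything, by looking at the NAL units (Network Abstraction Layer), whose headers identify what each unit carries (a slice of a given picture type, parameter sets, etc.). For example, in an H.264 byte stream each NAL unit follows the start code 0x000001, and the low five bits of the next byte are the NAL unit type: 5 is a slice of an IDR picture (the keyframe you can cut on), and 7 is the SPS that typically precedes it.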
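Here is a rough sketch of that header scan for H.264 in Annex B byte-stream form (the layout used in raw .h264 files and in ts). In H.264 the NAL unit type is the low five bits of the byte right after the 0x000001 start code: 5 is an IDR slice, 7 is an SPS, 8 is a PPS. The function name `iter_nal_units` is made up, and the sample bytes are hand-written stubs rather than a real stream. MP4/MOV usually store NAL units with a length prefix instead of start codes, so this exact scan would not apply there.

```python
NAL_TYPE_NAMES = {
    1: "non-IDR slice",
    5: "IDR slice",
    6: "SEI",
    7: "SPS",
    8: "PPS",
    9: "access unit delimiter",
}

START_CODE = b"\x00\x00\x01"  # a 4-byte 00 00 00 01 start code also ends with this


def iter_nal_units(data):
    """Yield (header_offset, nal_unit_type) for each NAL unit in an Annex B stream."""
    pos = 0
    while True:
        start = data.find(START_CODE, pos)
        if start == -1 or start + 3 >= len(data):
            return
        header = data[start + 3]
        yield start + 3, header & 0x1F  # low 5 bits = nal_unit_type
        pos = start + 3


if __name__ == "__main__":
    # Hand-made stub: SPS, PPS, IDR slice, non-IDR slice (payloads truncated).
    sample = bytes.fromhex("00000001 6742 00000001 68ce 000001 6588 000001 419a")
    for offset, nal_type in iter_nal_units(sample):
        print(offset, nal_type, NAL_TYPE_NAMES.get(nal_type, "other"))
```

Nothing here decodes any picture data; it only reads one byte per NAL unit, which is why this kind of scan is so cheap.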

Putting all this together, you can scan the ES bitstream and detect GOP boundaries without decoding the stream. The challenge, of course, is that you can't just cut in the middle of a GOP. The solution is either to accept cuts that are only accurate to the nearest GOP boundary (typically under a second off), or to decode just the affected GOP (often around 30 frames) and re-encode it so the output starts with a new I-frame at the cut point (any fully decoded frame can be encoded as an I-frame). Everywhere else, all you do is literally super fast bit manipulation and copying from one container into another. That's why this is such an efficient process if all you care about is cutting the original video into segments.
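The "be ok with some inaccuracy" option boils down to snapping the requested cut time to a keyframe. A minimal sketch, assuming you already have a sorted, non-empty list of keyframe timestamps in seconds (for example from a header scan); the function name `snap_to_keyframe` is made up for illustration:

```python
from bisect import bisect_right


def snap_to_keyframe(keyframe_times, cut_time):
    """Return the latest keyframe timestamp <= cut_time.

    keyframe_times must be sorted and non-empty. If cut_time is earlier
    than the first keyframe, the first keyframe is returned.
    """
    i = bisect_right(keyframe_times, cut_time)
    return keyframe_times[max(i - 1, 0)]


if __name__ == "__main__":
    keyframes = [0.0, 1.0, 2.0, 3.0]  # e.g. one keyframe every 30 frames at 30 fps
    print(snap_to_keyframe(keyframes, 2.4))
```

With one keyframe per second, as in this example, the snapped cut is never more than a second before the requested time, which is where the sub-second accuracy figure comes from.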