by the8472 3463 days ago
> 1) You are now assuming that "seeking to a position will produce the same output as decoding to a position"; even if the video is well-formed (and you don't end up with massive issues where the key frames just don't work correctly) you are likely going to end up with subtle discontinuities between every segment.

Wouldn't "the keyframes just don't work correctly" result in corrupted output anyway?

If we're worrying about already-broken situations then it is quite obvious that additional breakage may occur in related features.


I think the point is that video definitely is that broken and the only reason video does work is because everyone has work-arounds for everyone else's bugs. At least that's my experience with video. It's all a disaster.
Yes, this. Working with video is as though there were no such thing as a documented API or standards document, but instead, you find the longest-lived bugs in the popular toolchains and in the clients of your customers, and those bugs are the foundation of the interfaces you implement.
I believe[1] this isn't necessarily about broken files. There is a lot of variation allowed by the spec. One example that I've seen in the wild is extra-long (> 60 seconds) periods between I-frames. Seeking to an arbitrary point then requires searching backwards from the seek point for an I-frame and holding a massive amount of decoded data in RAM. As this usually isn't possible and would require decoding hundreds of frames, decoders may cheat and make do with as many P- and B-frames as they can handle.

[1] I haven't actually read most of the h.265 spec. It's possible these are technically invalid files.
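
To make the cost of "search backwards for an I-frame" concrete, here is a minimal sketch. It assumes a hypothetical sorted list of I-frame indices (`keyframes`) and a hypothetical helper name (`frames_to_decode`); a real player would read these positions from the container index. It also ignores the difference between decode order and display order.

```python
from bisect import bisect_right

def frames_to_decode(keyframes, target):
    """Return (start, count): the I-frame to start from and how many
    frames must be decoded to reach `target` exactly.

    `keyframes` is a sorted list of frame indices that are I-frames
    (hypothetical input, not taken from any real file).
    """
    pos = bisect_right(keyframes, target) - 1
    if pos < 0:
        raise ValueError("no I-frame at or before the target frame")
    start = keyframes[pos]
    return start, target - start + 1

# Hypothetical stream: 24 fps with an I-frame only every 60 seconds.
fps = 24
keyframes = [i * 60 * fps for i in range(60)]
start, count = frames_to_decode(keyframes, target=59 * fps)  # seek to 0:59
print(start, count)
```

With these assumed numbers, seeking to 0:59 means starting at frame 0 and decoding well over a thousand frames before the requested one can be shown.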

A 1-minute span between I-frames would not be prohibitive for the parallel processing that the quoted part was referring to. With a 60-minute video it would still give you 60 segments to process in parallel.
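
A rough sketch of that segment count, assuming every keyframe starts an independently decodable segment (closed GOPs) and that keyframes are evenly spaced. The function name `keyframe_segments` is made up for illustration.

```python
def keyframe_segments(duration_s, keyframe_interval_s):
    """Split [0, duration_s) into segments that each start at a keyframe.

    Assumes evenly spaced keyframes and that each segment can be
    decoded without data from its neighbours.
    """
    if keyframe_interval_s <= 0:
        raise ValueError("keyframe interval must be positive")
    segments = []
    start = 0
    while start < duration_s:
        end = min(start + keyframe_interval_s, duration_s)
        segments.append((start, end))
        start = end
    return segments

segments = keyframe_segments(duration_s=60 * 60, keyframe_interval_s=60)
print(len(segments))  # 60 segments for a 60-minute video
```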
A single uncompressed frame of 1080p video occupies 28MB in RAM, so 1 minute of 24fps video will take up 40GB. If you want to be able to run 4 cores at once it's 3 times that. You won't be doing that any time soon on your laptop or smartphone.
Curious as to your math? My naive thinking is 1920 * 1080 * 8 (generous) bytes is around 16MB.
I forgot where I got 28 from, but it's indeed a mistake. For normal display you could get away with 1920 * 1080 * 3 channels * 8 bits = 6MB. For a 10-bit display it would be around 8MB. You do indeed often use 32-bit float for high-quality processing, but since what we're storing here is the output frame, you would finish all that processing and then go down to 8 or 10 bits per channel. So recalculating the math, that's 8GB for 1 minute of video, still way too impractical.
I think the grandparent post is talking about decoding to RGB with a full 32-bit float per channel, which is 12 bytes per pixel rather than 8. The high precision is needed for HDR and for the extra processing you have to do to the pixels after they're decoded - motion compensation, gamma correction, etc.
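
For anyone wanting to check the numbers in this subthread, a small sketch of the arithmetic. It assumes three channels per pixel (RGB, no alpha, no chroma subsampling), tightly packed samples, 24 fps, and decimal MB/GB; none of those are stated precisely above.

```python
WIDTH, HEIGHT = 1920, 1080
FPS = 24
CHANNELS = 3  # assumption: RGB with no alpha and no chroma subsampling

def frame_bytes(bits_per_channel):
    """Bytes for one uncompressed 1080p frame, tightly packed."""
    return WIDTH * HEIGHT * CHANNELS * bits_per_channel // 8

def minute_bytes(bits_per_channel):
    """Bytes for one minute of uncompressed frames at FPS."""
    return frame_bytes(bits_per_channel) * FPS * 60

for bits in (8, 10, 32):
    print(f"{bits:>2} bits/channel: "
          f"{frame_bytes(bits) / 1e6:.1f} MB per frame, "
          f"{minute_bytes(bits) / 1e9:.1f} GB per minute")
```

Under these assumptions the 8-bit case is about 6.2 MB per frame and roughly 9 GB per minute, the 10-bit case about 7.8 MB per frame, and the 32-bit float case about 24.9 MB per frame and roughly 36 GB per minute, which is in the same ballpark as the figures quoted above.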
The maximum number of reference frames, i.e. how much the Decoded Picture Buffer has to hold, is 16. So even if a GOP is 1 minute long you would have to hold at most 16 pictures in memory to have enough information to stream over that 1-minute segment.

So I still do not see how this would prohibit parallel processing.
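
A toy sketch of that bound, assuming the 16-picture limit quoted above and 8-bit RGB frames. It evicts the oldest picture first, which is a simplification: real codecs mark and release reference pictures explicitly. The name `stream_gop` is hypothetical.

```python
from collections import deque

MAX_REFERENCE_FRAMES = 16  # limit quoted in the comment above

def stream_gop(frame_count, frame_size_bytes):
    """Walk through a GOP while keeping at most MAX_REFERENCE_FRAMES
    decoded frames alive. Returns the peak buffer size in bytes.

    Simplification: frames are evicted oldest-first.
    """
    dpb = deque(maxlen=MAX_REFERENCE_FRAMES)
    peak = 0
    for index in range(frame_count):
        dpb.append(index)  # stand-in for a decoded picture
        peak = max(peak, len(dpb) * frame_size_bytes)
    return peak

frame_size = 1920 * 1080 * 3  # assumption: 8-bit RGB, 3 bytes per pixel
gop_frames = 24 * 60          # a 1-minute GOP at 24 fps
peak = stream_gop(gop_frames, frame_size)
print(f"{peak / 1e6:.1f} MB")
```

In this model the peak stays at 16 frames (about 100 MB) no matter how long the GOP is. It only covers streaming through the segment, not keeping every output frame around for later display.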

Not sure how that would work. You have a thread that's decoding the frames 1 minute in front of where playback is, so if you're not decoding full frames and storing them until you need to display them, what is that thread doing?
Many of the listed points in TFA are not about broken-ness. A good chunk cover rarely-used features or less commonly used codecs for advanced applications.
As an example, there exist bitstreams where there aren't actually any keyframes, but instead the encoder guarantees that the decoder output converges to the correct picture after decoding some number of frames. It's actually kinda how MDCT audio codecs work; it's just very rare in video.
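
One way such a convergence guarantee could arise is a rolling refresh, where each frame carries a self-contained version of one slice of the picture. The toy model below is an assumption for illustration, not necessarily the scheme meant above; the name `frames_until_converged` is hypothetical, and it ignores prediction that reaches into regions not yet refreshed.

```python
def frames_until_converged(num_bands, start_frame):
    """Toy model of a keyframe-free stream with a rolling refresh.

    Assumption: frame n carries a self-contained (intra-coded) version
    of band n % num_bands, and every other band depends on the previous
    frame. A decoder joining at `start_frame` has a correct band only
    once that band has been refreshed.
    """
    correct = set()
    decoded = 0
    frame = start_frame
    while len(correct) < num_bands:
        correct.add(frame % num_bands)
        frame += 1
        decoded += 1
    return decoded

# Joining mid-stream: the picture is fully correct after one full cycle.
print(frames_until_converged(num_bands=10, start_frame=7))
```

In this model a decoder that joins at any point needs exactly one full refresh cycle before every band is correct, even though no single frame is a keyframe.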