That is still way too much. A 4k RGB 8-bit frame is about 25 MB, and many frames could be operated on at once, but I doubt the equivalent of around 2000 uncompressed 4k frames (fewer with 10-bit color) needs to be in memory all at once.
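As a rough sanity check on those numbers, here is a small Python sketch. It assumes "4k" means 3840x2160 (UHD), 3 bytes per pixel for 8-bit RGB, and that 10-bit samples are stored in 16-bit words; the function name is made up for illustration.

```python
def frame_bytes(width=3840, height=2160, channels=3, bytes_per_sample=1):
    """Size of one uncompressed frame in bytes (assumes packed samples, no padding)."""
    return width * height * channels * bytes_per_sample

one_frame = frame_bytes()          # 8-bit RGB, assumed UHD resolution
total = 2000 * one_frame           # the "around 2000 frames" figure

print(f"one frame:   {one_frame / 1e6:.1f} MB")   # 24.9 MB
print(f"2000 frames: {total / 1e9:.1f} GB")       # 49.8 GB

# Assumption: 10-bit samples stored in 2 bytes each, so the same
# memory holds half as many frames.
frames_10bit = total // frame_bytes(bytes_per_sample=2)
print(f"same memory at 10-bit: {frames_10bit} frames")  # 1000 frames
```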
Scaling video encode to 112 CPU cores is hard. I haven't looked closely at this encoder, but the normal way to scale that high is to encode entire segments in parallel. (YouTube in particular supposedly encodes each segment single-threaded, which is why libvpx has terrible scaling.) That effectively means encoding up to 112 independent 4k streams.
Each stream could need:
- one source frame
- additional source frames for reordering (3-7 is pretty normal)
- additional source frames for rate control (x264's default is 40)
- recon for the frame being encoded
- reference frames (IIRC AV1 allows up to 8 to be stored)
Plus MVs, modes, maybe subpel caches, etc.
That's easily 50-60 frames per stream. Times maybe 112 streams, that comes to around 6000 frames. Easily tunable of course, especially with even a little intra-segment parallelism.
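The tally above can be written out as a quick Python sketch. The per-stream counts are the ones from the list (1 source frame, 3-7 reorder frames, 40 for rate control, 1 recon, up to 8 references); the function and variable names are hypothetical, and the MVs, modes, and caches are left out.

```python
def frames_per_stream(reorder, lookahead=40, refs=8):
    """Frame buffers held by one stream, using the counts from the list above."""
    source = 1   # the frame currently being encoded
    recon = 1    # its reconstruction
    return source + reorder + lookahead + recon + refs

STREAMS = 112  # one independent segment encode per core

low = frames_per_stream(reorder=3)    # 53
high = frames_per_stream(reorder=7)   # 57

print(f"per stream: {low}-{high} frames")
print(f"all streams: {low * STREAMS}-{high * STREAMS} frames")  # 5936-6384
```

Dropping the rate-control lookahead from 40 to 10 in this sketch already takes the per-stream count down to 23-27 frames, which is the kind of tuning mentioned above.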
I understand how an encoder could eat up so much memory and justify it in some way, but I can't buy that it's a necessity or even acceptable in the long run (maybe this is stated to be in the prototype stage).
From what I've seen, AV1 breaks frames/segments up into a kd-tree and brute-forces the leaves to find the transformation that looks the best at the smallest size. An oversimplification, obviously, but with everything that encoders are doing I still think it is naive to design them with such a simplistic view of concurrency that the input has to be treated as a hundred small files for a hundred CPU cores.