There's no need to have a cap bigger than a kilobyte though.
What you need is for the slow core of the algorithm to run at a fixed speed, i.e. to take the same time no matter how long the input is.
You can get that either by reading the input bytes only during initialization, or by feeding only a fixed number of input bytes into the core on each round.
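A rough Python sketch of both options, under some assumptions. `MAX_PASSWORD_BYTES`, `prehash_then_stretch` and `fixed_chunk_stretch` are hypothetical names used only for illustration. Option 1 hashes the whole input once with SHA-256 and then runs PBKDF2 over the fixed 32-byte digest, so the slow loop never sees the raw input. Option 2 is a toy that only shows the cost structure (exactly `chunk` input bytes per round). It is not a vetted KDF, so don't use it for real passwords.

```python
import hashlib

# Hypothetical cap, per the ~1 KiB suggestion above.
MAX_PASSWORD_BYTES = 1024


def check_length(password: bytes) -> None:
    if len(password) > MAX_PASSWORD_BYTES:
        raise ValueError("password too long")


def prehash_then_stretch(password: bytes, salt: bytes, iterations: int) -> bytes:
    """Option 1: the input is read only during initialization."""
    check_length(password)
    # Single cheap pass over the whole input; linear in len(password).
    digest = hashlib.sha256(password).digest()
    # The slow core only ever sees this fixed 32-byte digest,
    # so its running time does not depend on the password length.
    return hashlib.pbkdf2_hmac("sha256", digest, salt, iterations)


def fixed_chunk_stretch(password: bytes, salt: bytes, rounds: int, chunk: int = 64) -> bytes:
    """Option 2 (toy illustration only, NOT a vetted KDF): every round
    consumes exactly `chunk` input bytes, cycling through the password."""
    check_length(password)
    source = password or b"\x00"  # avoid modulo-by-zero on empty input
    state = hashlib.sha256(salt).digest()
    offset = 0
    for _ in range(rounds):
        # Take a fixed-size window of input bytes, wrapping around if needed.
        window = bytes(source[(offset + j) % len(source)] for j in range(chunk))
        offset = (offset + chunk) % len(source)
        # Per-round cost is hashing 32 + chunk bytes, regardless of input length.
        state = hashlib.sha256(state + window).digest()
    return state
```

For example, `prehash_then_stretch(b"correct horse", os.urandom(16), 100_000)` returns a 32-byte key, and doubling the password length changes only the one-off SHA-256 pass, not the 100,000 PBKDF2 iterations.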