Well, there are diminishing returns in format shifting. The encoded barcode contains various types of quasi-redundant visual information (e.g. error correction codes) to allow decoding to happen, so for audio-based transfer it'd be better to skip the image encode and blast out the file directly.
That said... given the somewhat remarkable way fountain codes work, there's nothing stopping us from having a protocol that uses the audio and the video channels simultaneously for better throughput...
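To make the "both channels at once" idea concrete, here's a rough sketch. It is a toy random linear fountain code over GF(2), not cimbar's actual codec, and every name in it (`subset_mask`, `encode_symbol`, `Decoder`, `transfer`) is made up for illustration, as are the loss rates. Each symbol is the XOR of a seed-chosen subset of source blocks, so the receiver doesn't care which channel a symbol arrived on or which ones got lost; it just keeps collecting until it has enough independent ones.

```python
import random

def subset_mask(seed, k):
    """Bitmask of which of the k source blocks are XORed into symbol `seed`."""
    return random.Random(seed).getrandbits(k) or 1  # never the empty subset

def encode_symbol(blocks, seed):
    """One fountain symbol: (seed, XOR of the chosen source blocks)."""
    mask = subset_mask(seed, len(blocks))
    payload = 0
    for i, block in enumerate(blocks):
        if mask >> i & 1:
            payload ^= block
    return seed, payload

class Decoder:
    """Incremental Gaussian elimination over GF(2)."""
    def __init__(self, k):
        self.k = k
        self.rows = {}  # pivot bit -> (mask, payload), pivot is the mask's top bit

    def add(self, seed, payload):
        mask = subset_mask(seed, self.k)
        while mask:
            pivot = mask.bit_length() - 1
            if pivot not in self.rows:
                self.rows[pivot] = (mask, payload)  # new independent symbol
                return
            m, p = self.rows[pivot]
            mask ^= m       # cancel the top bit and keep reducing
            payload ^= p
        # mask reduced to 0: the symbol was redundant, so drop it

    def done(self):
        return len(self.rows) == self.k

    def solve(self):
        """Back-substitute from the lowest pivot up to recover the blocks."""
        blocks = [0] * self.k
        for pivot in range(self.k):
            mask, payload = self.rows[pivot]
            for j in range(pivot):
                if mask >> j & 1:
                    payload ^= blocks[j]
            blocks[pivot] = payload
        return blocks

def transfer(blocks, loss_video=0.3, loss_audio=0.5, rng_seed=1):
    """Send symbols alternately over two lossy channels into one decoder.

    The loss rates are arbitrary assumptions. There is no reverse channel:
    the sender never learns which symbols were dropped.
    """
    rng = random.Random(rng_seed)
    decoder = Decoder(len(blocks))
    sent = 0
    while not decoder.done():
        for loss in (loss_video, loss_audio):
            sent += 1  # the symbol counter doubles as the symbol's seed
            if rng.random() < loss:
                continue  # lost in transit
            decoder.add(*encode_symbol(blocks, sent))
    return decoder.solve(), sent

if __name__ == "__main__":
    source = [int.from_bytes(bytes([i] * 4), "big") for i in range(1, 17)]
    decoded, sent = transfer(source)
    print(decoded == source, "symbols sent:", sent)
```

The slower, lossier channel still contributes useful symbols, which is the whole appeal: throughput adds up across channels without any coordination beyond not reusing seeds.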
Would it be beneficial to use one channel (either audio or visual) to transmit the information and the other one for responses like acknowledgements? So kind of like TCP over two different channels?
Well, as for acks specifically -- cimbar itself doesn't really need them, thanks to fountain codes [0].
But I can imagine a reverse (request?) channel being useful, if it had enough bandwidth for the desired application. :)
As /u/ggerganov notes elsewhere in this thread (with some expertise on the audio side -- I can't claim any), the bandwidth of any audio channel is probably going to be pretty bad.
edit: Notwithstanding how viable of an idea it might be, HTTP over audio+video would be pretty neat. :)
I don't think air-gapped audio transmission could ever be fast and reliable enough to transfer files in a reasonable amount of time over reasonable transmitter-receiver distances. There are just too many hardware and physical limitations for this approach.
Having said that, I am actually working on a small library for data-over-sound which can be used for small data chunk transmissions across the room [0].