A/D conversion itself wouldn't add much at the sample level (say 48KHz sample, it's about 50us per sample). However, packetisation will - a 256 byte packet of 128 samples is 128*50us = 6.4ms right there at the transmitter, and the receiver won't notify until the full packet is received. So a naive digital approach would be 12.8ms (2x6.4ms) even before anything else.
A pure analogue approach (modulated RF), on the other hand, shouldn't have any human-detectable delay - it's effectively distance/speed of light plus a bit of phase shift (additional delay) introduced by the electronics, which should be only a handful of microseconds in total.
128-sample buffers are already too large. To compare, the nRF24 has a max buffer of 32 bytes. However, even in your 128-sample, 16-bit/48 kHz example, the latency is a bit better. It takes 1000/48000 * 128 ms to collect the 128 samples, or ~2.67 ms. That amounts to 16*128 = 2048 bits (about 2 kbit) of information that the transmitter has to send. At the nRF24's 2 Mbps rate, another ~1 ms is needed to send the buffer. I'm not sure why you'd think that this time needs to be doubled at the receiver. Even if the nRF24 receiver only started moving the buffer out after it was fully received, it does so over a 10 MHz serial connection, so that would be at most another 0.2 ms, for a total of <4 ms. For 16-sample buffers and 24-bit/48 kHz, the end-to-end latency is ~0.6 ms.
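To make the arithmetic above easy to re-run with other buffer sizes, here is a rough Python sketch of the same budget. The function name and parameters are made up for illustration; it assumes one bit per clock on the serial link and ignores packet overhead (preamble, address, CRC) as well as the fact that a 256-byte buffer would have to be split across several 32-byte payloads.

```python
def latency_budget_ms(samples, bits_per_sample,
                      sample_rate_hz=48_000,
                      air_rate_bps=2_000_000,
                      spi_rate_hz=10_000_000):
    """Return (collect, air, spi, total) in milliseconds for one buffer.

    Illustrative only: no protocol overhead, no retransmissions.
    """
    payload_bits = samples * bits_per_sample
    collect = samples / sample_rate_hz * 1000   # time to fill the buffer
    air = payload_bits / air_rate_bps * 1000    # time on air at the radio bit rate
    spi = payload_bits / spi_rate_hz * 1000     # time to clock it out of the receiver
    return collect, air, spi, collect + air + spi


for samples, bits in [(128, 16), (16, 24)]:
    collect, air, spi, total = latency_budget_ms(samples, bits)
    print(f"{samples} samples @ {bits}-bit: collect {collect:.2f} ms, "
          f"air {air:.2f} ms, SPI {spi:.2f} ms, total {total:.2f} ms")
```

With the defaults this lands a little under 4 ms for the 128-sample case and around 0.56 ms for the 16-sample case, in line with the figures quoted above.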
Agreed, and that's why the "naive" is in my comment :) An even faster approach is to drop the packetisation and run the transmitter continuously with an appropriate (self-clocking) code; the digital radio delay then drops to microseconds. That might in turn make the 48 kHz 16-bit ADC seem the limit (21us per sample), but one can always use a faster ADC (after appropriate front-end filtering). Out in the real world, though, error correction is needed, so you generally need to use a codec with forward error correction.
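The comment above doesn't say which self-clocking code is meant. As one common example, here is a minimal Manchester encoder/decoder sketch in Python, using the IEEE 802.3 convention (0 sent as high-then-low, 1 as low-then-high). Every bit carries a mid-bit transition, so the receiver can recover the clock from the stream itself; the price is two line symbols per data bit.

```python
def manchester_encode(bits):
    """Encode data bits into half-bit line levels (IEEE 802.3 convention)."""
    out = []
    for b in bits:
        # 1 -> low, high ; 0 -> high, low
        out.extend((0, 1) if b else (1, 0))
    return out


def manchester_decode(halves):
    """Decode half-bit line levels back into data bits."""
    if len(halves) % 2:
        raise ValueError("expected an even number of half-bit symbols")
    bits = []
    for first, second in zip(halves[0::2], halves[1::2]):
        if first == second:
            # A valid Manchester symbol always changes level mid-bit.
            raise ValueError("missing mid-bit transition")
        bits.append(1 if (first, second) == (0, 1) else 0)
    return bits


data = [1, 0, 1, 1]
line = manchester_encode(data)
assert manchester_decode(line) == data
```

This only shows the line coding; a real link would still need framing and, as noted, some form of error correction on top.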
That's what a 16-sample buffer would get you. In reality, there are many devices on the market that can get close to 2.5ms. For example, Line 6 claims 2.8 ms end-to-end, BOSS claims 2.3 ms [2], the NUX B8 is at 2.5ms [3].
Longer buffers allow for jitter in processing further upstream. If you get a buffer every 0.6ms, you need to be able to process every single one within 0.6ms.
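A toy Python sketch of that trade-off, under loudly stated assumptions: each buffer costs a fixed fraction of its period to process (`base_load`) plus an occasional stall (scheduler hiccup, interrupt, and so on). The stall values and the 50% load are invented purely for illustration.

```python
def count_overruns(period_ms, stalls_ms, base_load=0.5):
    """Count buffers whose processing time exceeds the buffer period.

    period_ms : how often a new buffer arrives
    stalls_ms : hypothetical extra delay added to each buffer's processing
    base_load : fraction of the period spent on normal processing
    """
    overruns = 0
    for stall in stalls_ms:
        processing_ms = base_load * period_ms + stall
        if processing_ms > period_ms:
            overruns += 1
    return overruns


stalls = [0.0, 0.1, 0.5, 0.05, 1.0]   # made-up jitter, in ms

print("0.6 ms buffers: ", count_overruns(0.6, stalls), "overruns")
print("2.67 ms buffers:", count_overruns(2.67, stalls), "overruns")
```

With the same stalls, the short buffer misses its deadline whenever a stall exceeds its 0.3 ms of slack, while the longer buffer has enough slack to absorb all of them, at the cost of the extra latency discussed above.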
It worked like that for many, many years (radio mics) before digital came along. Think of FM radio - if you have a good signal, it's pretty resilient to interference, and in a controlled short-range environment it is extremely reliable.