I think the benefit of a discrete optocoupler is in keeping the communication point-to-point, so no other device (malicious or otherwise) can "listen in". A low-power light signal won't penetrate a solid enclosure; it's much harder to prevent mechanical vibrations from leaking information beyond the coupler - you'd need to keep the speaker and microphone on some kind of suspension (springs and shock absorbers) acting as a low-pass filter.
All speakers can act as microphones. But you'd have a much harder time turning a photodiode into a light-emitting one (the physics means you only can get IR out and the LED can't receive anything that way).
> the physics means you only can get IR out and the LED can't receive anything that way
Gut feeling tells me there is a way, if you use way more power than normal for this :). Much like with making speakers receive sound (you need to amplify the received signal afterwards) and making microphones produce it.
But it doesn't really matter whether or not you can reverse the analog signal flow, if the digital side treats the I/O pins as unidirectional.
If the digital side could be trusted, we'd just set it to send-only mode and be sure it'll behave. In reality we don't trust it.
The threat model where you use a data diode presumes an adversary might totally compromise the sending side - the guarantee you're trying to add is that whatever malware they push down the line has no ability to exfiltrate data regardless of how compromised it is.
Shannon-Hartley says the theoretical maximum data rate for a channel with AWGN is the bandwidth times log2(1 + SNR), with SNR taken as a linear power ratio. For an off-the-shelf microphone/speaker pair, I think 16 kHz and 80 dB are probably decent guesses. That would give a theoretical maximum data rate of about 425 kb/s. The practical limit is probably much lower.
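For anyone who wants to check the arithmetic, here is a minimal sketch of the Shannon-Hartley calculation using the guessed figures above (16 kHz, 80 dB). The function name `shannon_capacity_bps` is just made up for illustration, and the result is an idealized upper bound for an AWGN channel, not a prediction for real audio hardware.

```python
import math


def shannon_capacity_bps(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon-Hartley upper bound C = B * log2(1 + S/N) for an AWGN channel."""
    # SNR in dB is a power ratio, so 80 dB corresponds to a linear ratio of 1e8.
    snr_linear = 10 ** (snr_db / 10)
    return bandwidth_hz * math.log2(1 + snr_linear)


# Guessed figures from the comment above; these are assumptions, not measurements.
capacity = shannon_capacity_bps(16_000, 80)
print(f"{capacity / 1000:.0f} kb/s")  # prints "425 kb/s"
```

Since capacity is linear in bandwidth but only logarithmic in SNR, doubling the usable bandwidth doubles the bound, while an extra 10 dB of SNR adds only about 3.3 bits per second per hertz.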
It may be possible to increase the bandwidth by increasing the sample rate on both ends, but this quickly leaves the realm of consumer audio equipment (and consumer pricing). At some point you'd exceed the reasonable frequency responses for each device, as well as the medium. I imagine that air attenuates ultrasonic frequencies more than lower ones, but that's just a guess.