Google provides closed captioning for Meet calls. That means it's not E2E encrypted. Also, pretty much no service can provide multi-party video calls with adaptive quality without completely destroying your bandwidth.
I'm interested in knowing more about why closed captions would imply the calls aren't end-to-end encrypted. Wouldn't it be possible to build a model, distribute it with the client-side application, and run it at the edge?