Not milliseconds, but AudioLM [1] already does it with just seconds of audio, for speech (and piano). The results are already very convincing (to me).
[1] https://google-research.github.io/seanet/audiolm/examples/