Have you seen https://www.descript.com/? It transcribes video and lets you edit the transcript, and those edits are reflected in the video. You can even train it on a voice if you have enough content.
I mean, you can always write an EDL (edit decision list). The issue with most filmic material, however, is that moving pictures have their own internal timing, movements, directional changes and so on. Ignoring the content of shots is something you might get away with for long shots of talking heads, but in my experience literally everything else won't let you.
This would be particularly useful for speeches or presentations where the content is the important part and the visuals don't change much (or wouldn't create jarring cuts when just editing based on the transcript).
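For anyone curious what "editing based on the transcript" might boil down to, here is a rough sketch in Python. It assumes the transcriber gives you word-level timestamps; the `Word` type and the `keep_ranges` function are made up for illustration and are not how Descript or any particular tool does it. Deleting words from the transcript just turns into a list of source time ranges to keep, which is more or less what you would then write out as an EDL.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds into the source video
    end: float    # seconds into the source video


def keep_ranges(words, deleted):
    """Return (start, end) spans of source video to keep.

    `deleted` is a set of word indices removed from the transcript.
    Kept words that are adjacent in the transcript are merged into one span,
    so every deleted run of words produces exactly one cut.
    """
    ranges = []
    prev_kept = None
    for i, word in enumerate(words):
        if i in deleted:
            continue
        if prev_kept is not None and prev_kept == i - 1:
            # Previous word was kept too: extend the current span.
            ranges[-1] = (ranges[-1][0], word.end)
        else:
            # First kept word, or first kept word after a deletion: new span.
            ranges.append((word.start, word.end))
        prev_kept = i
    return ranges


if __name__ == "__main__":
    words = [
        Word("so", 0.0, 0.3),
        Word("um", 0.3, 0.8),
        Word("welcome", 0.8, 1.4),
        Word("everyone", 1.4, 2.0),
    ]
    # Remove the "um" from the transcript.
    for start, end in keep_ranges(words, deleted={1}):
        print(f"keep {start:.2f}s to {end:.2f}s")
```

Note that this only looks at word timings and knows nothing about what is in the picture, which is exactly why it suits talking heads and slides better than material with its own movement and rhythm.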