| The syncing is done automatically, at least mostly. TL;DR: ScreenplaySubs fetches the subtitles from Netflix, parses the PDF-formatted screenplays into JSON, and syncs the two by calculating sentence similarity between the subtitles and the screenplay dialogue. In particular, we use the Universal Sentence Encoder to decide whether a subtitle matches a line of screenplay dialogue. If a line of screenplay dialogue is similar enough to a subtitle, it is tagged with that subtitle's timestamp. A lot of the underlying problems at each step sound deceptively simple at first, but turn out to be quite challenging and fun to research. E.g. parsing PDFs in general is not straightforward (https://filingdb.com/b/pdf-text-extraction), and there’s only a handful of resources on parsing PDF screenplays besides a handful of research papers (https://github.com/drwiner/ScreenPy/blob/master/INT17_screen...), which led us to create our own open source repo for this (https://github.com/SMASH-CUT/screenplay-pdf-to-json). Our screenplay-pdf-to-JSON converter captures all the dialogue, transitions, and actions within a particular screenplay scene. With this, we’re treating scenes as atomic, which lets us detect changes in scene ordering based on the tagged scene timestamps. This also means that if lines of dialogue are swapped within a scene in the movie, there will be some syncing inconsistencies. Some scenes have little to no dialogue, in which case the extension works on a best-effort basis. E.g. the opening scene of There Will Be Blood has minimal dialogue, if any at all. This is the case where I need to jump in and sync up the screenplay manually. OTOH, the opening scene of Inglourious Basterds works very well, since there is tons of dialogue in it. This is the reason why I can’t just add movies and instantly upload them to the site. Would you be interested in me going into more detail?
I was thinking of writing a series of technical blog posts if there is enough interest! |
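To make the matching step above concrete, here is a minimal sketch of the idea, not the actual ScreenplaySubs code. It assumes subtitles arrive as `(start_seconds, text)` pairs and that a scene is a list of dialogue strings. To stay dependency-free it swaps in a bag-of-words cosine similarity where the real pipeline uses the Universal Sentence Encoder. The names `embed`, `tag_dialogue`, `scene_start` and the `0.6` threshold are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: word counts. NOT the Universal Sentence Encoder."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tag_dialogue(dialogue_lines, subtitles, threshold=0.6):
    """Pair each dialogue line with the timestamp of its most similar
    subtitle, or None if nothing is similar enough (assumed threshold)."""
    tagged = []
    for line in dialogue_lines:
        line_vec = embed(line)
        best_time, best_score = None, 0.0
        for start_seconds, text in subtitles:
            score = cosine(line_vec, embed(text))
            if score > best_score:
                best_time, best_score = start_seconds, score
        tagged.append((line, best_time if best_score >= threshold else None))
    return tagged


def scene_start(tagged):
    """Earliest matched timestamp in a scene; None means the scene had no
    usable dialogue and would need manual syncing."""
    times = [t for _, t in tagged if t is not None]
    return min(times) if times else None


subtitles = [(12.0, "Where were you last night?"), (30.5, "I was at the station.")]
scene = ["Where were you last night?", "I was at the station!", "Nothing like this"]
tagged = tag_dialogue(scene, subtitles)
print(tagged)
print(scene_start(tagged))
```

Treating the scene as the atomic unit means only `scene_start` feeds into ordering, which is why a scene with no matched lines falls through to `None` instead of getting a guessed timestamp.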
Over the last several years I've imagined a lot of projects (both serious utilities, and the absurd/artistic) in roughly the territory you're exploring...
- For my MFA thesis (2012) I used plaintext transcripts of a TV show (thankfully plaintext, though they had plenty of their own problems) as a corpus for generating poems, and at the time I thought it would be an interesting follow-up project to turn the poems back into video clips.
- Mapping film quotes/citations back to the script/film, and accuracy-checking movie quotes. (I can imagine both of these being useful for film forums like the movies/sci-fi Stack Exchange sites.)
- Generating script-cuts of movies that re-order/drop scenes and just show the printed script on-screen where scenes were cut.
- A film-analysis/screenwriting-class sort of interface oriented around reading a segment and then playing it (could be particularly interesting when there happen to be multiple known script drafts?)
- Reconstructing a character monologue from lines spoken by an actor who turned down the role.
- Generating a super-cut of actor X saying Y.
- Generating focused cuts of a film that cover, say, every scene a given character does/doesn't appear in, or every scene that mentions X.