by minimaxir 3778 days ago
I took a look at the code in the author's GitHub repository.

The data sources are CSVs in this repository: https://github.com/BobAdamsEE/SouthParkData/

It looks like all the data is preprocessed, with nearly everyone having only 1 line. (Actually, it appears the line you note in 10-3 is broken!) You could argue that the script isn't processed correctly, but that's beyond the scope of the analysis, though a note about it might be helpful.


It's my repository. I'll look at how the Python script handles flashback events later today. Thanks for the feedback!
It appears that there are two issues that affect small parts of the captured datasets:

1) Colored character names are not handled properly. I looked for <th> tags, not <th bgcolor="beige"> tags.

2) Character names that start with a lowercase letter are not handled. This may be because other episodes use lowercase-prefixed table headers for stage directions; I have to double-check.
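
For anyone following along, here is a minimal sketch of how both cases could be picked up. It is not the repository's actual script. It assumes the transcript pages put speaker names in <th> cells, and the class name, function name, and sample markup are all made up for illustration. It uses Python's standard html.parser, which reports the tag name separately from its attributes, so <th> and <th bgcolor="beige"> are treated the same, and it does not filter on the case of the first letter.

```python
from html.parser import HTMLParser


class HeaderCellCollector(HTMLParser):
    """Collect the text of every <th> cell, whatever attributes it carries."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_th = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # The parser passes attributes separately from the tag name,
        # so <th> and <th bgcolor="beige"> both arrive here as "th".
        if tag == "th":
            self._in_th = True
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside a <th> cell.
        if self._in_th:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "th" and self._in_th:
            text = "".join(self._buffer).strip()
            if text:
                # No capitalization check: lowercase names are kept too.
                self.names.append(text)
            self._in_th = False


def extract_header_cells(html):
    """Return the text of all <th> cells in document order."""
    parser = HeaderCellCollector()
    parser.feed(html)
    parser.close()
    return parser.names


# Hypothetical markup, not copied from any real transcript page.
sample = (
    "<table>"
    "<tr><th>Cartman</th><td>line one</td></tr>"
    '<tr><th bgcolor="beige">Kyle</th><td>line two</td></tr>'
    "<tr><th>stan</th><td>line three</td></tr>"
    "</table>"
)
print(extract_header_cells(sample))
```

Note that this keeps every header cell, so lowercase stage directions would come through as well. Telling those apart from lowercase character names would need an extra check that depends on how the pages are actually laid out.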