

by minimaxir 3825 days ago

I took a look at the code in the author's GitHub repository.

The data sources are CSVs in this repository: https://github.com/BobAdamsEE/SouthParkData/

Looks like all the data is preprocessed, with almost everyone having only 1 line. (Actually, it appears the line you note in 10-3 is broken!) You could argue that the script isn't processed correctly, but that's beyond the scope of the analysis. A note might be helpful, though.


bobadams5 3825 days ago

It's my repository. I'll look at how the Python script handles flashback events later today. Thanks for the feedback!


bobadams5 3824 days ago

It appears that there are two issues that affect small parts of the captured datasets:

1) Colored character names are not handled properly. I looked only for bare <th> tags, so <th bgcolor="beige"> tags were missed.

2) Character names that start with a lowercase letter are not handled. This may have to do with other episodes using lowercase-prefixed table headers for stage directions; I have to double-check.
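
For illustration, here is a minimal sketch of one way both issues could be avoided. It is not the code in the repository. It assumes the transcripts are HTML tables with the speaker in a <th> cell, and the names `SpeakerParser`, `extract_speakers`, and the sample markup are all made up for the example. It uses the standard library's `html.parser`, which matches on the tag name alone, so attributes such as bgcolor don't matter, and it doesn't filter on the case of the first letter.

```python
from html.parser import HTMLParser


class SpeakerParser(HTMLParser):
    """Collect the text of every <th> cell, whatever attributes the tag has."""

    def __init__(self):
        super().__init__()
        self._in_th = False
        self._buf = []
        self.speakers = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases the tag name; attrs (e.g. bgcolor) are ignored,
        # so <th> and <th bgcolor="beige"> are treated the same way.
        if tag == "th":
            self._in_th = True
            self._buf = []

    def handle_data(self, data):
        if self._in_th:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "th" and self._in_th:
            self._in_th = False
            name = "".join(self._buf).strip()
            # No check on the first letter's case: lowercase names are kept.
            if name:
                self.speakers.append(name)


def extract_speakers(html):
    """Return the <th> texts in document order (hypothetical helper)."""
    parser = SpeakerParser()
    parser.feed(html)
    parser.close()
    return parser.speakers


# Made-up sample markup, not taken from the real transcripts.
sample = (
    "<table>"
    "<tr><th>Cartman</th><td>first line</td></tr>"
    '<tr><th bgcolor="beige">Kyle</th><td>second line</td></tr>'
    "<tr><th>stan</th><td>third line</td></tr>"
    "</table>"
)

speakers = extract_speakers(sample)
print(speakers)
```

Note that this keeps every header cell, so if lowercase headers really are stage directions in some episodes, they would still need to be filtered out separately (for example against a list of known character names).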
