by nanis 3778 days ago
Also, I am going to go out on a limb here and guess that R's `read.csv` doesn't do what one hopes it would when fed this CSV:

    10,3,Brian,"You mean like the time you had tea with
    Mohammad, the prophet of the Muslim faith?
    Peter:
    Come on, Mohammad, let's get some tea.
    Mr. T:
    Try my ""Mr. T. ...tea.""
    "
Well, it seems people are not understanding the problem with this line. Here is the screenshot of the original script: http://imgur.com/pcu5N2U

    Brian: 	You mean like the time you had tea with Mohammad, the prophet of the Muslim faith? [flashback #3]
    Peter: 	Come on, Mohammad, let's get some tea. [Mohammad is covered by a black box with the words "IMAGE CENSORED BY FOX" printed several times from top to bottom inside the box. They stop at a tea stand.]
    Mr. T: 	Try my "Mr. T. ...tea." [squints]
There, three characters speak.

However, R's read.csv will assign all three characters' speech to Brian: http://imgur.com/gLpPKdl

    > x[596, ]
       Season Episode Character
    596     10       3     Brian
              Line
    596 You mean like the time you had tea with Mohammad, the prophet of the Muslim faith? \nPeter:\nCome on, Mohammad, let's get some tea. \n

    > x[597,]
        Season Episode Character
    597     10       3     Brian
                                                Line
    597 You mean like the time you had tea with Mohammad, the prophet of the Muslim faith? \nPeter:\nCome on, Mohammad, let's get some tea. \nMr. T:\nTry my "Mr. T. ...tea." \n
It also seems to duplicate part of the conversation: row 596 stops after Peter's line, and row 597 repeats the same text with Mr. T's line appended.
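For what it's worth, here is a rough Python sketch of how the merged field could be split back into one turn per speaker. It assumes (my assumption, not something the dataset guarantees) that a line consisting only of a capitalized name followed by a colon marks a new speaker. `split_turns` and `SPEAKER_LINE` are names made up for this illustration.

```python
import re

# Heuristic (an assumption): a line that is nothing but "Name:" starts a new turn.
SPEAKER_LINE = re.compile(r"^([A-Z][\w .']*):\s*$")

def split_turns(first_speaker, text):
    """Split a multi-line Line field into (speaker, speech) pairs."""
    turns = []
    speaker, buf = first_speaker, []
    for line in text.splitlines():
        m = SPEAKER_LINE.match(line)
        if m:
            # Flush what the previous speaker said before switching.
            if buf:
                turns.append((speaker, " ".join(buf)))
            speaker, buf = m.group(1), []
        elif line.strip():
            buf.append(line.strip())
    if buf:
        turns.append((speaker, " ".join(buf)))
    return turns

# The Line field of the record above, after CSV unquoting.
field = (
    "You mean like the time you had tea with\n"
    "Mohammad, the prophet of the Muslim faith?\n"
    "Peter:\n"
    "Come on, Mohammad, let's get some tea.\n"
    "Mr. T:\n"
    'Try my "Mr. T. ...tea."\n'
)
turns = split_turns("Brian", field)
for who, what in turns:
    print(f"{who}: {what}")
```

A spoken line that happens to consist only of a capitalized phrase and a colon would be misread as a speaker label, so treat this as a starting point rather than a fix.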

PS: In addition, both spellings, Muhammad and Mohammad, appear in the data, so a count of either one presumably under-counts the references to the prophet.
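One way to count the spellings together, sketched in Python. The pattern also accepts the "-ed" endings, which is my own choice and not something taken from the analysis. The third sample line is made up purely to exercise the second spelling.

```python
import re

# Matches Mohammad, Muhammad, Mohammed and Muhammed, case-insensitively.
PROPHET = re.compile(r"\bM[ou]hamm[ae]d\b", re.IGNORECASE)

lines = [
    "You mean like the time you had tea with Mohammad, the prophet of the Muslim faith?",
    "Come on, Mohammad, let's get some tea.",
    "Muhammad is on the show.",  # made-up line, not from the dataset
]

# Total mentions across all lines, regardless of spelling.
count = sum(len(PROPHET.findall(line)) for line in lines)
print(count)
```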


I took a look at the code in the author's GitHub repository.

The data sources are CSVs in this repository: https://github.com/BobAdamsEE/SouthParkData/

Looks like all the data is preprocessed, with most records holding a single character's line. (Actually, it appears the record you note in 10-3 is broken!) You can argue that the script isn't processed correctly, but that's beyond the scope of the analysis, although a note might be helpful.

It's my repository. I'll look at how the Python script handles flashback events later today. Thanks for the feedback!
It appears that there are two issues that affect small parts of the captured datasets:

1) Colored character names are not handled properly. I looked for <th> tags, not <th bgcolor="beige"> tags.

2) Character names that start with a lowercase letter are not handled. This may have to do with other episodes using lowercase-prefixed table headers for stage directions; I'll have to double-check.
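As a rough illustration of the first issue, here is a sketch using only Python's standard `html.parser`, which reports a `<th>` start tag the same way whether or not it carries attributes. This is not the repository's actual script. The class name `SpeakerHeaders` and the sample markup are made up for the example, and it deliberately makes no attempt to tell lowercase names apart from stage directions.

```python
from html.parser import HTMLParser

class SpeakerHeaders(HTMLParser):
    """Collect the text of every <th> cell, whatever attributes it carries."""

    def __init__(self):
        super().__init__()
        self.in_th = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # tag arrives lowercased; attrs (e.g. bgcolor) are simply ignored here.
        if tag == "th":
            self.in_th = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "th":
            self.in_th = False

    def handle_data(self, data):
        if self.in_th:
            self.names[-1] += data

# Made-up markup that mimics the two cases described above.
SAMPLE = (
    "<table>"
    "<tr><th>Brian</th><td>...</td></tr>"
    '<tr><th bgcolor="beige">Peter</th><td>...</td></tr>'
    "<tr><th>mr. t</th><td>...</td></tr>"
    "</table>"
)

parser = SpeakerHeaders()
parser.feed(SAMPLE)
names = [n.strip() for n in parser.names]
print(names)
```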

Why not? That's a single valid CSV record with 4 "columns". When a field is surrounded by quotes, it IS legal for it to span multiple lines.
And, did you notice that the other lines comprise other characters' speech?
Just tested, it handles that fine. (R 3.1.3)
Sure, if you mean attributing Mr. T and Peter's speech to Brian is fine, then, yes, it handles it fine.
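Both sides of this exchange can be checked with a few lines of Python's standard `csv` module (Python rather than R here, purely for illustration): the record parses cleanly as one row with four fields, and the fourth field still contains the other characters' speech. `RAW` is just the sample from the top of the thread typed in as a string.

```python
import csv
import io

# The record quoted at the top of the thread.
RAW = (
    '10,3,Brian,"You mean like the time you had tea with\n'
    "Mohammad, the prophet of the Muslim faith?\n"
    "Peter:\n"
    "Come on, Mohammad, let's get some tea.\n"
    "Mr. T:\n"
    'Try my ""Mr. T. ...tea.""\n'
    '"\n'
)

# newline="" leaves line endings untouched so the csv module sees them as-is.
rows = list(csv.reader(io.StringIO(RAW, newline="")))

print(len(rows))              # number of records
print(len(rows[0]))           # number of fields in the record
print(rows[0][2])             # the Character column
print("Peter:" in rows[0][3]) # other speakers are embedded in the Line column
```

So the parser is behaving correctly for the file it was given; the attribution problem is in how the file was produced.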