

by cool_shit 3351 days ago

Sometimes data is not beautiful, but very ugly. These results are based on a flawed premise. One red flag -- where is the word "dank" in your list? Where are the words used by people who actually smoke weed? Also, is "where score > 100" a good heuristic for this kind of study? I would argue that "where score < 100" is a better one.

For example, shills and superusers (the people who get the top comments) will not be using domain-specific language -- they will be using language that caters to a general audience. If this is true, filtering on high scores would squeeze most of the interesting language out of your study. Have you been to the Grass City forums? I am guessing those people aren't using terms like "Donald Trump" in their everyday conversations about weed.
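To make the point about the score filter concrete, here is a rough sketch of how you could check it. The comments below are made-up toy data, not real Reddit rows, and `term_counts` and `THRESHOLD` are names I picked for illustration. It only shows the mechanics of comparing the vocabulary on either side of the cutoff; it proves nothing about the actual dataset.

```python
from collections import Counter

# Hypothetical toy comments as (score, text) pairs; NOT real Reddit data.
comments = [
    (250, "donald trump should legalize weed"),
    (180, "legalize weed now"),
    (12, "this dank weed is so dank"),
    (3, "picked up some dank bud today"),
]

THRESHOLD = 100  # the cutoff from "where score > 100"


def term_counts(rows):
    """Count whitespace-separated, lowercased words across (score, text) rows."""
    counts = Counter()
    for _score, text in rows:
        counts.update(text.lower().split())
    return counts


high = term_counts(r for r in comments if r[0] > THRESHOLD)
low = term_counts(r for r in comments if r[0] < THRESHOLD)

# Words that appear only below the cutoff, i.e. what the filter throws away.
only_low = sorted(set(low) - set(high))
print(only_low)
```

If the words that only show up below the cutoff are the domain-specific ones, that is a sign the "score > 100" filter is selecting for general-audience language.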

Reddit is a huge melting pot and probably isn't a good place for insight about potheads. Grass City might not be either -- Grass City users are not typical potheads. The best place would be 10th grade high school social circles and college dorms. It really is amazing how little data is produced by social networks, in the grand scheme of things. We are all so used to hearing about how much data is produced by the internet, but there is orders of magnitude more data in the raw world just waiting to be scooped up.