Words growing or shrinking in Hacker News titles: a tidy analysis (varianceexplained.org)
118 points by var_explained 3298 days ago
7 comments

A frequent post then would be:

"Using VR to train a deep learning neural network on driving and react correctly to unexpected conditions, a bot implemented via a microservices stack using aws as a container and of course connected with cars and related traffic devices via the IoT, logging unexpected events into a blockchain."

Missing "HN" somewhere.
Just by eyeballing it, I am pretty sure that exceeds 80 chars.
I'm surprised "NSA" and "surveillance" are two of the fastest shrinking words. I thought we saw more of them now than ever. Shows how perception doesn't always match reality.
When the Snowden leaks first dropped, the front page was absolutely overwhelmed with NSA news, to the exclusion of nearly everything else. It would not be possible to keep that level of interest up without making this an exclusively NSA/surveillance-driven site.
IIRC the mods also soft banned it because of that. So posts with the word "NSA" in the title get penalized and ranked much lower than other posts. Hence the shrinking.
We did that for a while but stopped years ago. It was only needed until the story barrage leveled off.
Hmm, the BigQuery HN dataset is now updated daily and contains comments as well as stories? That's new, and I'll certainly give it another look for my projects.

With the bigrquery R package (https://github.com/rstats-db/bigrquery), you can access the HN dataset directly from R, using dplyr syntax too. (For simple queries at least; you can pass raw SQL for complex queries.)

As noted, the resulting dataset of words is large, so mapping the words in BigQuery itself may be more practical (using a combo of SPLIT and UNNEST with standard SQL), although of course you can't do complex operations like logistic regression or splines there.
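
As a rough sketch of what that SPLIT-and-UNNEST step does, here is the same reshaping emulated locally in plain Python: each title is split into words, and each word becomes its own (year, word) row before counting. The sample titles and the names `SAMPLE_STORIES`, `explode_titles` and `count_words_by_year` are made up for illustration; this is not the article's code and not an actual BigQuery query.

```python
from collections import Counter

# Hypothetical sample rows (year, title); not taken from the real HN dataset.
SAMPLE_STORIES = [
    (2013, "NSA surveillance program revealed"),
    (2013, "Show HN: NSA leak tracker"),
    (2016, "Deep learning with Rust"),
]


def explode_titles(stories):
    """Yield one (year, word) pair per word, like SPLIT followed by UNNEST."""
    for year, title in stories:
        for word in title.lower().split():
            # Trim surrounding punctuation so "hn:" and "hn" count as one word.
            word = word.strip(".,:;!?\"'()")
            if word:
                yield year, word


def count_words_by_year(stories):
    """Count how many times each word appears in titles per year."""
    return Counter(explode_titles(stories))


counts = count_words_by_year(SAMPLE_STORIES)
print(counts[(2013, "nsa")])
```

Doing the split and count server-side keeps the download down to one row per (year, word) pair, and the modeling (regression, splines) can then happen in R on that much smaller table.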

>I don’t currently have a guess for why “million” and “billion” had sudden dropoffs in 2014. Is it some artifact of the Hacker News policy, with the word becoming edited or deleted in newer posts? Or is it a real change in what the site discusses?

Any guesses on this one?

A more interesting analysis would be comment length.
In my old analysis (http://minimaxir.com/2014/10/hn-comments-about-comments/), it's not that interesting.

Comments are getting longer over time on average (http://minimaxir.com/img/hn-comments/monthly_average_words.p...), and there is a slight positive correlation between comment score and comment length (http://minimaxir.com/img/hn-comments/distribution_comment_po...), but that can't be remade with the BigQuery dataset since comment scores are no longer public.

It would be nice to see a comparison of fastest growing words between the last 5 years vs 10 years ago. I'm wondering about the demographics of this site and if they've changed.
I am extremely surprised rust wasn't included in here.
I've got a followup coming about what words lead to upvotes, and rust features quite prominently there!