by inteoryx 1588 days ago
For the title, I did say "At Harvard" and not "In Harvard" or "By Harvard". I think the student newspaper is indeed at Harvard. I'm also pretty clear about what text I'm looking at in the article. In the title I used "Harvard" over "The Crimson" because I figured fewer people would know "The Crimson" compared to "Harvard".

Regarding the flaws in the article (phew, I'm glad I didn't quite reach criminal level) - I'm curious which assumptions you think are horrible or which data is shoddy. I don't think, for example, that I'm assuming the latent space model is "correct" as you say. I don't think I really have any significant assumptions about the technique or the meaning behind it. I read about the technique in the linked paper and reproduced it in my blog with a different dataset and found a similar result. It's strange to me that the signal produced by this technique is as consistent as it is across the 120 years of data. Beyond that, I'm pretty explicit that I don't know what it means or why it happens.

Regarding the "arbitrarily fit" line - as I say explicitly in the post, that's a regression plot to illustrate the trend.

Regarding the possibility that The Crimson runs more articles per year - it's true that's possible, but it's not the reality: they run about (for a generous definition of "about") the same number of articles every year. The articles do get longer over time. Either way, it's not clear to me what impact this should have on average cosine distance.

There are a lot of things that I looked at that didn't make it into the blog post. Without them, perhaps it looks like I'm cutting corners. If I did include them, I think the blog post would be shooting off in many directions. For example, I considered that political violence might be related - like maybe, in times when there's lots of political violence, elite institutions come together and their language becomes more similar. That didn't really pan out though. I graphed a bunch of things that I ultimately decided didn't contribute very much and did not include.

Another way of thinking about it is that in the original article, Rasmussen (the original author) says "Look at this elite writing in NSF grants. The cosine distance is decreasing over time." I then say "Here is some elite writing - student newspaper at an elite school. Is the cosine distance decreasing there over time too?" And, it is. That's what the blog post is trying to say.
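
To make "average cosine distance per year" concrete, here is a minimal sketch. It assumes each article has already been reduced to one embedding vector and that the yearly figure is the mean over all pairs of articles in that year; the blog post may aggregate differently. The vectors and the names (toy_embeddings_by_year, mean_pairwise_distance) are made up for illustration and are not the real data or code.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(vectors):
    """Average cosine distance over every unordered pair of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Hypothetical 2-d "article embeddings" per year (toy values, not real data).
toy_embeddings_by_year = {
    1900: [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # articles point in varied directions
    2000: [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2]],  # articles point in similar directions
}

mean_distance_by_year = {
    year: mean_pairwise_distance(vectors)
    for year, vectors in toy_embeddings_by_year.items()
}

for year, distance in sorted(mean_distance_by_year.items()):
    print(year, round(distance, 4))
```

In this toy setup the 2000 value comes out lower than the 1900 value only because the made-up vectors were chosen to be more alike; it illustrates the measurement, not the finding.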

Now, maybe the latent space is "incorrect" - although Rasmussen and I use different embeddings that find a similar trend. Maybe it's not meaningful to use cosine distance in this context. But, it does seem like something has to cause it. Whatever it is and whatever it means, it doesn't look like the kind of thing that happens entirely by chance because it is consistent in different datasets and over many years.

3 comments

Thank you for your article! Anecdotally, this is a phenomenon I see in high-pressure service companies, like the McKinseys of this world: there’s a very restricted, idiosyncratic vocabulary used by people working there, driven by the idea that it will promote sales.
Isn't the pre-made model you're using trained almost entirely on recent (last decade or so) text? I didn't dig too far into it but it looks like news, web crawls, twitter, wikipedia, etc.
Without commenting on the overall trend's cause, your diversity hypothesis is bunk and suggests you are looking to make things fit a diversity-related narrative:

- there's (unsurprisingly) no significant diversity-word change from 1900 to 1940 but a very significant distance drop

- there's a big diversity-word change around ~1990 with no concomitant distance change

Your comment is a bit ironic in the sense that I can tell you didn't read the article because you reproduce conclusions from the article. That's okay! Obviously you didn't need to read it to know what I would have said. :)

Let me quote from the end:

"Another argument against connecting distance and diversity is that distance is on a long running decline from 1900 even for the first four decades while diversity words were basically flat. When diversity words pop in the 90's there isn't an immediate reaction in cosine distance, it's only about a decade later, in 2000, that cosine distance takes a steep drop."

That seems awfully similar to the two points you've raised here.

What I do find a bit distasteful is that you jump in with "your diversity hypothesis is bunk" and accuse me of trying to fit a narrative - without even reading what you're commenting on.

Hey, I thought your article was nice. First it had an easy intro to word embeddings and cosine similarity. And second, you followed the investigation and even came up with the idea "against connecting distance and diversity", so it didn't seem you had the conclusion before you started the work.

If someone complains about not mentioning variance - it's still implicitly visible by the cloud of dots representing each year around the regression line.
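
If someone wants that spread as a number rather than a cloud of dots, a rough sketch is to fit an ordinary least-squares line and report the standard deviation of the residuals. This assumes the plot is a plain straight-line fit of yearly distance against year, which I haven't checked against the article's code; the data points and function names below are invented for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residual_std(xs, ys, slope, intercept):
    """Standard deviation of the points around the fitted line (n - 2 dof)."""
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return math.sqrt(sum(r * r for r in residuals) / (len(xs) - 2))


# Made-up yearly average distances, only to exercise the functions.
years = [1900, 1930, 1960, 1990, 2020]
distances = [0.50, 0.47, 0.46, 0.42, 0.40]

slope, intercept = fit_line(years, distances)
spread = residual_std(years, distances, slope, intercept)
print("slope per year:", slope)
print("residual std:", spread)
```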