by hailwren 1589 days ago
This article is almost criminally flawed. The author makes some horrible assumptions and presents shoddy data. Even if we assume that the latent space model is "correct" (we shouldn't), they don't present anything like variance or number of samples in a year. Then they sort of arbitrarily fit a line to data which pretty clearly looks non-linear.

Suppose, for example, that The Crimson (not Harvard btw, it's a student newspaper) runs 10x as many articles this year as last. It's possible you're going to get a huge reduction in cosine distance just by virtue of a few authors producing a lot more content.

At a minimum, we need mean, variance, and number of samples. This doesn't tell you anything about "Harvard", it just tells you about the students whom the Crimson chooses to publish. There are lots of structural reasons within Harvard that the Crimson has probably stopped being a unified voice of the student body -- but again, that's not what the article purports to show.
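To make the request for mean, variance, and number of samples concrete, here is a minimal sketch of a per-year summary. It assumes the yearly figure is a mean pairwise cosine distance between article embeddings, which the thread does not confirm. The names `cosine_distance`, `yearly_summary`, and `toy` are hypothetical, and the vectors are made-up toy values, not data from the article.

```python
import math
from itertools import combinations
from statistics import mean, variance


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def yearly_summary(embeddings_by_year):
    """Summarize pairwise cosine distances for each year.

    Assumption: each year maps to a list of article embedding vectors.
    Years with fewer than two articles are skipped (no pairs to compare).
    """
    summary = {}
    for year, vectors in sorted(embeddings_by_year.items()):
        if len(vectors) < 2:
            continue
        distances = [cosine_distance(u, v) for u, v in combinations(vectors, 2)]
        summary[year] = {
            "n_articles": len(vectors),
            "n_pairs": len(distances),
            "mean": mean(distances),
            # Sample variance needs at least two distances.
            "variance": variance(distances) if len(distances) > 1 else 0.0,
        }
    return summary


# Made-up toy embeddings, for illustration only.
toy = {
    1900: [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    2000: [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2]],
}

for year, stats in yearly_summary(toy).items():
    print(year, stats)
```

Reporting these three numbers next to each yearly point would show whether a drop in the mean is large relative to the spread within a year.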


For the title, I did say "At Harvard" and not "In Harvard" or "By Harvard". I think the student newspaper is indeed at Harvard. I'm also pretty clear about what text I'm looking at in the article. In the title I used "Harvard" over "The Crimson" because I figured fewer people would know "The Crimson" compared to "Harvard".

Regarding the flaws in the article (phew, I'm glad I didn't quite reach criminal level) - I'm curious which assumptions you think are horrible or what data is shoddy. I don't think, for example, that I'm assuming the latent space model is "correct" as you say. I don't think I really have any significant assumptions about the technique or the meaning behind it. I read about the technique in the linked paper and reproduced it in my blog with a different dataset and found a similar result. It's strange to me that the signal produced by this technique is as consistent as it is across the 120 years of data. Beyond that, I'm pretty explicit that I don't know what it means or why it happens.

Regarding the "arbitrarily fit" line - as I say explicitly in the post, that's a regression plot to illustrate the trend.

Regarding the possibility that The Crimson runs more articles per year - that could happen, but it's not reality. They run about (for a generous definition of "about") the same number of articles every year. The articles do get longer over time. Either way, it's not clear to me what impact this should have on average cosine distance.
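One way to see what impact the mix of authors could have on average cosine distance is a deliberately extreme toy example. It assumes the average is taken over all pairs of articles and that each author always writes in exactly one direction of the embedding space, which real embeddings will not do. The names `AUTHOR_A`, `AUTHOR_B`, and `mean_pairwise_distance` are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(vectors):
    """Average cosine distance over every unordered pair of vectors."""
    distances = [cosine_distance(u, v) for u, v in combinations(vectors, 2)]
    return sum(distances) / len(distances)


# Toy assumption: two authors with orthogonal "styles".
AUTHOR_A = [1.0, 0.0]
AUTHOR_B = [0.0, 1.0]

# Year 1: each author writes two articles.
balanced_year = [AUTHOR_A] * 2 + [AUTHOR_B] * 2
# Year 2: author A writes eight articles, author B still writes two.
skewed_year = [AUTHOR_A] * 8 + [AUTHOR_B] * 2

balanced = mean_pairwise_distance(balanced_year)
skewed = mean_pairwise_distance(skewed_year)
print(f"balanced year: {balanced:.3f}")
print(f"skewed year:   {skewed:.3f}")
```

In this toy setup the balanced year averages 4/6 (about 0.667) and the skewed year averages 16/45 (about 0.356), because more of the pairs are same-author pairs with distance zero. It says nothing about whether this happened at The Crimson. It only shows that the average can move when nobody's writing changes.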

There are a lot of things that I looked at that didn't make it into the blog post. Without them, perhaps it looks like I'm cutting corners. If I had included them, I think the blog post would be shooting off in many directions. For example, I considered that political violence might be related - like maybe, in times when there's lots of political violence, elite institutions come together and their language becomes more similar. That didn't really pan out though. I graphed a bunch of things that I ultimately decided didn't contribute very much, and I left them out.

Another way of thinking about it: in the original article, Rasmussen (the original author) says "Look at this elite writing in NSF grants. The cosine distance is decreasing over time." I then say "Here is some elite writing - student newspaper at an elite school. Is the cosine distance decreasing there over time too?" And it is. That's what the blog post is trying to say.

Now, maybe the latent space is "incorrect" - although Rasmussen and I use different embeddings that find a similar trend. Maybe it's not meaningful to use cosine distance in this context. But, it does seem like something has to cause it. Whatever it is and whatever it means, it doesn't look like the kind of thing that happens entirely by chance because it is consistent in different datasets and over many years.

Thank you for your article! Anecdotally, this is a phenomenon I see in high-pressure service companies, like the McKinseys of this world: there’s a very restricted, idiosyncratic vocabulary used by the people working there, driven by the idea that it will promote sales.

Isn't the pre-made model you're using trained almost entirely on recent (last decade or so) text? I didn't dig too far into it but it looks like news, web crawls, Twitter, Wikipedia, etc.

Without commenting on the overall trend's cause, your diversity hypothesis is bunk and suggests you are looking to make things fit a diversity-related narrative:

- there's (unsurprisingly) no significant diversity-word change from 1900 to 1940 but a very significant distance drop

- there's a big diversity-word change around ~1990 with no concomitant distance change

Your comment is a bit ironic in the sense that I can tell you didn't read the article because you reproduce conclusions from the article. That's okay! Obviously you didn't need to read it to know what I would have said. :)

Let me quote from the end:

"Another argument against connecting distance and diversity is that distance is on a long running decline from 1900 even for the first four decades while diversity words were basically flat. When diversity words pop in the 90's there isn't an immediate reaction in cosine distance, it's only about a decade later, in 2000, that cosine distance takes a steep drop."

That seems awfully similar to the two points you've raised here.

What I do find a bit distasteful is that you jump in with "your diversity hypothesis is bunk" and accuse me of trying to fit a narrative - without even reading what you're commenting on.

Hey, I thought your article was nice. First it had an easy intro to word embeddings and cosine similarity. And second, you followed the investigation and even came up with the idea "against connecting distance and diversity", so it didn't seem you had the conclusion before you started the work.

If someone complains about not mentioning variance - it's still implicitly visible in the cloud of dots representing each year around the regression line.
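For what it's worth, the scatter of yearly dots around the line can be put into a single number. Below is a minimal sketch using an ordinary least-squares fit. It assumes one value per year, and the toy numbers are made up, not taken from the article. `fit_line` and `residual_spread` are hypothetical helper names. Note that this measures year-to-year scatter around the trend, which is a different quantity from the variance among articles within a single year.

```python
import math
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residual_spread(xs, ys):
    """Residual standard deviation around the fitted line (n - 2 dof)."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return math.sqrt(sum(r * r for r in residuals) / (len(xs) - 2))


# Made-up yearly values, for illustration only.
years = [1900, 1930, 1960, 1990, 2020]
yearly_distance = [0.52, 0.49, 0.47, 0.46, 0.41]

slope, intercept = fit_line(years, yearly_distance)
print("slope per year:", slope)
print("residual spread:", residual_spread(years, yearly_distance))
```

Comparing the residual spread with the total change along the fitted line gives a rough sense of how strong the trend is relative to the noise between years.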