by hailwren
1589 days ago
This article is almost criminally flawed. The author makes some horrible assumptions and presents shoddy data. Even if we assume that the latent space model is "correct" (we shouldn't), they don't report anything like the variance or the number of samples per year. Then they more or less arbitrarily fit a line to data that pretty clearly looks non-linear. Suppose, for example, that The Crimson (not Harvard, btw; it's a student newspaper) runs 10x as many articles this year as last. It's possible you'd get a huge reduction in cosine distance just by virtue of a few authors producing a lot more content. At a minimum, we need the mean, variance, and number of samples. This doesn't tell you anything about "Harvard"; it just tells you about the students whom The Crimson chooses to publish. There are lots of structural reasons within Harvard why The Crimson has probably stopped being a unified voice of the student body -- but again, that's not what the article purports to show.
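For concreteness, here is a minimal sketch of the per-year summary requested above. It assumes each article has already been mapped to a non-zero embedding vector and that the yearly figure is the mean over all article pairs; both are assumptions for illustration, not something the post states. The names `pairwise_cosine_distances`, `yearly_summary`, and the `toy` data are hypothetical.

```python
import numpy as np

def pairwise_cosine_distances(vectors):
    """Return the cosine distance for every unordered pair of rows."""
    v = np.asarray(vectors, dtype=float)
    # Normalise each row to unit length (assumes no all-zero vectors).
    unit = v / np.linalg.norm(v, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Keep only the strict upper triangle so each pair is counted once.
    i, j = np.triu_indices(len(v), k=1)
    return 1.0 - sims[i, j]

def yearly_summary(embeddings_by_year):
    """Map year -> article count, pair count, mean and sample variance."""
    summary = {}
    for year, vectors in sorted(embeddings_by_year.items()):
        d = pairwise_cosine_distances(vectors)
        summary[year] = {
            "n_articles": len(vectors),
            "n_pairs": int(d.size),
            "mean": float(d.mean()),
            # Sample variance needs at least two pairs.
            "variance": float(d.var(ddof=1)) if d.size > 1 else float("nan"),
        }
    return summary

# Toy, made-up vectors purely to show the shape of the output.
toy = {
    1990: [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    1991: [[1.0, 0.0], [1.0, 0.0]],
}
summary = yearly_summary(toy)
```

Reporting `n_articles` next to the mean makes it visible when a year's average rests on far more (or far fewer) articles than its neighbours. Note that pairwise distances within a year share articles and so are not independent, which means the variance here is descriptive rather than a basis for a standard error.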
Regarding the flaws in the article (phew, I'm glad I didn't quite reach criminal level): I'm curious which assumptions you think are horrible and which data you think are shoddy. I don't think, for example, that I'm assuming the latent space model is "correct", as you say. I don't think I make any significant assumptions about the technique or the meaning behind it. I read about the technique in the linked paper, reproduced it on my blog with a different dataset, and found a similar result. It's strange to me that the signal produced by this technique is as consistent as it is across the 120 years of data. Beyond that, I'm pretty explicit that I don't know what it means or why it happens.
Regarding the "arbitrarily fit" line - as I say explicitly in the post, that's a regression plot to illustrate the trend.
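One hedged way to check whether a straight line is a fair summary of a yearly series is to fit both a line and a quadratic and compare the residuals. The sketch below uses a synthetic, made-up series (not the post's data), and `fit_rss` is a hypothetical helper name.

```python
import numpy as np

def fit_rss(years, values, degree):
    """Least-squares polynomial fit; returns (coefficients, residual sum of squares)."""
    x = np.asarray(years, dtype=float)
    y = np.asarray(values, dtype=float)
    x = x - x.mean()  # centre the years to keep the fit well conditioned
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return coeffs, float(residuals @ residuals)

# Synthetic series, invented purely for illustration: a decline that flattens out.
years = np.arange(1900, 2020, 10)
values = 0.60 - 0.002 * (years - 1900) + 0.00001 * (years - 1900) ** 2

_, rss_linear = fit_rss(years, values, 1)
_, rss_quadratic = fit_rss(years, values, 2)
```

A quadratic can never fit worse than a line on the same points, so a lower residual sum of squares alone proves nothing; the question is whether the improvement is large relative to the year-to-year noise.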
Regarding the possibility that The Crimson runs more articles per year: that's possible in principle, but it's not what happened. They run about (for a generous definition of "about") the same number of articles every year. The articles do get longer over time. Either way, it's not clear to me what impact this should have on average cosine distance.
There are a lot of things that I looked at that didn't make it into the blog post. Without them, perhaps it looks like I'm cutting corners; with them, I think the post would have shot off in too many directions. For example, I considered that political violence might be related: maybe in times of widespread political violence, elite institutions come together and their language becomes more similar. That didn't really pan out, though. I graphed a bunch of things that I ultimately decided didn't contribute much, and left them out.
Another way of thinking about it: in the original article, Rasmussen (the original author) says, "Look at this elite writing in NSF grants. The cosine distance is decreasing over time." I then say, "Here is some other elite writing, a student newspaper at an elite school. Is the cosine distance decreasing there over time too?" And it is. That's what the blog post is trying to say.
Now, maybe the latent space is "incorrect", although Rasmussen and I use different embeddings and find a similar trend. Maybe it's not meaningful to use cosine distance in this context. But it does seem like something has to cause it. Whatever it is and whatever it means, it doesn't look like the kind of thing that happens entirely by chance, because it is consistent across different datasets and over many years.