| For the title, I did say "At Harvard" and not "In Harvard" or "By Harvard". I think the student newspaper is indeed at Harvard. I'm also pretty clear in the article about what text I'm looking at. In the title I used "Harvard" over "The Crimson" because I figured fewer people would know "The Crimson" than "Harvard".
Regarding the flaws in the article (phew, I'm glad I didn't quite reach criminal level) - I'm curious which assumptions you think are horrible or what data is shoddy. I don't think, for example, that I'm assuming the latent space model is "correct" as you say. I don't think I really make any significant assumptions about the technique or the meaning behind it. I read about the technique in the linked paper, reproduced it in my blog with a different dataset, and found a similar result. It's strange to me that the signal produced by this technique is as consistent as it is across the 120 years of data. Beyond that, I'm pretty explicit that I don't know what it means or why it happens.
Regarding the "arbitrarily fit" line - as I say explicitly in the post, that's a regression plot to illustrate the trend.
Regarding the possibility that The Crimson has more articles per year - it's true that's possible. It's not the reality, though: they run about (for a generous definition of "about") the same number of articles every year. The articles do get longer over time. Either way, it's not clear to me what impact this should have on average cosine distance.
There are a lot of things that I looked at that didn't make it into the blog post. Without them, perhaps it looks like I'm cutting corners. If I did include them, I think the blog post would be shooting off in many directions. For example, I considered that political violence might be related - like maybe, in times when there's lots of political violence, elite institutions come together and their language becomes more similar. That didn't really pan out though.
I graphed a bunch of things that I ultimately decided didn't contribute very much and did not include. Another way of thinking about it: in the original article, Rasmussen (the original author) says "Look at this elite writing in NSF grants. The cosine distance is decreasing over time." I then say "Here is some elite writing - a student newspaper at an elite school. Is the cosine distance decreasing there over time too?" And it is. That's what the blog post is trying to say. Now, maybe the latent space is "incorrect" - although Rasmussen and I use different embeddings and still find a similar trend. Maybe it's not meaningful to use cosine distance in this context. But it does seem like something has to cause it. Whatever it is and whatever it means, it doesn't look like the kind of thing that happens entirely by chance, because it is consistent across different datasets and over many years. |
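
For readers unfamiliar with the metric, here is a minimal sketch of what "average cosine distance per year" could look like. It is not the code from the blog post or from Rasmussen's article. The function names, the `embeddings_by_year` dictionary, and the 2-D toy vectors are all made up for illustration; real article embeddings would come from a text embedding model and have many more dimensions.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); 0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_cosine_distance(vectors):
    """Average the cosine distance over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy data: year -> list of article embeddings (invented values).
embeddings_by_year = {
    1900: [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # articles point in varied directions
    2000: [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2]],  # articles point in similar directions
}

# One number per year; a downward trend means the articles are becoming more alike.
trend = {
    year: mean_pairwise_cosine_distance(vectors)
    for year, vectors in sorted(embeddings_by_year.items())
}

for year, value in trend.items():
    print(year, round(value, 4))
```

Because the statistic is a mean over pairs, running more articles in a year changes how many pairs are averaged rather than the scale of the result. Whether it shifts the value would depend on whether the extra articles differ in content from the rest.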