by earksiinni 3119 days ago
I can't seem to comment anymore on the original article, so I'm going to write here in the hope that the author will see it.

Lots of people are commenting about the poor data set selection. I think that's understandable given that the author isn't a historian but rather a baseball geek and data scientist. Still, I can't help but feel a little bit of anger (speaking as a software engineer and a history Ph.D. dropout) at seeing yet another example of a technical person blithely wandering into a field that has been studied for thousands of years and making very grandiose claims without even a cursory study of the field. But hey, that's tech for you, always disrupting (I mean that both sarcastically and not sarcastically at the same time).

What I want to address is more fundamental than getting the data set right, however. The author doesn't seem to understand that the very nature of the historical record is highly subjective.

1) Even if Wikipedia nailed every statistic, the statistics themselves about wins and losses, troop numbers, lengths of battle, places, etc. are increasingly unreliable as you go back in time. In some texts and historiographical traditions, the numbers are not just unreliable, they're arguably cut from whole cloth. Anything earlier than, say, 1000 A.D. in Western Europe is highly disputed. Biases in old texts aside, we have enormous gaps in what texts have survived to this day. History isn't written by the victors, but the dried wood pulp and calf skins that it's written on are selectively preserved by them. The author's model has no awareness of the history of historiography.

(Tangentially, Napoleon, whom the author's model rates as the greatest general of all time, was indirectly responsible for a huge amount of destruction of Europe's archives after issuing orders to transfer archives from across the continent to Paris. Early modern logistics meant that huge portions of documents were destroyed in transport. I remember from when I was working in the Vatican Secret Archives that something like a third of that archive was destroyed, and that's one of the main archives for European and world history.)

2) Even if we had all the numbers exactly perfect, what counts as a victory and what counts as a loss is highly subjective. Was the North Vietnamese Tet Offensive a victory or a loss? For whom? In what sense? These philosophical questions can't be answered by a model, at least not without the philosophical assumptions being made explicit.

This question is fundamentally one that requires a nuanced approach. I think that data-driven approaches can really help, but the author's model needs not only more refinement, it also needs to acknowledge more of the confounding factors involved. I encourage the author to keep working on it.