
> What about the static features (like account creation dates)? Aren't those overfitting with cross-validation?

Yes, absolutely. The paper admits that using only account creation time with KNN was enough to get 98% accuracy all on its own. The authors then broke creation time into hour and minute to increase difficulty (slightly), and introduced heavy data fuzzing to see how far they could go and still get results.

But in practice, that's not an ML problem. It's just asking "how little data is needed to perform recall on this dataset?", and finding that it's relatively easy to associate Twitter accounts with, um, themselves.
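
To make that concrete, here is a toy sketch on synthetic data. It is not the paper's pipeline or dataset; the function names, feature layout, and sizes are all made up for illustration. Every tweet carries its account's creation timestamp (static) and a random posting hour (tweet-level, pure noise here). A 1-nearest-neighbor lookup on a held-out split should then match on the static feature almost perfectly, because it is the same value in train and test.

```python
import random

def make_tweets(n_accounts=50, tweets_per_account=20, seed=0):
    """Build synthetic rows of (account_id, account_created, posted_hour)."""
    rng = random.Random(seed)
    rows = []
    for account in range(n_accounts):
        # Static account-level feature: identical on every tweet of the account.
        created = rng.randrange(10**9)
        for _ in range(tweets_per_account):
            # Tweet-level feature: random here, so it carries no identity signal.
            posted_hour = rng.randrange(24)
            rows.append((account, created, posted_hour))
    rng.shuffle(rows)
    return rows

def one_nn_accuracy(train, test, feature):
    """Accuracy of 1-nearest-neighbor using a single feature column (0 or 1)."""
    correct = 0
    for row in test:
        x = row[1 + feature]
        # Closest training row by absolute difference on the chosen feature.
        nearest = min(train, key=lambda r: abs(r[1 + feature] - x))
        correct += nearest[0] == row[0]
    return correct / len(test)

rows = make_tweets()
split = len(rows) * 4 // 5          # simple 80/20 hold-out split
train, test = rows[:split], rows[split:]

acc_static = one_nn_accuracy(train, test, feature=0)  # account creation time
acc_tweet = one_nn_accuracy(train, test, feature=1)   # posting hour only
print(f"creation time only: {acc_static:.2f}")
print(f"posting hour only:  {acc_tweet:.2f}")
```

The held-out split doesn't help, since the "model" is effectively a dictionary lookup keyed on a value that never changes per account.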

I think this could have been made interesting by going beyond fuzzing and stripping every account-level feature, leaving only tweet-level ones. If metadata like posting time were enough to tie tweets together, that could have interesting consequences for identifying sockpuppets or even deanonymizing human users. But as far as I can tell, including account-level features makes this a non-story.