Hacker News
by taurine 2900 days ago
Glad I am not the only one. This seems like a memorization task, not author identification. How could this be used at test time?

10,000 users is a meaningless sample on a social network with millions of accounts.

What about the static features (like account creation dates)? Aren't those overfitting with cross-validation? Wouldn't learning curves be required when classifying unseen future data (the raison d'être of ML)?

1 comment

> What about the static features (like account creation dates)? Aren't those overfitting with cross-validation?

Yes, absolutely. The paper admits that using only account creation time with KNN was enough to get 98% accuracy all on its own. The authors then broke creation time into hour and minute to increase difficulty (slightly), and introduced heavy data fuzzing to see how far they could go and still get results.

But in practice, that's not an ML problem. It's just asking "how little data is needed to perform recall on this dataset?", and finding that it's relatively easy to associate Twitter accounts with, um, themselves.
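
To make the static-feature point concrete, here is a minimal sketch on synthetic data. It is not the paper's pipeline: the account count, the timestamps, and the single hold-out split (instead of cross-validation) are all assumptions made up for illustration. Every tweet row just repeats its account's creation time, and a 1-nearest-neighbor lookup on that one number is asked to "identify" the account.

```python
import random

def make_samples(n_accounts=200, tweets_per_account=10, seed=0):
    """Synthetic stand-in data: each tweet row repeats its account's creation time."""
    rng = random.Random(seed)
    # Hypothetical creation timestamps, sampled without replacement so each is unique.
    created = rng.sample(range(10**9), n_accounts)
    rows = [(created[a], a) for a in range(n_accounts) for _ in range(tweets_per_account)]
    rng.shuffle(rows)
    return rows

def nearest_neighbor_label(train, x):
    """1-NN on a single numeric feature: label of the closest training value."""
    return min(train, key=lambda row: abs(row[0] - x))[1]

def holdout_accuracy(rows, test_fraction=0.2):
    """Random row-level split, so the same accounts appear in train and test."""
    split = int(len(rows) * (1 - test_fraction))
    train, test = rows[:split], rows[split:]
    hits = sum(nearest_neighbor_label(train, x) == y for x, y in test)
    return hits / len(test)

rows = make_samples()
accuracy = holdout_accuracy(rows)
print(f"1-NN accuracy using only creation time: {accuracy:.2%}")
```

Because the split is over rows rather than accounts, nearly every test row has an exact match in the training set, so the score should land at or near 100% without the model learning anything about how anyone writes. Splitting by account instead would make the task impossible with this feature, which is the same point from the other side.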

I think this could have been made interesting by going beyond fuzzing some of the data and actively stripping every account-level feature, leaving only tweet-level ones. If metadata like posting time were enough to tie tweets together, that could have interesting consequences for identifying sockpuppets or even deanonymizing the human users. But as far as I can tell, including account-level features makes this a non-story.