| Their TL;DAbstract refers to this as a 'conservative' methodology that is 'rigorous' and 'likely undercounts'. Their definition:
> “Spam or Fake Twitter accounts are those that do not regularly have a human being personally composing the content of their tweets, consuming the activity on their timeline, or engaging in the Twitter ecosystem.”

They note the following to differentiate fake and spam:
> Many “fake” accounts under this definition are neither nefarious nor problematic. ... By contrast, most “spam” accounts are an unwanted nuisance.

Some general data analytics notes from their post:
* They lump together fake and spam in their analysis - and this really matters! An account like the NYT's is both 'fake' (meaning it isn't a real person) and A HIGHLY VALUABLE ACCOUNT for Twitter to have.
* They use a sample of 44,058 accounts (of ~1.047B).
* They look at 17 classifying variables; accounts flagged as spam met 10+ of those 17 criteria. They don't list all 17.
* The criteria were developed from a "machine learning process" that is undescribed, trained on a sample of 35,000 'known' fake Twitter followers bought from 3 vendors and 50,000 claimed non-spam accounts. They appear (imply?) to have used 50% training, 50% real data, but don't specify explicitly.
* They say their model is about 65% accurate and unlikely to produce false positives ("almost never includes false positives") - however, they don't list any specificity, sensitivity, etc. that would be useful for evaluating that claim.
* The analysis includes no statistical tests, no confidence intervals, and minimal information about how the model was tested or validated.
* Critically: they note, but do not describe or quantify, that a lot of the criteria are highly correlated.
* Then later in the article they suddenly seem to switch from their 17-point scale to a 10-point scale for quality, with a threshold of 3 or below as low quality?
* My personal Twitter account meets most of the metrics where they have listed a quantifiable threshold. And their fake followers tool lists it as pretty f'ing suspicious - i.e., low quality.

I'm not saying they're wrong, but I am saying good luck getting this from a blog post to any sort of respectable science publication. As they note at the end, they aren't even calculating the same metric - Twitter uses monetizable daily active users - remember the NYT?
Absolutely a monetizable account - even if it isn't a real person. Anyone who thinks this article is proof of Elon's 4D chess is, to me, frankly delusional. |
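
To show why "about 65% accurate" on its own doesn't tell you much, here is a rough Python sketch. It is my own illustration, not anything from their post. The big assumption is that accuracy was measured on a set with the same class balance as the 35,000 fake / 50,000 non-spam accounts they describe, which they don't actually state. The function names, the zero-false-positive scenario, and the flagged count of 8,000 used for the confidence interval are all placeholders I made up for the example.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of fake/spam accounts caught
        "specificity": tn / (tn + fp),  # share of genuine accounts left alone
    }


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# ASSUMPTION: evaluation set mirrors the 35,000 fake / 50,000 non-spam split.
N_FAKE, N_REAL = 35_000, 50_000

# Baseline: a "model" that never flags anything as fake.
baseline = confusion_metrics(tp=0, fp=0, tn=N_REAL, fn=N_FAKE)

# HYPOTHETICAL scenario: exactly zero false positives and 65% accuracy.
# Solve for the number of true positives that scenario implies.
tp = round(0.65 * (N_FAKE + N_REAL)) - N_REAL
cautious = confusion_metrics(tp=tp, fp=0, tn=N_REAL, fn=N_FAKE - tp)

# PLACEHOLDER count: 8,000 flagged out of the 44,058 sampled accounts.
# This is not a figure from the post, just something to feed the formula.
low, high = wilson_interval(8_000, 44_058)

print(f"never-flag baseline accuracy: {baseline['accuracy']:.3f}")
print(f"zero-FP scenario sensitivity: {cautious['sensitivity']:.3f}")
print(f"95% Wilson interval: {low:.4f} to {high:.4f}")
```

Under those assumptions, flagging nothing at all already gets you about 59% accuracy, and 65% accuracy with zero false positives would mean catching only about 15% of the fake accounts. Whether that's what their model does, we can't tell, which is exactly the problem with not reporting sensitivity and specificity. Also note the interval only covers sampling error in the 44,058 accounts; it says nothing about how often the classifier itself is wrong.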