by avs733 1500 days ago
Their TL;DAbstract calls this a 'conservative' methodology that is 'rigorous' and 'likely undercounts'.

Their definition:

> “Spam or Fake Twitter accounts are those that do not regularly have a human being personally composing the content of their tweets, consuming the activity on their timeline, or engaging in the Twitter ecosystem.”

They note the following to differentiate fake and spam:

> Many “fake” accounts under this definition are neither nefarious nor problematic. ... By contrast, most “spam” accounts are an unwanted nuisance.

Some general data analytics notes from their post:

* They lump together fake and spam in their analysis, and this really matters! Somewhere like the NYT is both 'fake', meaning it isn't a real person, and A HIGHLY VALUABLE ACCOUNT for Twitter to have.

* They use a sample of 44,058 accounts (of ~1.047B)

* They look at 17 classifying variables; accounts that met 10+ of those 17 criteria were counted as spam. They don't list all 17.

* The criteria were developed from a "machine learning process" that is undescribed, using a sample of 35,000 'known' fake Twitter followers bought from 3 vendors and 50,000 claimed non-spam accounts. They appear (imply?) to have used 50% training 50% real data but don't specify explicitly.

* They say their model is about 65% accurate, and unlikely to produce false positives ("almost never includes false positives"). However, they don't list any specificity, sensitivity, etc. that would be useful for evaluating that claim.

* The analysis includes no statistical tests, no confidence intervals, and minimal information about how the model was tested or validated.

* Critically: they note, but do not describe or quantify, that a lot of the criteria are highly correlated

* Then later in the article they suddenly seem to switch from their 17 point scale to a 10 point scale for quality, with a threshold of 3 or below as low quality?

* My personal twitter account meets most of the metrics where they have listed a quantifiable threshold. And their fake followers tool lists it as pretty f'ing suspicious - i.e., low quality.
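
To make the accuracy point concrete, here is a rough sketch of why "about 65% accurate" tells you very little without sensitivity and specificity. This is not their method. The confusion-matrix counts below are invented for illustration; only the 35,000 / 50,000 class sizes come from their write-up, and I'm assuming the evaluation set had that same mix, which they don't actually say.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts for a binary fake/real classifier (all values hypothetical)."""
    tp: int  # fake accounts flagged as fake
    fn: int  # fake accounts missed
    fp: int  # real accounts wrongly flagged as fake
    tn: int  # real accounts correctly left alone

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fn + self.fp + self.tn)

    @property
    def sensitivity(self) -> float:
        # Share of fake accounts that get caught
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        # Share of real accounts that are correctly not flagged
        return self.tn / (self.tn + self.fp)


# Assumed evaluation mix: 35,000 fake, 50,000 real (85,000 total).
# Made-up classifier A: never flags a real account, misses most fakes.
cautious = Confusion(tp=5_250, fn=29_750, fp=0, tn=50_000)
# Made-up classifier B: catches most fakes, flags half the real accounts.
noisy = Confusion(tp=30_250, fn=4_750, fp=25_000, tn=25_000)

for name, c in [("cautious", cautious), ("noisy", noisy)]:
    print(
        f"{name}: accuracy={c.accuracy:.0%} "
        f"sensitivity={c.sensitivity:.0%} specificity={c.specificity:.0%}"
    )
```

Both made-up classifiers come out at 65% accuracy. The first has 15% sensitivity with 100% specificity; the second has about 86% sensitivity with 50% specificity. Those would lead to completely different estimates of how many accounts are fake, which is why the missing numbers matter.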

I'm not saying they're wrong, but I am saying good luck getting this from a blog post into any sort of respectable science publication. As they note at the end, they aren't even calculating the same metric: Twitter uses monetizable daily active users. Remember the NYTimes? Absolutely a monetizable account, even if it isn't a real person.

Anyone who thinks this article is proof of Elon's 4D chess is, to me, frankly delusional.