Hacker News
by yellowbeard 3154 days ago
Monthly actives is generally the primary metric and it's not so easy to calculate.
3 comments

Quote from http://money.cnn.com/2017/10/26/technology/business/twitter-...

"These third-party applications used Digits, a software development kit of our now-divested Fabric platform, that allowed third-party applications to send authentication messages via SMS through our systems, which did not relate to activity on the Twitter platform," the company explained in its earnings report.

Really seems like something they should have caught earlier.

Do you mean it’s not easy to define? It shouldn’t be difficult to calculate any particular metric going forward, but it’s important to define what it means to be “active.”
Calculating these metrics at scale is not trivial.
In real time, yes.

But the user database should already have backups, importing those backups into an analysis server should be easy, and running queries like that on an analysis server should be easy.

Counting messages, or users with X messages, etc. is also largely a function of whether your backup/restore system works. But this time you do it in chunks.
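A rough sketch of what "do it in chunks" could look like, assuming the restored backup can be streamed as (user_id, message_id) rows. The row shape, the function names, and the chunk size are made up for illustration, not anything a particular backup system provides.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List, Set, Tuple

Row = Tuple[str, int]  # (user_id, message_id); hypothetical row shape


def chunks(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def users_with_at_least(rows: Iterable[Row], min_messages: int,
                        chunk_size: int = 100_000) -> Set[str]:
    """Return the users who sent at least `min_messages` messages.

    Only one chunk of rows is held in memory at a time; the per-user
    counters are the only state that grows with the data.
    """
    counts: Counter = Counter()
    for chunk in chunks(rows, chunk_size):
        counts.update(user_id for user_id, _ in chunk)
    return {user for user, n in counts.items() if n >= min_messages}
```

The per-user counters still have to fit in memory, so this only stretches as far as the number of distinct users allows.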

I helped build Twitter's data platform, 2010-2016.

There isn't an "analysis server," and at Twitter's scale user activity is not analyzed on a "user database backup," though that is indeed a common way to do it at smaller businesses.

By the way, if by user db you literally mean the db with user accounts, that's not the right data source -- you want the user _activity_ db to count active users, and for high-scale applications, those are different things. Presumably user activity updates are orders of magnitude more frequent than user object updates. You don't want to thrash your user db by constantly updating some "last seen at" field. Put that stuff somewhere else.

That said, it's true that counting is simple: it's just a job for Hadoop, Spark, or your distributed computing platform of choice. Filter, distinct, count. It's not even hard in real time if you have enough RAM or are OK with approximate counts with bounded error, thanks to Storm, Heron, Flink, etc.
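To make "filter, distinct, count" concrete, here is a single-machine sketch in plain Python rather than an actual Hadoop or Spark job. The event shape (user_id, timestamp) and all names are assumptions for illustration. The second half is a K-minimum-values sketch, one of several ways to get approximate distinct counts in fixed memory; it is a stand-in for the idea, not what any of the streaming systems named above ship with.

```python
import hashlib
import heapq
from datetime import datetime
from typing import Iterable, List, Set, Tuple

Event = Tuple[str, datetime]  # (user_id, timestamp); hypothetical shape


def count_active_exact(events: Iterable[Event],
                       start: datetime, end: datetime) -> int:
    """Filter to the window, take distinct user ids, count them."""
    return len({uid for uid, ts in events if start <= ts < end})


class KMVSketch:
    """Approximate distinct count using the k smallest 64-bit hashes.

    Memory stays at O(k); the relative standard error is roughly
    1/sqrt(k) once more than k distinct items have been seen.
    """

    def __init__(self, k: int = 1024) -> None:
        self.k = k
        self._heap: List[int] = []     # negated hashes, so heap[0] is the largest kept
        self._members: Set[int] = set()  # hashes currently kept

    def add(self, item: str) -> None:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        if h in self._members:
            return  # already counted this item
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct items: exact
        kth_smallest = max(-self._heap[0], 1)
        return (self.k - 1) * 2**64 / kth_smallest
```

In a streaming setup you would call `add` per event and read `estimate` whenever you need the number; a larger `k` buys a tighter error at the cost of more memory.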

Defining what exactly constitutes an active user and catching edge cases such as this Digits thing is where things get tricky; the number of weird scenarios that cause under/overcount for what seem like reasonable and straightforward definitions would surprise you.
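A toy illustration of how the definition moves the number, assuming each activity event carries a label for where it came from. The `Event` fields and the `third_party_sms_auth` label are hypothetical; they only mimic the kind of case described in the earnings report quote, where traffic went through the company's systems without being activity on the platform.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class Event:
    user_id: str
    source: str  # hypothetical label for where the event originated
    kind: str    # hypothetical event type


# Assumption: sources that pass through our systems but are not
# activity on the product itself.
EXCLUDED_SOURCES: FrozenSet[str] = frozenset({"third_party_sms_auth"})


def active_users(events: Iterable[Event],
                 excluded_sources: FrozenSet[str] = EXCLUDED_SOURCES) -> Set[str]:
    """Distinct users with at least one event from a non-excluded source."""
    return {e.user_id for e in events if e.source not in excluded_sources}


events = [
    Event("alice", "web", "post"),
    Event("bob", "mobile", "login"),
    Event("carol", "third_party_sms_auth", "sms_sent"),
    Event("alice", "third_party_sms_auth", "sms_sent"),
]

naive = active_users(events, excluded_sources=frozenset())  # counts everything
strict = active_users(events)                               # applies the exclusion
```

Here `naive` has three users and `strict` has two: carol only ever shows up through the excluded source, while alice stays active because of her web post. The counting code is identical in both cases; only the definition changed.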

@baddox nailed it.

Thanks. Note that I wasn't trying to guess at what Twitter does, just to provide a workflow that should be viable almost anywhere, in the absence of easier options. It's good to hear that the underlying idea of "calculating the metric isn't the hard part" is true.
Oh, fair, that would be a more important metric, but when they said "user base" I incorrectly assumed they meant "all registered users."