| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by squarecog 3159 days ago

I helped build Twitter's data platform, 2010-2016.

There isn't an "analysis server" and analyzing user activity is not done on a "user database backup" at Twitter's scale, though indeed that's a common way that would be done for smaller businesses.

By the way, if by user db you literally mean the db with user accounts, that's not the right data source -- you want the user _activity_ db to count active users, and for high-scale applications, those are different things. Presumably user activity updates are orders of magnitude more frequent than user object updates. You don't want to thrash your user db by constantly updating some "last seen at" field. Put that stuff somewhere else.

That said, it's true that counting is simple, it's just a Hadoop / Spark / distributed computing platform of choice job. Filter, distinct, count. It's not even hard in real-time if you have enough ram or are ok with approximate counts with bounded error, thanks to Storm, Heron, Flink, etc.

Defining what exactly constitutes an active user and catching edge cases such as this Digits thing is where things get tricky; the number of weird scenarios that cause under/overcount for what seem like reasonable and straightforward definitions would surprise you.

@baddox nailed it.

1 comments

Dylan16807 3159 days ago

Thanks. Note that I wasn't trying to guess at what twitter does, just to provide a workflow that should be viable almost anywhere, in the absence of easier options. It's good to hear that the underlying idea of "calculating the metric isn't the hard part" is true.

link