| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by nbm 4286 days ago

A lot of that data is duplicated to allow for efficient querying or transformation. It often is too slow to process the data as it comes in, so an initial process will write the data in a raw form, and some other process might select a subset of the data to process, and then submit it in an "annotated" form (filling in, say, the AS number of the client IP). Another process will run later in a batched fashion and perhaps annotate the full set of information and summarize it into a bunch of easily-queried tables.

A lot of that data is also not tied to individuals either - for example the access logs for the CDN (which, being on a different domain by design, does not share cookies so is not attached to an account) even reasonably heavily sampled is probably tens of gigabytes a day, and is rolled up into efficient forms for queries in various ways. A lot of it isn't even about requests coming through the web site/API - it may just be internal inter-service request information, or inter-datacenter flow analysis, or per-machine service metrics ("Oh, look, process A on machines B through E went from 2GB resident to 24GB in 30 seconds a few seconds before the problem manifested").

(Not that it makes too much of a difference at this scale, but it is closer to 860M daily actives.)