by bradhe 2442 days ago
So, if I'm reading this correctly, 2.5GB/s of log data being generated? If we assume (aggressively) that they have 5mil machines in their infrastructure, doesn't that mean that each machine would have to be generating 500kB/s of log data?

Despite that, I find the claims to be underwhelming. So your system can process massive amounts of data by scaling massively horizontally...neat.


The number in the article is 2.5 TB/s, not GB/s :)

(disclaimer: I work on Scribe)

Right, sorry. But the point still stands. Under what circumstances was that much data being generated from (what I'm assuming is) normal logging?

I'm not following. I understood from your first comment that you think the amount of data is low ("underwhelming") and from your last comment that it's a lot ("that much data").

In any case, the data is "whatever needs to be logged".

And it's not "server logs", which is how I'm reading your comment. Scribe transports most data at Facebook to be processed by real-time systems (e.g. Puma, Scuba) and also "batch systems" (data warehouse). So, it's quite a lot, being "the ingestion pipe" for Facebook.

Does this answer your question? :-?

Puma: https://research.fb.com/publications/realtime-data-processin...

Scuba: https://research.fb.com/publications/scuba-diving-into-data-...

> So, it's quite a lot, being "the ingestion pipe" for Facebook.

I see. I walked away from the article with the impression that it was meant to be a log aggregation service a la Flume, Splunk, or Logstash.

> the amount of data is low ("underwhelming") and from your last comment that it's a lot ("that much data").

I was remarking on the numbers in regard to generation, not consumption. Based on the article, my estimate was meant to point out that generating 2.5TB/s of transactional logs and telemetry data using "millions" of machines would be technically possible but not reasonably practical...and thus likely not real ;). But you corrected my understanding: that number is based on a different use case.
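
For anyone checking the back-of-envelope math in this thread, here is a rough sketch of the per-machine rate. It assumes decimal units (1 kB = 1000 B) and takes the 5 million machine count from the first comment, which is that commenter's own guess and not a published figure. The helper name `per_machine_rate` is made up for illustration. Note that the 500kB/s figure in the first comment only works out with the 2.5 TB/s total; with 2.5 GB/s it would be about 500 B/s.

```python
# Back-of-envelope per-machine log rate, in decimal units (1 kB = 1000 B).
# The fleet size is the first commenter's guess, not a published number.

def per_machine_rate(total_bytes_per_s: float, machines: int) -> float:
    """Average bytes per second each machine would have to produce."""
    return total_bytes_per_s / machines

MACHINES = 5_000_000  # assumed fleet size ("5mil" in the first comment)

GB = 10**9
TB = 10**12

rate_gb = per_machine_rate(2.5 * GB, MACHINES)  # total as first quoted (GB/s)
rate_tb = per_machine_rate(2.5 * TB, MACHINES)  # total stated in the article (TB/s)

print(f"2.5 GB/s -> {rate_gb:.0f} B/s per machine")
print(f"2.5 TB/s -> {rate_tb / 1000:.0f} kB/s per machine")
```

This is only an average over an assumed fleet; as the reply above explains, the traffic is not per-server logging, so the real distribution across machines is likely very uneven.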