by memetherapy 1992 days ago
Genuine question from someone from an entirely different world - why on earth do you have 10 billion log entries? What is in them and do you ever do anything with them that requires you to store so much data rather than just a representative subset?

Author here! These 10B log lines are from the last 60 days of activity from https://gocardless.com/ systems.

It includes:

- System logs, such as our Kubernetes VM host logs, or our Chef Postgres machines

- Application logs from Kubernetes pods

- HTTP and RPC logs

- Audit logs from Stackdriver (we use GCP for all our infrastructure)

> do you ever do anything with them that requires you to store so much data rather than just a representative subset?

Some of the logs are already sampled, such as VPC flow logs, but the majority aim for 100% capture.

Especially for application logs, which are used for audit and many other purposes, developers expect all of their logs to stick around for 60d.

Why we do this is quite simple: for the amount of value we get from storing this data, in terms of introspection, observability and in some cases differentiated product capabilities like fraud detection, the cost of running this cluster is quite a bargain.

I suspect we'll soon cross a threshold where keeping everything will cost us more than it's worth, but I'm confident we can significantly reduce our costs with a simple tagging system, where developers mark logs as requiring shorter retention windows.
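A minimal sketch of what such a tagging scheme could look like, assuming each log entry is a dict with a `timestamp` and an optional `retention` field. The field name, the tag values, and the shorter windows are hypothetical and only for illustration; only the 60-day default comes from the thread.

```python
from datetime import datetime, timedelta, timezone

# Default window mentioned in the thread: keep everything for 60 days.
DEFAULT_RETENTION = timedelta(days=60)

# Hypothetical tags a developer could attach to a log line.
RETENTION_BY_TAG = {
    "short": timedelta(days=7),
    "medium": timedelta(days=30),
}


def retention_for(entry: dict) -> timedelta:
    """Return the retention window for an entry, falling back to the default."""
    return RETENTION_BY_TAG.get(entry.get("retention"), DEFAULT_RETENTION)


def is_expired(entry: dict, now: datetime) -> bool:
    """True if the entry is older than its retention window."""
    return now - entry["timestamp"] > retention_for(entry)


now = datetime(2020, 3, 1, tzinfo=timezone.utc)
debug_line = {"timestamp": now - timedelta(days=10), "retention": "short"}
audit_line = {"timestamp": now - timedelta(days=10)}  # untagged: default window

print(is_expired(debug_line, now))  # True: 10 days exceeds the 7-day window
print(is_expired(audit_line, now))  # False: still inside the 60-day default
```

Untagged entries keep the full default window, so nothing is dropped early unless a developer opts in.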

Hopefully that gives you a good answer! In case you're interested, my previous post mentioned how keeping our HTTP logs around in a queryable form was really useful for helping make a product decision:

https://blog.lawrencejones.dev/connected-data/

Thanks for the response, really interesting to see how this stuff is used.
Are you also using Google Tracer? I haven't been able to get any traces to work for ages with Node.
> why on earth do you have 10 billion log entries?

It's pretty low volume actually. A small company with < 100 developers and servers can generate a billion logs over a few weeks.

Normal logs from the system, syslog, applications, databases, web servers... nothing fancy really. It's common practice to centralize all these into Elasticsearch or Splunk.

Their scale of 10 billion logs (60 TB) means they're a regular small-to-medium company.

You've nailed this!

This logging system was for all https://gocardless.com/ systems. We're a B2B company which means we have different economies of scale than many scale-ups of our size, but you were close with your guess:

Currently 450 people worldwide, ~150 in product development, of which ~100 are full-time developers.

This seems suspect: that works out to approximately 25 log messages per developer per second, assuming a 10-hour work day.

I work in a tightly regulated industry (finance), and even my company doesn't have a need to log 25 messages per second per person.

Is anyone else able to validate this claim that regular small companies log this much data?
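For anyone who wants to check the arithmetic, here is a rough sketch. The inputs are assumptions: "a billion logs", "< 100 developers" (taken as 100), and "a few weeks" read as two or three 5-day working weeks, alongside the 10 billion lines over 60 days quoted upthread. The function name is made up for illustration.

```python
SECONDS_PER_HOUR = 3600


def logs_per_dev_per_second(total_logs, days, developers, hours_per_day):
    """Average log lines per developer per second over the active window."""
    active_seconds = days * hours_per_day * SECONDS_PER_HOUR
    return total_logs / (developers * active_seconds)


# "A billion logs over a few weeks" with 100 developers and 10-hour days,
# counting only working days (assumption: 5 per week).
two_weeks = logs_per_dev_per_second(10**9, 10, 100, 10)
three_weeks = logs_per_dev_per_second(10**9, 15, 100, 10)
print(f"2 working weeks: {two_weeks:.1f} lines/dev/s")    # 27.8
print(f"3 working weeks: {three_weeks:.1f} lines/dev/s")  # 18.5

# The 10 billion lines over 60 calendar days quoted upthread, 100 developers.
work_hours = logs_per_dev_per_second(10**10, 60, 100, 10)
all_hours = logs_per_dev_per_second(10**10, 60, 100, 24)
print(f"60 days, 10-hour days: {work_hours:.1f} lines/dev/s")  # 46.3
print(f"60 days, 24-hour days: {all_hours:.1f} lines/dev/s")   # 19.3
```

The result swings by more than 2x depending on the window and hours assumed, and most of these lines come from systems running around the clock rather than from people typing.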

Any time it's something ridiculous like this, I assume it's for compliance. A few industries require all info to be retained for 7 years.