by renonce 1181 days ago
Keeping a complete log of all GET requests to random files in a public repository in a reliable way would be insane.

No, it wouldn’t be - assuming by “insane” you mean “silly to do”. I build systems at Google that do exactly that.

Whether it’s worth the cost is a decision each company makes. Also, you don’t need to keep the log forever. Retention of a few weeks at most would be common.

What guarantees do these systems provide? Are 100% of requests where data was served guaranteed to either end up in the log or at least create a detectable "logs may have been lost here" event?

Or does it log all the requests all the time as long as everything goes well, but if the wrong server crashes at the wrong time, a couple hundred requests may get lost because in the end who cares?
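
To make the "detectable loss" option concrete, here is a minimal sketch of one common approach: each serving node stamps its log entries with a per-node sequence number, and the collector looks for holes. This is only an illustration under assumptions, not a description of how any particular company's system works; `find_gaps`, `server_id`, and `seq` are hypothetical names.

```python
from collections import defaultdict


def find_gaps(entries):
    """Return {server_id: [missing sequence numbers]} for received entries.

    `entries` is an iterable of (server_id, seq) pairs, where each server
    is assumed to number its own log entries 1, 2, 3, ... (hypothetical scheme).
    """
    seen = defaultdict(set)
    for server_id, seq in entries:
        seen[server_id].add(seq)

    gaps = {}
    for server_id, seqs in seen.items():
        # Anything between the lowest and highest number we received,
        # but not itself received, is a "logs may have been lost here" event.
        missing = sorted(set(range(min(seqs), max(seqs) + 1)) - seqs)
        if missing:
            gaps[server_id] = missing
    return gaps


received = [("edge-1", 1), ("edge-1", 2), ("edge-1", 4), ("edge-2", 1)]
print(find_gaps(received))
```

Note that this only catches holes in the middle of a sequence. Entries lost at the very end (the node crashed and never sent anything afterwards) leave no hole, so detecting those would need something extra, such as a heartbeat that reports the latest sequence number.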

Presumably, keeping 'last remotely accessed' and 'last remotely modified' for every file (or other stats that are a digest of the logs) is sane for pretty much any system too. Having a handle on how much space one is dedicating to files that are never viewed and/or never updated seems like something web companies that have public file access would all want?
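
A rough sketch of what such a digest could look like, assuming access events arrive as (path, kind, timestamp) where kind is "read" or "write". All names here (`FileStats`, `update_stats`, `stale_paths`) are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FileStats:
    last_accessed: Optional[float] = None  # latest remote read, if any
    last_modified: Optional[float] = None  # latest remote write, if any


def update_stats(stats: Dict[str, FileStats], path: str, kind: str, ts: float) -> FileStats:
    """Fold one access event into the per-file digest, keeping the latest timestamp."""
    entry = stats.setdefault(path, FileStats())
    if kind == "read":
        if entry.last_accessed is None or ts > entry.last_accessed:
            entry.last_accessed = ts
    elif kind == "write":
        if entry.last_modified is None or ts > entry.last_modified:
            entry.last_modified = ts
    else:
        raise ValueError(f"unknown event kind: {kind!r}")
    return entry


def stale_paths(stats: Dict[str, FileStats], cutoff: float) -> List[str]:
    """Paths never read remotely, or last read before `cutoff`."""
    return sorted(
        path
        for path, s in stats.items()
        if s.last_accessed is None or s.last_accessed < cutoff
    )
```

The digest stays at two timestamps per file no matter how many requests come in, which is why it is so much cheaper than keeping the full log. The flip side is that it cannot answer "who fetched this file, and when".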

It's not just GET requests. Someone could have cloned or refreshed the repo over SSH. The repo might have been indexed by GitHub's internal search daemon, which might not use the public HTTP API but some internal access path, whatever that might look like. You might have purged that daemon's database, but what about its backups? What about people who have subscribed, via the API, to public events happening in the github.com/github org?

You'd have to have logging set up for all of these services and it would have to work over your entire CDN... and what if a CDN node crashed before it was able to submit the log entry to the log collector? You'll never know.
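
For the crash-before-submit case, one possible mitigation (a sketch under assumptions, not a claim about what any CDN actually does) is to write the log record to local durable storage before serving the response, so that a node which crashes still has the record on disk when it comes back. `log_then_serve` and its parameters are hypothetical names.

```python
import json
import os


def log_then_serve(log_path, request, serve):
    """Durably append a log record for `request`, then serve it.

    `request` is assumed to be a JSON-serializable dict and `serve` a
    callable that produces the response (both hypothetical).
    """
    record = json.dumps(request) + "\n"
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(record)
        f.flush()
        # Ask the OS to push the record to the storage device before any
        # data is served, so a later process crash cannot lose it.
        os.fsync(f.fileno())
    return serve(request)
```

This trades latency for durability, since every request now waits on a disk sync, and it still loses records if the node's disk itself is gone. It also over-counts rather than under-counts: a record can exist for a response that was never fully delivered.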