| HN Mirror



	by renonce 1181 days ago
	Keeping a complete log of all GET requests to random files in a public repository in a reliable way would be insane.

2 comments

otoolep 1181 days ago

No, it wouldn’t be - assuming by “insane” you mean “silly to do”. I build systems at Google that do exactly that.

Whether it’s worth the cost is a decision each company makes. Also, you don’t need to keep the log forever. Retention of a few weeks at most would be common.


tgsovlerkhgsel 1179 days ago

What guarantees do these systems provide? Are 100% of requests where data was served guaranteed to either end up in the log or at least create a detectable "logs may have been lost here" event?

Or does it log all the requests all the time as long as everything goes well, but if the wrong server crashes at the wrong time, a couple hundred requests may get lost because in the end who cares?
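One way to get the "logs may have been lost here" event asked about above is to have each serving node stamp its log records with a sequence number and let the collector flag holes. This is only a minimal sketch of that general idea, not a description of any system mentioned in this thread; the `GapDetector` class and the node names are hypothetical, and it assumes records from a single node arrive in order.

```python
from dataclasses import dataclass, field


@dataclass
class GapDetector:
    """Collector-side check for missing log records (hypothetical scheme)."""
    next_seq: dict = field(default_factory=dict)  # node id -> next expected sequence number
    gaps: list = field(default_factory=list)      # (node id, first missing, last missing)

    def ingest(self, node_id: str, seq: int) -> None:
        expected = self.next_seq.get(node_id, 0)
        if seq > expected:
            # Records expected..seq-1 never arrived: record a "logs may be lost" event.
            self.gaps.append((node_id, expected, seq - 1))
        # Duplicates and late arrivals (seq < expected) are ignored in this sketch.
        self.next_seq[node_id] = max(expected, seq + 1)


detector = GapDetector()
for node, seq in [("edge-1", 0), ("edge-1", 1), ("edge-1", 4), ("edge-2", 0)]:
    detector.ingest(node, seq)

print(detector.gaps)  # [('edge-1', 2, 3)]
```

Note that this only catches holes in the middle of a stream. Records lost at the tail, when a node dies after its last delivered record, would need something extra such as periodic heartbeats carrying the latest sequence number.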


pbhjpbhj 1181 days ago

Presumably, keeping 'last remotely accessed' and 'last remotely modified' for every file (or other stats that are a digest of the logs) is sane for pretty much any system too. Having a handle on how much space one is dedicating to files that are never viewed and/or never updated seems like something all web companies with public file access would want?
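A digest like that can be kept without retaining the raw log at all. The following is a rough sketch under assumed inputs: the `(timestamp, method, path)` record shape, the `FileStats` class, and the split of HTTP methods into reads and writes are all illustrative choices, not anything a particular host is known to use.

```python
from dataclasses import dataclass
from typing import Optional

READ_METHODS = {"GET", "HEAD"}
WRITE_METHODS = {"PUT", "POST", "DELETE"}


@dataclass
class FileStats:
    last_accessed: Optional[int] = None  # Unix timestamp of the latest read
    last_modified: Optional[int] = None  # Unix timestamp of the latest write


def digest(records):
    """Fold (timestamp, method, path) records into per-file stats."""
    stats = {}
    for ts, method, path in records:
        entry = stats.setdefault(path, FileStats())
        if method in READ_METHODS:
            entry.last_accessed = ts if entry.last_accessed is None else max(entry.last_accessed, ts)
        elif method in WRITE_METHODS:
            entry.last_modified = ts if entry.last_modified is None else max(entry.last_modified, ts)
    return stats


stats = digest([
    (100, "PUT", "/a.txt"),
    (150, "GET", "/a.txt"),
    (120, "GET", "/a.txt"),
    (200, "PUT", "/b.txt"),
])
# Files that were written but never read are candidates for the "never viewed" bucket.
never_viewed = sorted(p for p, s in stats.items() if s.last_accessed is None)
print(never_viewed)  # ['/b.txt']
```

Such a digest answers "was this file ever fetched, and when last?" but not "by whom, and how many times?", so it does not replace a full request log for incident analysis.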


est31 1181 days ago

It's not just GET requests. Someone could have cloned or refreshed the repo over SSH. The repo might have been indexed by GitHub's internal search daemon, which might not use the public HTTP API but some internal access path, whatever that looks like. You might have purged that daemon's database, but what about its backups? What about people who have subscribed, via the API, to public events happening in the github.com/github org?

You'd have to have logging set up for all of these services, and it would have to work across your entire CDN... and what if a CDN node crashed before it was able to submit the log entry to the log collector? You'd never know.
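For the crashed-node case specifically, one common mitigation is to make the log record durable on local disk before any data is served, so that an unshipped record can be recovered after a restart. The sketch below illustrates that ordering only; `serve_with_durable_log`, `recover`, and the JSON-lines file format are hypothetical names and choices, and whether any real CDN works this way is not something this thread establishes.

```python
import json
import os
import tempfile


def serve_with_durable_log(log_path, request, handler):
    """Append the request to a local log and fsync it before serving the response."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(request) + "\n")
        log.flush()
        os.fsync(log.fileno())  # ask the OS to push the record to stable storage
    # Only now is data handed out. A crash before this point over-logs, never under-logs.
    return handler(request)


def recover(log_path):
    """After a restart, read back records that may not have reached the collector."""
    with open(log_path, encoding="utf-8") as log:
        return [json.loads(line) for line in log if line.strip()]


log_file = os.path.join(tempfile.mkdtemp(), "access.log")
response = serve_with_durable_log(
    log_file, {"method": "GET", "path": "/a.txt"}, lambda req: "200 OK"
)
print(response, recover(log_file))
```

The trade-off is an fsync on the request path, which costs latency, and the guarantee only holds if the node's disk survives the crash. It also does nothing for the other access paths listed above, such as SSH clones or internal indexers, unless each of them logs the same way.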
