by peterwaller
4348 days ago
This just made me discover the GitHub Archive.
$ wget http://data.githubarchive.org/2014-07-21-{0..23}.json.gz
...
Downloaded: 24 files, 129M in 30s (4.25 MB/s)
Cool. A day's worth of public events is 129MB compressed. That's surprisingly small! Let's play for a second.
$ ls *.gz | xargs -P4 -n1 gunzip
$ du -sch *.json
...
807M total
Time to break out jq: https://stedolan.github.io/jq/manual/
$ time jq .type *.json | wc -l
408218
real 0m16.788s
user 0m16.366s
sys 0m0.325s
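A quick back-of-envelope on what that timing implies, as a rough sketch. The `throughput` helper name is made up for illustration. It assumes a single run like the one above scales linearly across days and uses 365.25 days per year.

```shell
# Hypothetical helper: rough throughput figures from one day's sample.
# $1 = number of events in the day, $2 = wall-clock seconds jq took.
throughput() {
    LC_ALL=C awk -v n="$1" -v s="$2" 'BEGIN {
        day = 86400                      # seconds in a day
        printf "events per second of real time: %.1f\n", n / day
        printf "days of data per day of compute: %.0f\n", day / s
        printf "years of data per day of compute: %.1f\n", day / s / 365.25
    }'
}

# Numbers taken from the jq run above.
throughput 408218 16.788
```

This ignores download and decompression time, so treat it as an upper bound on what one desktop core gets through.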
That's an easy amount of data to mess with. If a day takes 16 seconds to process, I can do 14 years on my measly desktop in one day! 408k public events - around 5 a second. I somehow imagined events would flood into GitHub even faster than that. I wonder what their public/private activity ratio is.
Let's explore the event types:
$ time jq .type *.json | sort | uniq -c | sort -n
405 "PublicEvent"
697 "TeamAddEvent"
1018 "ReleaseEvent"
1636 "MemberEvent"
3166 "CommitCommentEvent"
3892 "GollumEvent"
6925 "DeleteEvent"
7051 "PullRequestReviewCommentEvent"
14807 "ForkEvent"
18579 "PullRequestEvent"
19919 "IssuesEvent"
37942 "WatchEvent"
38402 "IssueCommentEvent"
46033 "CreateEvent"
207746 "PushEvent"
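To put numbers on that listing, here is a small sketch that works out the push share and the push-to-issue ratio from the counts above. The `push_stats` name is hypothetical, and the three arguments are simply copied from the output (the total is the `wc -l` figure).

```shell
# Hypothetical helper: share of pushes and pushes per IssuesEvent.
# $1 = total events, $2 = PushEvent count, $3 = IssuesEvent count.
push_stats() {
    LC_ALL=C awk -v t="$1" -v p="$2" -v i="$3" 'BEGIN {
        printf "push share: %.1f%%\n", 100 * p / t
        printf "pushes per IssuesEvent: %.1f\n", p / i
    }'
}

# Counts copied from the listing above.
push_stats 408218 207746 19919
```

Note that IssuesEvent covers more than newly opened issues (closing and reopening also emit one), so the ratio against issues actually created would be somewhat higher.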
Pushes dominate - about 10 pushes for every IssuesEvent.
This is probably more than enough for an HN comment. It'll be fun to see what people do with this stuff this year. :)
I wasn't aware until today that you could use BigQuery on a recently-updated data set, though.