by peterwaller 4348 days ago

This just made me discover the GitHub Archive.

  $ wget http://data.githubarchive.org/2014-07-21-{0..23}.json.gz
  ...
  Downloaded: 24 files, 129M in 30s (4.25 MB/s)

Cool. A day's worth of public events is 129MB compressed. That's surprisingly small! Let's play for a second.

  $ ls *.gz | xargs -P4 -n1 gunzip
  $ du -sch *.json
  ...
  807M	total

Time to break out jq: https://stedolan.github.io/jq/manual/

  $ time jq .type *.json | wc -l
  408218

  real	0m16.788s
  user	0m16.366s
  sys	0m0.325s

That's an easy amount of data to mess with. If a day takes 16 seconds to process, I can do 14 years on my measly desktop in one day! 408k public events works out to around 5 a second. I somehow imagined events would flood into GitHub even faster than that. I wonder what their public/private activity ratio is.
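For anyone who wants to check the back-of-envelope numbers, here is a minimal sketch of the arithmetic. It only uses figures quoted in this comment (the event count and the `real` time from the jq run); the variable names are made up for illustration, and it assumes every day is about the same size as this one.

```shell
# Figures copied from the transcript above.
events=408218        # lines printed by `jq .type *.json | wc -l`
secs_per_day=86400   # seconds in one day
jq_secs=16.788       # "real" time of the jq run, in seconds

# Average public events per second over the day.
events_per_sec=$(awk -v e="$events" -v s="$secs_per_day" \
  'BEGIN { printf "%.2f", e / s }')

# Years of archive one machine could process in 24 hours at that speed,
# assuming every day costs about the same as this one.
years_per_day=$(awk -v s="$secs_per_day" -v t="$jq_secs" \
  'BEGIN { printf "%.1f", (s / t) / 365 }')

echo "events/sec: $events_per_sec"
echo "years/day:  $years_per_day"
```

So "around 5 a second" and "14 years in one day" both hold up, give or take rounding.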

Let's explore the event types:

  $ time jq .type *.json | sort | uniq -c | sort -n
      405 "PublicEvent"
      697 "TeamAddEvent"
     1018 "ReleaseEvent"
     1636 "MemberEvent"
     3166 "CommitCommentEvent"
     3892 "GollumEvent"
     6925 "DeleteEvent"
     7051 "PullRequestReviewCommentEvent"
    14807 "ForkEvent"
    18579 "PullRequestEvent"
    19919 "IssuesEvent"
    37942 "WatchEvent"
    38402 "IssueCommentEvent"
    46033 "CreateEvent"
   207746 "PushEvent"

Pushes dominate - roughly 10 pushes for every issue event (IssuesEvent covers issues being closed and reopened as well as opened).
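The ratio can be pulled straight out of the `uniq -c` output with awk. This is just a sketch: the two rows are pasted in from the listing above rather than recomputed, and the `counts` and `ratio` variable names are my own.

```shell
# Two rows copied from the `sort | uniq -c` listing above.
counts='  19919 "IssuesEvent"
 207746 "PushEvent"'

# Column 1 is the count, column 2 is the quoted event type as jq prints it.
ratio=$(printf '%s\n' "$counts" | awk '
  $2 == "\"PushEvent\""   { push = $1 }
  $2 == "\"IssuesEvent\"" { issues = $1 }
  END { printf "%.1f", push / issues }')

echo "pushes per IssuesEvent: $ratio"
```

In a real run the same awk program could presumably be fed from the pipeline itself instead of a pasted string.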

This is probably more than enough for an HN comment. It'll be fun to see what people do with this stuff this year. :)

1 comments

minimaxir 4348 days ago

The Google BigQuery implementation of the archive can do such a query across all the data in seconds.

I wasn't aware until today that you could use BigQuery on a recently-updated data set, though.

rsivapr 4348 days ago

I can confirm. That query took about 2 seconds. More discussion here: http://www.datatau.com/item?id=3608