by gandalfu 4110 days ago
Why would the US government send all of its traffic stats to Google? Why not use Piwik? http://piwik.org/

Am I being too paranoid? This is, after all, third-party software served by the government.


I think it's a reasonable question and I'm a big fan of piwik, but running it at this kind of scale is very hard. We get about 10k hits/day on the piwik instance we run and it's consistently taking more resources than the application it's tracking.
Are there any other viable open source traffic analysis tools than Piwik? I'd hate to have to roll my own.
Snowplow (I'm a co-founder) can happily scale to billions of events per day: https://github.com/snowplow/snowplow
Judging from the repo: "Collectors receive Snowplow events from trackers. Currently we have three different event collectors, sinking events either to Amazon S3 or Amazon Kinesis" (etc.). So is it still not viable to self-host Snowplow on your own hardware or internal cloud? Or is it possible, but only if you run a full cloud? (I understand why one would want a setup that runs on Amazon if one uses Amazon, but when you host your own infrastructure, a self-host option would be nice, if viable.)

Without an option to self-host, Snowplow isn't really an alternative to Piwik.

Hey e12e! It's a great question. You are right - at the moment Snowplow is still tied to the AWS cloud; we use a variety of AWS services which support massively horizontal processing, including Elastic MapReduce, Kinesis and Redshift. We are working on a Kafka+Samza version of Snowplow which we will release later this year, most likely running on a Mesos cluster that you can deploy where you want.
We have to move away from US hosted services, so we have to wait for the Kafka+Samza version if we go that route. Thanks!
https://github.com/divolte/divolte-collector is quite nice and can handle extreme loads
That is interesting too for us, as Kafka is possibly in our future too. Thanks!
I'm building something based on Splunk. Open source + free Splunk license.

http://www.mensk.com/traffic-ray-new-splunk-app-to-visualize...

Especially interesting would be a Go or Node.js project that uses a caching layer (like Memcached) to scale better than writing directly to a SQL/NoSQL database.
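
To make the caching-layer idea concrete, here is a minimal sketch of the write-behind pattern it implies: hits are aggregated in memory and written to the database in batches instead of one row per request. It is written in Python for brevity, and the same shape should carry over to Go or Node.js. Everything named here is an assumption for illustration: `BufferedHitCounter`, `flush_fn`, and `max_pending` are hypothetical, and an in-process counter stands in for Memcached.

```python
import threading
from collections import Counter


class BufferedHitCounter:
    """Aggregate page hits in memory and hand them to storage in batches.

    Hypothetical sketch: an in-process Counter stands in for a shared cache
    such as Memcached, and flush_fn stands in for the real database write.
    """

    def __init__(self, flush_fn, max_pending=1000):
        self._flush_fn = flush_fn          # called with a {path: count} dict
        self._max_pending = max_pending    # flush once this many hits are buffered
        self._pending = Counter()
        self._pending_total = 0
        self._lock = threading.Lock()

    def record(self, path):
        """Count one hit; flush the batch when the buffer is full."""
        batch = None
        with self._lock:
            self._pending[path] += 1
            self._pending_total += 1
            if self._pending_total >= self._max_pending:
                batch = self._swap()
        # Write outside the lock so slow storage does not block other requests.
        if batch:
            self._flush_fn(batch)

    def flush(self):
        """Write out whatever is buffered, e.g. from a timer or at shutdown."""
        with self._lock:
            batch = self._swap()
        if batch:
            self._flush_fn(batch)

    def _swap(self):
        # Caller must hold the lock: take the current batch, start a new one.
        batch = dict(self._pending)
        self._pending = Counter()
        self._pending_total = 0
        return batch
```

With `max_pending=1000`, a thousand requests turn into a single write carrying one count per distinct path. The trade-off is that hits still sitting in the buffer are lost if the process dies before a flush, which is usually acceptable for traffic stats.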
Having worked with the US government, I'm guessing that they don't really know what 18F is doing, or the implications.
The implications are really scary. I'm the first one to applaud the new digital office initiative and the talent behind it, but when it comes to third-party software the government should trust no one, no matter their competence.

Scary scenario #1: All of my interactions with the government are known to Google.

Scary scenario #2: Google CDN is compromised and malware is served to everyone!