by mlissner 4111 days ago
I think it's a reasonable question, and I'm a big fan of Piwik, but running it at this kind of scale is very hard. We get about 10k hits/day on the Piwik instance we run, and it consistently takes more resources than the application it's tracking.

Are there any viable open source traffic analysis tools other than Piwik? I'd hate to have to roll my own.
Snowplow (I'm a co-founder) can happily scale to billions of events per day: https://github.com/snowplow/snowplow
Judging from the repo: "Collectors receive Snowplow events from trackers. Currently we have three different event collectors, sinking events either to Amazon S3 or Amazon Kinesis" (etc.), is it still not viable to self-host Snowplow on your own hardware or internal cloud? Or is it possible, but only if you run a full cloud yourself? (I understand why one would want a setup that runs on Amazon if one uses Amazon, but when you host your own infrastructure, a self-host option would be nice, if viable.)

Without an option to self-host, Snowplow isn't really an alternative to Piwik.

Hey e12e! It's a great question. You are right - at the moment Snowplow is still tied to the AWS cloud; we use a variety of AWS services which support massively horizontal processing, including Elastic MapReduce, Kinesis and Redshift. We are working on a Kafka+Samza version of Snowplow which we will release later this year, most likely running on a Mesos cluster that you can deploy where you want.
We have to move away from US hosted services, so we have to wait for the Kafka+Samza version if we go that route. Thanks!
https://github.com/divolte/divolte-collector is quite nice and can handle extreme loads
That is interesting too for us, as Kafka is possibly in our future too. Thanks!
I'm building something based on Splunk. Open source + free Splunk license.

http://www.mensk.com/traffic-ray-new-splunk-app-to-visualize...

Especially interesting would be a Go or Node.js project that uses a caching layer (like Memcached) to scale better than writing directly to a SQL/NoSQL database.
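
To make the caching-layer idea concrete, here is a rough sketch of the pattern, not taken from any of the projects mentioned in this thread. It is written in Python rather than Go or Node.js purely for brevity. The `HitBuffer` class, the `hits` table, and the `flush_every` parameter are all hypothetical names; an in-process `Counter` stands in for Memcached and SQLite stands in for the real database.

```python
import sqlite3
from collections import Counter


class HitBuffer:
    """Count page hits in memory and write them to the database in batches.

    Hypothetical sketch: the Counter plays the role of a cache such as
    Memcached, and SQLite plays the role of the backing SQL database.
    """

    def __init__(self, conn, flush_every=1000):
        self.conn = conn
        self.flush_every = flush_every  # hits to buffer before writing
        self.pending = Counter()        # path -> hits not yet written
        self.buffered = 0
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS hits "
                "(path TEXT PRIMARY KEY, n INTEGER NOT NULL)"
            )

    def record(self, path):
        """Register one hit; touches the database only when the buffer is full."""
        self.pending[path] += 1
        self.buffered += 1
        if self.buffered >= self.flush_every:
            self.flush()

    def flush(self):
        """Write all buffered counts in a single transaction."""
        if not self.pending:
            return
        with self.conn:  # commits on success, rolls back on error
            for path, n in self.pending.items():
                self.conn.execute(
                    "INSERT OR IGNORE INTO hits (path, n) VALUES (?, 0)",
                    (path,),
                )
                self.conn.execute(
                    "UPDATE hits SET n = n + ? WHERE path = ?",
                    (n, path),
                )
        self.pending.clear()
        self.buffered = 0


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    buf = HitBuffer(conn, flush_every=3)
    for path in ["/", "/about", "/", "/pricing"]:
        buf.record(path)
    buf.flush()  # write whatever is still buffered
    print(conn.execute("SELECT path, n FROM hits ORDER BY path").fetchall())
```

The trade-off is durability: any hits still in the buffer are lost if the process dies before the next flush, so `flush_every` is a balance between write load and how much data you can afford to lose.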