by soared 2741 days ago
I mean, as a POC it's not bad, but Google Analytics is not the same as analyzing server logs (contrary to what most people would suggest). Most of the value of GA comes from session- and user-level metrics, which are 1000x more difficult to implement than showing pageviews. Unless you are planning on building a device graph that rivals Google's, you can't clone GA.

> google analytics is not the same as analyzing server logs

This is what most people don't get about GA.

Google Analytics does the heavy lifting by removing incoherent, corrupted, or malicious data insertion.

Let's say I use Puppeteer: I can scrape this page a million times with completely wrong headers like "Netscape 8.1". GA filters out this type of malicious attempt. It will probably look at my IP address, figure out that all the traffic is actually coming from a single IP and that "Netscape" is too rare to be considered an actual browser, and so it would probably ignore it.

None of the "free Google Analytics alternatives" that exist today have this type of mechanism to prevent data corruption.

In general they just get an HTTP request and acknowledge it as a legitimate visit.

Logging an HTTP request from a browser is not even a tenth of the work GA does under the hood.
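
To make the contrast concrete, here is a minimal sketch of the two approaches described above: counting every request as a visit versus dropping hits whose user agent shows up from almost no distinct IPs while that IP is unusually chatty. This is only an illustration of the kind of heuristic being speculated about, not a description of what GA actually does. The function names, the `(ip, user_agent)` tuple layout, and the thresholds are all assumptions made up for the example.

```python
from collections import Counter, defaultdict


def naive_pageviews(requests):
    """Treat every logged HTTP request as a legitimate visit."""
    return len(requests)


def filtered_pageviews(requests, min_distinct_ips=2, max_hits_per_ip=1000):
    """Count hits, skipping ones that look like a single noisy client.

    `requests` is a list of (ip, user_agent) tuples (hypothetical layout).
    A hit is skipped only when BOTH conditions hold:
      - its user agent was seen from fewer than `min_distinct_ips` IPs, and
      - its IP sent more than `max_hits_per_ip` hits in total.
    Both thresholds are arbitrary illustration values.
    """
    hits_per_ip = Counter(ip for ip, _ in requests)
    ips_per_ua = defaultdict(set)
    for ip, ua in requests:
        ips_per_ua[ua].add(ip)

    kept = 0
    for ip, ua in requests:
        rare_ua = len(ips_per_ua[ua]) < min_distinct_ips
        noisy_ip = hits_per_ip[ip] > max_hits_per_ip
        if not (rare_ua and noisy_ip):
            kept += 1
    return kept


# One client hammering the page with a bogus user agent, plus 50 ordinary visitors.
requests = [("198.51.100.9", "Netscape 8.1")] * 5000
requests += [(f"192.0.2.{i}", "Chrome") for i in range(1, 51)]

print(naive_pageviews(requests))
print(filtered_pageviews(requests))
```

The naive counter reports every request, while the filtered one keeps only the ordinary visitors. A real system would need far more than this, since a scraper that rotates IPs or sends a common user agent passes both checks.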

> Google Analytics does the heavy lifting by removing incoherent, corrupted, or malicious data insertion.

Unless it's referrer spam...that somehow still sticks around (at least last time I checked, which was several months ago).

If you hire an expert to set up your GA, referrer spam doesn't get through; it's super easy to filter beforehand.

I have to disagree here. GA is very advanced but still rather dumb about data collection, and it can be gamed in many ways. I'm saying this as a user of GA for 10+ years, along with their enterprise/premium suite.

I'm not sure how well that filtering works in practice. I think most of it is just that it only tracks clients that load JavaScript.

> Google Analytics does the heavy lifting by removing incoherent, corrupted, or malicious data insertion.

> Let's say I use Puppeteer: I can scrape this page a million times with completely wrong headers like "Netscape 8.1". GA filters out this type of malicious attempt. It will probably look at my IP address, figure out that all the traffic is actually coming from a single IP and that "Netscape" is too rare to be considered an actual browser, and so it would probably ignore it.

I do a lot of work with GA, and have seen this misperception brought up a few times. When it comes to data processing, GA is not intelligent. If you haven't told it to do something explicitly, it isn't doing it. And if you tell it to do something, it'll only do that for all new data and will make no attempts to do it to historical data.

- GA is relatively robust against web scraping due to the fact that most scrapers don't render the page. So the GA-related code on the page is never executed and a hit is never made to Google's servers. If the scraper is using a headless browser, such as Puppeteer, and renders the page, then it will in fact send that hit to GA.

- If you've checked the "Exclude bots" view setting[1], it will apply the IAB Spiders and Bots list to traffic[2]. This is a deterministic list of user-agent-based filters[3], and anyone can pay for access to it. Google just gives it to GA users for free via the Exclude bots setting.

- The Exclude bots setting does nothing beyond that. Scrapers like Puppeteer by default report their user agent as the version of Chromium they're using. These will show up just like any legitimate visitor to your site who is browsing with that same version of Chromium.

- GA has pretty robust filtering options[4]. But you have to create them manually, and they don't apply retroactively. You can filter IPs here, and only here. While you can apply reporting filters after the fact on a lot of fields, IP address isn't one of those fields. This makes it really frustrating to retroactively get rid of junk traffic, whether internal or automated/scraping. You can approximate it by getting creative with fields that make a good proxy. The only exception would be GA360/Google Marketing Platform customers, since they can access their clickstream data via BigQuery as part of their subscription.

- GA's interface will now give you really smart-looking notifications like "Filter internal traffic. Hits from your corporate network are showing up in property example.com". It's not doing anything super neat like dynamically cross-referencing your IP address, while you're in the admin area, against the collected data in your GA property. It's literally just triggering that warning based on the fact that you haven't applied any IP-based filters yet.
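
The processing model in the points above can be sketched roughly like this. It is a toy model, not GA's code: `Hit`, `View`, `add_ip_filter`, and the substring list are names invented for the example, and `BOT_UA_SUBSTRINGS` is a hypothetical stand-in for the licensed IAB list, whose real entries are not reproduced here.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Hit:
    ip: str
    user_agent: str
    path: str


# Hypothetical stand-in for the IAB Spiders and Bots list.
BOT_UA_SUBSTRINGS = ("bot", "spider", "crawler")


def is_listed_bot(user_agent: str) -> bool:
    """Deterministic user-agent match: no IP analysis, no rarity heuristics."""
    ua = user_agent.lower()
    return any(token in ua for token in BOT_UA_SUBSTRINGS)


@dataclass
class View:
    exclude_bots: bool = False
    excluded_networks: list = field(default_factory=list)
    stored: list = field(default_factory=list)

    def add_ip_filter(self, cidr: str) -> None:
        # Only affects hits processed after this call; rows already in
        # self.stored are never revisited.
        self.excluded_networks.append(ip_network(cidr))

    def process(self, hit: Hit) -> bool:
        """Apply the configured filters; return True if the hit was stored."""
        if self.exclude_bots and is_listed_bot(hit.user_agent):
            return False
        addr = ip_address(hit.ip)
        if any(addr in net for net in self.excluded_networks):
            return False
        # The IP is used for filtering but not kept with the stored row,
        # so it cannot be filtered on after the fact.
        self.stored.append({"user_agent": hit.user_agent, "path": hit.path})
        return True


view = View(exclude_bots=True)
chrome_ua = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36")

view.process(Hit("203.0.113.7", "ExampleBot/1.0", "/"))  # dropped: UA is on the list
view.process(Hit("203.0.113.7", chrome_ua, "/"))         # stored: looks like a normal browser
view.add_ip_filter("203.0.113.0/24")                     # filter created afterwards
view.process(Hit("203.0.113.7", chrome_ua, "/"))         # dropped from now on
print(len(view.stored))                                  # the earlier hit is still there
```

The rendering scraper with an ordinary browser user agent gets stored, and adding the IP filter later does nothing about the row that was already collected.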

There are quite a few other completely unintuitive aspects of GA that are rooted in the fact that its data processing model is incredibly straightforward, with very few exceptions and virtually no edge cases taken into account. That leads to a lot of instances where people's expectations diverge from the actual behavior. A good rule of thumb: if a particular functionality or metric seems even remotely like it would require extra computation or complexity to work the way you're imagining, then it's highly likely it doesn't work the way you think.

[1] https://support.google.com/analytics/answer/1010249?hl=en

[2] https://www.iab.com/guidelines/iab-abc-international-spiders...

[3] https://www.iab.com/wp-content/uploads/2015/11/IAB_SpidersBo...

[4] https://support.google.com/analytics/topic/1032939?hl=en&ref...

Great comment. I didn't know the Exclude bots toggle used an IAB list. That's excellent information for me to know. Thanks!