by jimktrains2 2741 days ago
Interesting. I once built a GA clone using App Engine, Cloud Dataflow, and BigQuery. I guess that would count as serverless? I benchmarked it against the official dumps to BigQuery too, and it was pretty spot on for every metric we could look up!
2 comments

Yes, I guess your setup is serverless as well. BigQuery is one of the serverless MPP databases and shares similar concepts with AWS Athena.
BigQuery takes a minimum of 2-3 seconds for every query.

Google Analytics is much faster; it responds in a few hundred milliseconds.

What did you use Dataflow for? How did you get data from the endpoints and insert it into BigQuery? Using streaming inserts?

> Google Analytics is much faster; it responds in a few hundred milliseconds.

Are you referring to their reporting API, or their collection endpoint? The collection endpoint is certainly fast to respond, but the actual reporting API can be quite slow depending on what you're trying to get from it.

> What did you use Dataflow for? How did you get data from the endpoints and insert it into BigQuery? Using streaming inserts?

I'm not the parent, but I've created setups like the one mentioned. It sounds like they hosted the collection endpoint on App Engine, then used Dataflow to stream the data into BigQuery, potentially with a Pub/Sub topic queuing data up for Dataflow, since Pub/Sub has native integrations with Dataflow and even has a template available to support it[1].

[1] https://cloud.google.com/dataflow/docs/guides/templates/prov...
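To make the collect-then-queue flow described above concrete, here is a minimal stdlib-only sketch. It is not the parent's code: `queue.Queue` is only a stand-in for a Pub/Sub topic, the returned list stands in for BigQuery rows, and the parameter names (`cid`, `t`, `dp`) and the 204 status are assumptions chosen for illustration.

```python
import json
import queue
from urllib.parse import parse_qs

# Stand-in for a Pub/Sub topic (assumption); a real setup would publish
# to the managed service instead of an in-process queue.
hit_queue = queue.Queue()


def collect(query_string, received_at):
    """Parse one tracking hit and enqueue it, doing no heavy work inline."""
    params = parse_qs(query_string)
    event = {
        # Hypothetical parameter names, chosen for illustration only.
        "client_id": params.get("cid", [""])[0],
        "hit_type": params.get("t", [""])[0],
        "page": params.get("dp", [""])[0],
        "received_at": received_at,
    }
    hit_queue.put(json.dumps(event))
    return 204  # assumed "no content" acknowledgement


def drain(q):
    """Pull every queued message and decode it into a row dict."""
    rows = []
    while True:
        try:
            rows.append(json.loads(q.get_nowait()))
        except queue.Empty:
            return rows
```

The point of the split is that the endpoint only parses and enqueues, which is why it can answer quickly, while the slower processing and the inserts happen downstream.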

> Google Analytics is much faster; it responds in a few hundred milliseconds.

GA stores summary tables for each day for the basic values. If you have a large site and request segments or anything that's not in the summary tables, it can be quite slow.

Also, BigQuery is multi-tenant. GA would have dedicated instances.
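The summary-table point can be sketched roughly like this. It is only an illustration under assumed field names (`date`, `client_id`) and says nothing about how GA actually implements it: basic metrics are answered from a pre-aggregated per-day table, while a segment forces a scan of the raw hits.

```python
from collections import defaultdict


def build_daily_summary(hits):
    """Pre-aggregate raw hits into one summary row per day."""
    summary = defaultdict(lambda: {"pageviews": 0, "users": set()})
    for hit in hits:
        day = summary[hit["date"]]
        day["pageviews"] += 1
        day["users"].add(hit["client_id"])
    # Replace the user sets with counts so each row is a small fixed-size record.
    return {
        date: {"pageviews": row["pageviews"], "users": len(row["users"])}
        for date, row in summary.items()
    }


def pageviews(summary, hits, date, segment=None):
    """Use the summary when possible; fall back to scanning raw hits for segments."""
    if segment is None:
        # Fast path: a single lookup in the pre-aggregated table.
        return summary.get(date, {"pageviews": 0})["pageviews"]
    # Slow path: the segment is not in the summary, so every raw hit is checked.
    return sum(1 for h in hits if h["date"] == date and segment(h))
```

Calling `pageviews(summary, hits, "2018-01-01")` is a dictionary lookup, whereas passing something like `segment=lambda h: h["country"] == "DE"` touches every hit, which is the slow case on a large site.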

> What did you use Dataflow for? How did you get data from the endpoints and insert it into BigQuery? Using streaming inserts?

cosmie pretty much got it. App Engine collected. Dataflow sessionized and did some other processing (geoip lookup, filtering, etc.). BigQuery stored.

I actually had App Engine dumping into Cloud Datastore, but I also experimented with Pub/Sub and with using Cloud Storage access logs.
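For anyone wondering what "sessionized" means in practice, here is a rough plain-Python sketch of the idea, not the actual Dataflow job. It assumes each hit carries a `client_id` and a `ts` in seconds, and uses a 30-minute inactivity gap as the session boundary; both the field names and the timeout are assumptions.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout


def sessionize(hits, gap=SESSION_GAP_SECONDS):
    """Group hits into per-client sessions split on `gap` seconds of inactivity."""
    sessions = []
    # Sort so each client's hits are contiguous and in time order.
    ordered = sorted(hits, key=itemgetter("client_id", "ts"))
    for client_id, client_hits in groupby(ordered, key=itemgetter("client_id")):
        current = []
        for hit in client_hits:
            # Close the running session when the pause since the last hit exceeds the gap.
            if current and hit["ts"] - current[-1]["ts"] > gap:
                sessions.append({"client_id": client_id, "hits": current})
                current = []
            current.append(hit)
        # groupby never yields an empty group, so `current` holds at least one hit.
        sessions.append({"client_id": client_id, "hits": current})
    return sessions
```

A streaming job would do this with windowing rather than by sorting a complete list, but the splitting rule is the same.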