by InGodsName 2741 days ago
BigQuery takes a minimum of 2-3 seconds for every query.

Google Analytics is much faster, responds in a few hundred milliseconds.

What did you use Dataflow for? How did you get data from endpoints and insert them into BigQuery? Using streaming inserts?

2 comments

> Google Analytics is much faster, responds in a few hundred milliseconds.

Are you referring to their reporting API, or their collection endpoint? The collection endpoint is certainly fast to respond, but the actual reporting API can be quite slow depending on what you're trying to get from it.

> What did you use Dataflow for? How did you get data from endpoints and insert them into BigQuery? Using streaming inserts?

I'm not the parent, but I've created setups like what was mentioned. It sounds like they hosted the collection endpoint on AppEngine, then used DataFlow for streaming the data into BigQuery. They potentially used a Pub/Sub topic to queue data up for DataFlow, since Pub/Sub has native integrations with DataFlow and there is even a template available to support it[1].

[1] https://cloud.google.com/dataflow/docs/guides/templates/prov...

> Google Analytics is much faster, responds in a few hundred milliseconds.

GA stores summary tables for each day for the basic values. If you have a large site and request segments or anything that's not in the summary tables, it can be quite slow.

Also, BigQuery is multi-tenant. GA would have dedicated instances.
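To illustrate the summary-table point, here is a minimal sketch in plain Python. It is not how GA is actually implemented; the hit fields (`day`, `page`, `country`) and the function names are made up for the example. The idea is only that a count already stored per day is a lookup, while a segment that isn't in the summary forces a scan over raw hits.

```python
from collections import Counter

# Hypothetical raw hits; field names are assumptions for this sketch.
raw_hits = [
    {"day": "2018-01-01", "page": "/", "country": "US"},
    {"day": "2018-01-01", "page": "/", "country": "DE"},
    {"day": "2018-01-01", "page": "/pricing", "country": "US"},
    {"day": "2018-01-02", "page": "/", "country": "US"},
]


def build_daily_summary(hits):
    """Pre-aggregate pageviews per (day, page), done once ahead of time."""
    summary = Counter()
    for hit in hits:
        summary[(hit["day"], hit["page"])] += 1
    return summary


def pageviews_from_summary(summary, day, page):
    """Basic report: a single lookup, independent of raw data size."""
    return summary[(day, page)]


def pageviews_for_segment(hits, day, page, country):
    """Segmented report: country is not in the summary, so scan every hit."""
    return sum(
        1
        for hit in hits
        if hit["day"] == day and hit["page"] == page and hit["country"] == country
    )


daily_summary = build_daily_summary(raw_hits)
basic = pageviews_from_summary(daily_summary, "2018-01-01", "/")
segmented = pageviews_for_segment(raw_hits, "2018-01-01", "/", "US")
```

The basic lookup stays cheap no matter how many hits there are, while the segmented query grows with the raw data, which is why a large site would notice the difference.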

> What did you use Dataflow for? How did you get data from endpoints and insert them into BigQuery? Using streaming inserts?

cosmie pretty much got it. AppEngine collected. DataFlow sessionized and did some other processing (geoip lookup, filtering, &c). BigQuery stored.
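For anyone wondering what "sessionized" means in practice, here is a rough sketch in plain Python. It is not the actual pipeline code: the 30-minute inactivity gap, the `(user_id, timestamp)` hit shape, and the function name are all assumptions. In a real Dataflow job this would more likely be done with Beam's session windowing than with a hand-rolled loop.

```python
from collections import defaultdict

# Assumed inactivity timeout; the thread does not say what gap was used.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(hits, gap=SESSION_GAP_SECONDS):
    """Group (user_id, unix_timestamp) hits into per-user sessions.

    A new session starts whenever the time since the same user's
    previous hit is greater than `gap` seconds.
    """
    by_user = defaultdict(list)
    for user_id, ts in hits:
        by_user[user_id].append(ts)

    sessions = []
    for user_id, timestamps in by_user.items():
        timestamps.sort()
        current = [timestamps[0]]
        for ts in timestamps[1:]:
            if ts - current[-1] > gap:
                # Gap too long: close the current session and start a new one.
                sessions.append((user_id, current))
                current = [ts]
            else:
                current.append(ts)
        sessions.append((user_id, current))
    return sessions


# Hypothetical hits: user "a" comes back after more than 30 minutes.
hits = [("a", 0), ("a", 600), ("b", 100), ("a", 4000)]
sessions = sessionize(hits)
```

Here user "a" ends up with two sessions (hits at 0 and 600, then the hit at 4000) and user "b" with one.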

I actually had AppEngine dumping into Cloud Datastore, but I also experimented with Pub/Sub and with Cloud Storage access logs.