Hacker News new | ask | show | jobs
by pentium10 3095 days ago
We are using BQ for 4 years now. Since the beginning with dual-writing to diferent engines. We did an architecture that routes data to Redis, Riak, Elasticsearch, Mysql, BigQuery. We often have 500+ lines of queries in BigQuery writen in standard SQL that runs our reporting system. The average query time we have is 10 seconds usually scanning at least a terabyte. We truly love the ability to write in Javascript the UDFs, and our marketing team loves to combine their Google Sheets (mapping data) with a real BQ query. We heavily use the 'dryrun" feature of BQ to see without running the query how much resources and costs will generate. We even implemented a quota for users, based on this feature, wich works great, and only when they do a mistake in a query targetting alerts them at a certain treshold. Afer a couple of years we naturally started to consolidate all our data into BigQuery. Since it's Standard SQL all our team members can start using and reporting + tooling works quite well. DataStudio is another service which we started to use more and more as well.
1 comments

How do you handle updates in BigQuery in case your queries require latest version of records? Do you use views? If yes does that affect the query latency too much?
During the 4 years we at REEA.net tried a lot of approaches, in different projects. 1. We stream every upate as a new row directly to the table. We have for every table a view that has a naming convention "table_latest" and it targets latest version of the record. We didn't see a big latency increase. Deleted rows have a 'deleted' flag, we use EXCEPT in the query. 2. Since DML is available we started on newer projects dual aproach, having a dedicated streaming table, and rematerializing to a flat table. It adds developer complexity but it's a win, as the materialized table is easier to be used by our marketing team. We added to the reports "Based on data 10 minutes ago" label, and a CTA "re-run using live data", which in background calls the script to rematerialize the table and rerun the report. Our markting team loves this, and they don't abuse this functionality. On some projcts we have remateralization every 1h, on others daily. 3. regarding partitioned tables where you can stream only 7 days of window into their respective partitions we use also a complementary load job. We stream in the 7 day window. And we have a heavy "future date" use case, where we use "load jobs" to a temp table, then query and write to their respective partitions.