Hacker News new | ask | show | jobs
by lysecret 906 days ago
I am currently working with about 100TB data on GCP with BigQuery as a query engine and simple hive partitioning like /key3=000/key2=002/. We are happy because we can run all the queries you want and it is insanely cheap. But latency is reaching quite high levels (it doesn't matter so much for us) but I was wondering, if implementing Iceberg would improve this? Has anyone experience with this?

Overall this kind of architecture is just awesome.

4 comments

There is nice summary on the topic: https://aws.amazon.com/blogs/big-data/choosing-an-open-table... ("Optimizing read performance"). Those technologies primary "Data Management at Scale" but they also extend capabilities provided by raw storage formats such as parquet. So they may help you, but the question if you are really need it. I haven't worked with BigQuery, it may include [similar features](https://cloud.google.com/bigquery/docs/search-index).

You need to define what "latency" means in your case and what is "quite high levels". We are talking about analytical data storage, it is designed for efficient batch processing. To find a single record is not a primary goal of the architecture - you will need some kind of caching/indexing for fast search. Sometimes adding "limit 1" for your single record search may solve the problem.

Be sure you are using efficent data storage format as parquet, check size of the files to be sure you don't have the ["small file problem"](https://www.royalcyber.com/blog/data-services/managing-small...), then check if you are using relevant BigQuery features. And before and after those checks run "explain" on your query, if you don't use partition keys or indexed columns your search results won't be instant in any big data system.

We're at ~1/10th your scale but query speed is a major concern due to (rightfully) impatient clients when it comes to data viz. Unless you're using BQ tables as input for high throughput compute, I'd focus on optimizing your BI tool or creating analysis tables that would prevent end-user slowdown.

Ex: Recently created a big table (by materializing fact/dimension table joins and COALESCE operations) solely for analysis purposes. It sits "outside" our normal data warehouse setup mentally, but we can still maintain data quality/lineage as it exists inside dbt. Allowed us to do away with Tableau fixed calculations and cut load/group by times for end users ~95%.

Iceberg won’t speed up your queries if you are using BQ native storage. It might speed up federated queries to GCS/S3
This looks pretty neat if you're ok moving to AWS https://www.boilingdata.com/