by lysecret
906 days ago
I am currently working with about 100 TB of data on GCP, with BigQuery as the query engine and simple Hive partitioning like /key3=000/key2=002/. We are happy because we can run all the queries we want and it is insanely cheap. Latency is reaching quite high levels, though (it doesn't matter so much for us), and I was wondering whether implementing Iceberg would improve this. Does anyone have experience with this? Overall this kind of architecture is just awesome.
You need to define what "latency" means in your case and what counts as "quite high levels". This is analytical data storage, designed for efficient batch processing. Finding a single record is not a primary goal of the architecture, so you will need some kind of caching or indexing for fast lookups. Sometimes adding "limit 1" to a single-record search may solve the problem.
Be sure you are using an efficient storage format such as Parquet, check the file sizes to make sure you don't have the ["small file problem"](https://www.royalcyber.com/blog/data-services/managing-small...), then check whether you are using the relevant BigQuery features. Before and after those checks, run "explain" on your query: if it doesn't use partition keys or indexed columns, your search results won't be instant in any big data system.
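
To make the two checks concrete, here is a rough Python sketch of what partition pruning and a small-file check amount to on a Hive-style layout. It is only an illustration, not how BigQuery does it internally: the file listing, the sizes, the 64 MiB threshold, and the helper names (`parse_partitions`, `prune`, `small_file_ratio`) are all made up for the example.

```python
from typing import Dict, List, Tuple

# Assumed threshold: files below this size count as "small" (tune for your setup).
SMALL_FILE_BYTES = 64 * 1024 * 1024

FileEntry = Tuple[str, int]  # (path, size in bytes)


def parse_partitions(path: str) -> Dict[str, str]:
    """Extract key=value segments from a Hive-style path."""
    parts: Dict[str, str] = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts


def prune(files: List[FileEntry], filters: Dict[str, str]) -> List[FileEntry]:
    """Keep only files whose partition values match every filter."""
    kept = []
    for path, size in files:
        parts = parse_partitions(path)
        if all(parts.get(key) == value for key, value in filters.items()):
            kept.append((path, size))
    return kept


def small_file_ratio(files: List[FileEntry]) -> float:
    """Fraction of files smaller than SMALL_FILE_BYTES."""
    if not files:
        return 0.0
    small = sum(1 for _, size in files if size < SMALL_FILE_BYTES)
    return small / len(files)


MIB = 1024 * 1024
# Hypothetical listing; in practice this would come from your bucket.
files = [
    ("/key3=000/key2=001/part-0.parquet", 512 * MIB),
    ("/key3=000/key2=002/part-0.parquet", 4 * MIB),
    ("/key3=000/key2=002/part-1.parquet", 300 * MIB),
    ("/key3=001/key2=002/part-0.parquet", 2 * MIB),
]

# A query filtering on both partition keys only needs to touch matching files.
to_scan = prune(files, {"key3": "000", "key2": "002"})
print(len(to_scan), "of", len(files), "files scanned")
print("small-file ratio:", small_file_ratio(files))
```

If a query does not filter on the partition keys, `prune` keeps everything and the engine has to read the whole dataset. A high small-file ratio means a lot of per-file overhead even when pruning works.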