by vgt 3284 days ago
Yeah, I think within the context of BigQuery the most sensible thing would be to run an aggregate keyed on the column that would be considered a primary key. For example, see [0]. That said, the Streaming API's de-dupe window is very nice in practice.
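
To make "aggregate per primary key" concrete, here is a rough sketch of the logic in plain Python rather than BigQuery SQL. It is not the query from [0]. The function name, the `event_id` and `ts` columns, and the "keep the latest row" rule are all assumptions for illustration.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def dedupe_latest(rows: Iterable[Row], key: str, order_by: str) -> List[Row]:
    """Keep one row per `key`, choosing the row with the greatest `order_by`.

    Hypothetical helper: it mirrors the idea of grouping on a primary-key
    column and keeping a single representative row per group.
    """
    latest: Dict[Any, Row] = {}
    for row in rows:
        current = latest.get(row[key])
        # Strict '>' means the first row seen wins when order_by values tie.
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return list(latest.values())


# Made-up sample data: "a" arrives three times, once as an exact duplicate.
rows = [
    {"event_id": "a", "ts": 1, "value": 10},
    {"event_id": "a", "ts": 2, "value": 11},
    {"event_id": "b", "ts": 1, "value": 20},
    {"event_id": "a", "ts": 2, "value": 11},
]
deduped = dedupe_latest(rows, key="event_id", order_by="ts")
```

In a warehouse you would express the same thing as a grouped or windowed query over the key column. The point is only that the result has exactly one row per key.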

As I mentioned elsewhere, on Google Cloud the most elegant way of doing this is with Google Cloud Dataflow [1].

(work at G)

[0] https://stackoverflow.com/questions/38446499/bigquery-dedupl...

[1] https://cloud.google.com/blog/big-data/2017/06/how-qubit-ded...