| > but every time I try to use it, I get stuck on "how do I get data into it reliably" That's the same stage where I get stuck every time. I have data emitters (in this example, let's say my household IoT devices, feeding an MQTT broker and then HomeAssistant). I have a place where I want the data to end up (Clickhouse, a database, S3, whatever). How do I get the data from A to B so that there are no duplicate rows (when the ACK for an upload isn't received even though the upload succeeded), no missing rows (the data is retried if an upload fails), and some protection if the local system goes down (the data isn't ephemeral)? The easiest approach I've found is writing data locally to files (JSON, parquet, whatever), starting a new file every 5 minutes, and syncing the older files to S3. But then I'm stuck again. How do I continually load new files from S3 without any repetition or edge cases? And did I really need the intermediate files? |
Duplicates get merged out, and errors can be handled at the HTTP level. (Admittedly, one bad row in a big batch POST is a pain, but I don’t see that much.)
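
To make the "lost ACK" case concrete, here is a rough sketch of the idea in plain Python, not tied to any particular database. Everything in it is made up for illustration: `row_id`, `DedupSink`, `LostAckSink` and `send_with_retry` are hypothetical names, and the in-memory dict only stands in for a destination that merges rows sharing a key. The assumption is that each row gets a deterministic ID, so resending a batch after a lost ACK cannot create a second copy.

```python
import hashlib
import json


def row_id(row: dict) -> str:
    """Deterministic ID: the same row content always hashes to the same key."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DedupSink:
    """Hypothetical stand-in for a destination that merges duplicates by key."""

    def __init__(self):
        self.rows = {}

    def post_batch(self, batch):
        for row in batch:
            # Writing the same row twice overwrites it instead of duplicating it.
            self.rows[row_id(row)] = row


class LostAckSink(DedupSink):
    """Stores the batch, but pretends the ACK was lost on the first call."""

    def __init__(self):
        super().__init__()
        self.calls = 0

    def post_batch(self, batch):
        super().post_batch(batch)
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("ACK lost after a successful write")


def send_with_retry(sink, batch, attempts=3):
    """Resend the whole batch until it is acknowledged (at-least-once delivery)."""
    for attempt in range(1, attempts + 1):
        try:
            sink.post_batch(batch)
            return attempt
        except ConnectionError:
            continue
    # In a real setup the batch would stay in the local spool for a later retry.
    raise RuntimeError("batch not delivered")


batch = [
    {"device": "thermostat", "ts": 1700000000, "temp_c": 21.5},
    {"device": "door", "ts": 1700000003, "open": True},
]
sink = LostAckSink()
attempts_used = send_with_retry(sink, batch)
```

The sender retries blindly and the destination absorbs the repeat, so the batch is sent twice but stored once. The part this sketch leaves out is durability on the sending side: the batch would still need to live somewhere persistent (a file, a local queue) until the ACK arrives.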