by dikei 1498 days ago

It's remarkable how the data pipelines at almost all companies converge on the same architecture:

* You have services emit data into streams.

* You dump the streams into your storage at high frequency so you can get near real-time results. This process creates many small files.

* Because small files are inefficient, you have compactors that run over the small files and merge them into bigger files, and/or delete records that are obsolete.

* You run a query engine that reads over both the small files and the large files to get the final result.

* To speed up steps 2, 3, and 4, you store the metadata of the files in memory or in a database.
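
To make the shape of that architecture concrete, here is a toy in-memory sketch in Python. Everything in it is an assumption for illustration: the `Record` and `Store` names, the `seq` field used to order writes, the use of `None` as a delete marker, and the min/max key range kept as file metadata are all hypothetical and not taken from any particular system. It also does a full compaction (all files merged into one), which is the simplest case in which dropping deleted records is safe.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Record:
    key: str
    seq: int               # write order; a higher seq is a newer write (assumed)
    value: Optional[str]   # None marks a delete (hypothetical tombstone convention)


@dataclass
class Store:
    files: dict = field(default_factory=dict)    # file name -> list of records ("storage")
    catalog: dict = field(default_factory=dict)  # file name -> (min key, max key) ("metadata")
    next_id: int = 0

    def _write(self, prefix, records):
        name = f"{prefix}-{self.next_id:04d}"
        self.next_id += 1
        self.files[name] = list(records)
        keys = [r.key for r in records]
        self.catalog[name] = (min(keys), max(keys))
        return name

    def flush(self, batch):
        """Dump one small batch from the stream as a small file."""
        return self._write("small", batch)

    def compact(self):
        """Merge every live file into one large file.

        Keeps only the newest record per key and drops deleted keys.
        Dropping deletes is safe here only because all files are merged.
        """
        names = list(self.catalog)
        latest = {}
        for name in names:
            for r in self.files[name]:
                if r.key not in latest or r.seq > latest[r.key].seq:
                    latest[r.key] = r
        survivors = [r for r in latest.values() if r.value is not None]
        for name in names:
            del self.files[name]
            del self.catalog[name]
        return self._write("large", survivors) if survivors else None

    def get(self, key):
        """Read across small and large files, newest record wins.

        The catalog lets us skip files whose key range cannot contain the key.
        """
        best = None
        for name, (lo, hi) in self.catalog.items():
            if not (lo <= key <= hi):
                continue
            for r in self.files[name]:
                if r.key == key and (best is None or r.seq > best.seq):
                    best = r
        if best is None or best.value is None:
            return None
        return best.value


store = Store()
store.flush([Record("a", 1, "v1"), Record("b", 2, "v1")])
store.flush([Record("a", 3, "v2")])   # newer value for "a"
store.flush([Record("b", 4, None)])   # delete "b"

before = (store.get("a"), store.get("b"), len(store.catalog))
large_name = store.compact()
after = (store.get("a"), store.get("b"), len(store.catalog))
```

Queries return the same answers before and after compaction; only the number of files the reader has to consider changes. Real systems would typically compact incrementally and keep richer metadata than a key range, but the division of labor is the same.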