by dikei
1498 days ago
It's remarkable how the data pipeline in almost all companies converges to the same architecture:

* You have services emit data into streams.

* You dump the streams into your storage at high frequency so you can have near-real-time results. This process creates many small files.

* Because small files are inefficient, you have compactors that run over the small files and merge them into bigger files, and/or delete records that are obsolete.

* You run a query engine that reads over both the small files and the large files to get the final result.

* To speed up steps 2, 3, and 4 (ingestion, compaction, and querying), you store the metadata of the files in memory or in a database.
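A toy, in-memory sketch of the five steps, for illustration only. Every name here (`Catalog`, `ingest`, `compact`, `query`) is hypothetical, and the record layout (key mapped to a `(version, value)` pair, with `None` as a delete marker) is an assumption, not something any particular system prescribes:

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    name: str
    # key -> (version, value); value None marks the key as deleted (assumed layout)
    records: dict


@dataclass
class Catalog:
    """Step 5: metadata about live files, kept in memory."""
    files: dict = field(default_factory=dict)
    next_id: int = 0

    def add(self, records, kind):
        name = f"{kind}-{self.next_id}"
        self.next_id += 1
        self.files[name] = DataFile(name, dict(records))
        return name


def merge(files):
    """Combine files, keeping the highest version seen for each key."""
    merged = {}
    for f in files:
        for key, (version, value) in f.records.items():
            if key not in merged or version > merged[key][0]:
                merged[key] = (version, value)
    return merged


def ingest(catalog, batch):
    """Step 2: flush one small batch from the stream into a new small file."""
    return catalog.add(batch, "small")


def compact(catalog):
    """Step 3: merge all live files into one big file and drop deleted records."""
    live = list(catalog.files.values())
    merged = {k: v for k, v in merge(live).items() if v[1] is not None}
    for f in live:
        del catalog.files[f.name]
    return catalog.add(merged, "big")


def query(catalog):
    """Step 4: read over small and big files alike; the newest version wins."""
    merged = merge(catalog.files.values())
    return {k: value for k, (_, value) in merged.items() if value is not None}


catalog = Catalog()
ingest(catalog, {"a": (1, 10), "b": (1, 20)})  # step 1 is whatever produced these batches
ingest(catalog, {"a": (2, 11)})                # newer version of "a"
ingest(catalog, {"b": (3, None)})              # "b" becomes obsolete
before = query(catalog)
compact(catalog)
after = query(catalog)
```

The query result is the same before and after compaction; only the number of files the engine has to read changes. A real compactor would usually merge a subset of files rather than all of them, in which case delete markers cannot be dropped so eagerly.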