Ask HN: Any pragmatic guides to building large data pipelines?
6 points
by elt
1696 days ago
I am rebuilding a data pipeline that processes billions of records. An overview of what I built is as follows: collect data from (n*k) sources -> derive new data -> generate a unified/merged collection of data (n). The current solution is all hand-crafted code. I know this is a 10,000-foot view of the problem, but are there any guides or books on how to better design and implement this type of solution?
You can get pretty far with R or Pandas + SciPy on a fast machine. After that, you start taking on the extra hassle of Spark or whatever fits your situation.
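For what it's worth, here is a rough sketch of how the three stages from the question (collect -> derive -> merge) might look on a single machine with Pandas. Everything specific in it is made up for illustration: the source names, the `key`/`price`/`qty` columns, and the derived `total` column are assumptions, not anything from the actual pipeline.

```python
import pandas as pd


def collect(sources):
    """Stack records from every source into one frame, tagging rows with their source."""
    frames = [pd.DataFrame(rows).assign(source=name) for name, rows in sources.items()]
    return pd.concat(frames, ignore_index=True)


def derive(df):
    """Add a derived column (hypothetical rule: total = price * qty)."""
    return df.assign(total=df["price"] * df["qty"])


def merge(df):
    """Collapse to one row per key, summing across all sources."""
    return df.groupby("key", as_index=False).agg(
        qty=("qty", "sum"),
        total=("total", "sum"),
    )


# Hypothetical in-memory sources; in practice these would be files, APIs, or tables.
sources = {
    "a": [{"key": "x", "price": 2.0, "qty": 3}, {"key": "y", "price": 1.0, "qty": 5}],
    "b": [{"key": "x", "price": 2.5, "qty": 2}],
}

unified = merge(derive(collect(sources)))
print(unified)
```

Keeping each stage a plain function from frame to frame means you can test them separately, and if the data outgrows one machine the same shape carries over reasonably well to Spark or similar.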
Oh, and 0) pain that's motivating the rebuild. Feel free to e-mail me even just to rubber duck your thinking.