by dm3
334 days ago
Looks like we're in a similar situation. What is your current go-to for setting up lean incremental data pipelines? For me the core of the solution - parquet in object store at rest and arrow for IPC - hasn't changed in years, but I'm tired of rebuilding the whole metadata layer and job dependency graphs at every new place. Of course the building blocks get smarter with time (SlateDB, DuckDB, etc.), but it's all so tiresome.
On the front end I've always had reasonable outcomes with `wandb` for tracking runs once you kind of get it all set up nicely, but there's a long tail of configuration and glue code to write.
In this situation I'm dealing with a pretty medium amount of data and very modest model training needs (closer to `sklearn` than some mega-CUDA thing), and it feels like I should be able to give someone the company card and just get one of those things with 7 programming languages at the top of the monospace text box for "here's how to log a row": we do Smart Things, and now you have this awesome web dashboard, and you can give your quants this `curl foo | sh` snippet and their VSCode Jupyter will be awesome.