| HN Mirror


by pid-1 1178 days ago

> Okay, so, where is your training data? Is your training data in a layout which your training code can just linearly scan on S3? Or do you have to transform them first? Or provision a dataset cache on-demand? Is this data engineering or training orchestration?

Not claiming this is the only or best solution, but the way my team solved that was by creating an internal Python lib with common happy paths to access our infrastructure and processes. We deploy our data pipelines as FastAPI services and call them using Airflow. This architecture has scaled really well: we have 300+ data pipelines, even more schedules and 3 engineers. We use Knative so our AWS bill is quite cheap for the number of services we are running.
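To make the "internal lib with common happy paths" idea concrete, here is a minimal sketch of what one such helper might look like. This is not the commenter's actual library: the `PipelineRun` class, the `trigger` function, and the `/pipelines/<name>/run` route are all hypothetical, and it uses only the standard library instead of any specific framework client. The idea is that a scheduler task (for instance an Airflow task) would call `trigger` rather than hand-rolling HTTP calls to each pipeline service.

```python
"""Hypothetical internal helper: one happy path for triggering a pipeline service."""
import json
import urllib.request
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class PipelineRun:
    """Describes one invocation of a pipeline exposed as an HTTP service."""

    service_url: str  # assumed base URL of the pipeline service
    pipeline: str  # hypothetical pipeline name
    params: dict[str, Any] = field(default_factory=dict)

    def to_request(self) -> urllib.request.Request:
        """Build the POST request without sending it, so it is easy to unit test."""
        body = json.dumps(self.params, sort_keys=True).encode("utf-8")
        return urllib.request.Request(
            # The route layout below is an assumption for illustration only.
            url=f"{self.service_url.rstrip('/')}/pipelines/{self.pipeline}/run",
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )


def trigger(run: PipelineRun, timeout: float = 30.0) -> dict[str, Any]:
    """Send the request and return the decoded JSON response.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses, so a
    scheduler task that calls this function fails loudly when the
    pipeline service reports an error.
    """
    with urllib.request.urlopen(run.to_request(), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Splitting request construction from sending keeps the helper testable without a network, which is the kind of ordinary software practice the comment is arguing for.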

It all boiled down to treating ML / data engineering problems as common software problems.

1 comment

rfoo 1178 days ago

Thanks for sharing your experience.

Yeah, that's my point: it's hardly a solved problem, and you have to write software for this!
