


	by pid-1 1179 days ago
	Why isn't that solved by k8s + a node autoscaler such as Karpenter?

2 comments

rfoo 1179 days ago

That's a viable solution, but since GPU instances are expensive, you really want to make the most of them. Ideally the GPU should be busy within 30 seconds of instance launch.

Okay, so, where is your training data? Is it in a layout that your training code can just linearly scan on S3? Or do you have to transform it first? Or provision a dataset cache on demand? Is this data engineering or training orchestration?


pid-1 1178 days ago

> Okay, so, where is your training data? Is it in a layout that your training code can just linearly scan on S3? Or do you have to transform it first? Or provision a dataset cache on demand? Is this data engineering or training orchestration?

Not claiming this is the only or best solution, but the way my team solved that was by creating an internal Python lib with common happy paths to access our infrastructure and processes. We deploy our data pipelines as FastAPI services and call them using Airflow. This architecture has scaled really well: we have 300+ data pipelines, even more schedules and 3 engineers. We use Knative so our AWS bill is quite cheap for the number of services we are running.

It all boiled down to treating ML / data engineering problems as common software problems.
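A minimal sketch of what one "happy path" helper in such an internal library might look like, assuming each pipeline is exposed as an HTTP service that a scheduler task calls. The base URL, the `/pipelines/<name>/run` route, and both function names are hypothetical; the comment above does not describe the actual interface. Only the standard library is used so the idea stays independent of any particular framework.

```python
import json
import urllib.request

# Hypothetical base URL; the thread does not say how the services are addressed.
PIPELINE_BASE_URL = "http://pipelines.internal.example"


def build_pipeline_request(pipeline: str, params: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that triggers one pipeline run."""
    body = json.dumps(params).encode("utf-8")
    return urllib.request.Request(
        url=f"{PIPELINE_BASE_URL}/pipelines/{pipeline}/run",  # hypothetical route
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_pipeline(pipeline: str, params: dict, timeout: float = 30.0) -> dict:
    """Send the request and return the decoded JSON response.

    A scheduler task (an Airflow task, in the setup described above) would
    call this instead of hand-writing HTTP code in every DAG.
    """
    request = build_pipeline_request(pipeline, params)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the helper easy to unit test without a running service.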


rfoo 1178 days ago

Thanks for sharing your experience.

Yeah, that's my point, it's hardly a solved problem and you have to write software for this!


tomrod 1178 days ago

Operationally very simple: ELT -> GUID-based naming convention on S3 or Lustre on FSx (name and keep if preserving data, not replication steps) -> Point GPU instance to data (e.g. SageMaker can transfer data stored on S3 with different approaches and costs, YMMV) -> Poll training job -> Spin down GPU when complete.

ELT = data engineering. Model architecture & training design = MLE. MLOps is the storage of the training data, monitoring of the whole process, caching of the model for use in serving and deployment, and retiring of resources. MLOps has some overlap with DataOps, e.g. caching of training data and serving of the model as an application, but monitors for different things like data/concept drift.
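The flow described above (name the ELT output by GUID, start the job, poll, spin down) could be sketched roughly as follows. This is an illustration under stated assumptions, not any cloud SDK's API: `get_status` and `release_gpu` are hypothetical callables you would back with your provider's client, and the terminal status names are placeholders.

```python
import time
import uuid
from typing import Callable


def dataset_prefix(bucket: str, dataset: str) -> str:
    """Return a unique S3-style prefix for one ELT output, named by GUID."""
    return f"s3://{bucket}/{dataset}/{uuid.uuid4()}/"


def wait_for_job(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 1000,
) -> str:
    """Poll until the job reaches a terminal state and return that state."""
    terminal = {"Completed", "Failed", "Stopped"}  # placeholder status names
    for _ in range(max_polls):
        status = get_status()
        if status in terminal:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("training job did not finish within the polling budget")


def run_training(
    get_status: Callable[[], str],
    release_gpu: Callable[[], None],
    poll_seconds: float = 30.0,
) -> str:
    """Poll the job, then always release the GPU, even if polling fails."""
    try:
        return wait_for_job(get_status, poll_seconds)
    finally:
        release_gpu()
```

Putting the release call in `finally` means a failed or timed-out job does not leave an expensive instance running.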


bostonsre 1178 days ago

You don't want idle containers on GPUs. Something like KServe, which sits on Knative (similar to AWS Lambda), is pretty useful and allows scaling deployments to 0. It buffers requests in front of the containers and scales based on the number of concurrent requests a container can support. Since almost all of these deployed model inference services are GPU- and CPU-bound, you don't want to route more requests to a container than it can handle, because CPU/GPU contention harms throughput.
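To make the concurrency-based scaling idea concrete, here is a deliberately simplified sketch. It is not Knative's or KServe's actual autoscaling algorithm; the function names and the `max_replicas` cap are assumptions for illustration. It only shows the arithmetic: replicas follow in-flight requests divided by per-container concurrency, drop to zero when idle, and anything beyond current capacity waits in a buffer.

```python
import math


def desired_replicas(in_flight: int, container_concurrency: int, max_replicas: int) -> int:
    """Replicas needed so no container gets more requests than it can handle."""
    if container_concurrency < 1:
        raise ValueError("container_concurrency must be at least 1")
    if in_flight <= 0:
        return 0  # scale to zero: no idle containers holding GPUs
    return min(max_replicas, math.ceil(in_flight / container_concurrency))


def split_requests(in_flight: int, replicas: int, container_concurrency: int) -> tuple[int, int]:
    """Return (dispatched, buffered) for the current number of replicas."""
    capacity = replicas * container_concurrency
    dispatched = min(in_flight, capacity)
    return dispatched, in_flight - dispatched
```

For example, with 9 in-flight requests and a per-container limit of 4, three replicas are needed; with only two running, 8 requests are dispatched and 1 is buffered until more capacity is available.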
