by tomrod
1178 days ago
Operationally very simple: ELT -> GUID-based naming convention on S3 or FSx for Lustre (name and keep if you're preserving the data, not the replication steps) -> point the GPU instance at the data (e.g. SageMaker can pull data stored on S3 in several ways with different costs, YMMV). Poll the training job. Spin down the GPU when it completes. ELT = data engineering. Model architecture & training design = MLE. MLOps is the storage of the training data, monitoring of the whole process, caching of the model for serving and deployment, and retiring of resources. MLOps has some overlap with DataOps, e.g. caching of training data and serving the model as an application, but it monitors for different things, like data/concept drift.
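
For what it's worth, the "GUID-based naming, then poll the job" part can be sketched roughly like this. It's a minimal sketch, not anyone's production setup: `make_dataset_prefix`, `poll_until_done`, the bucket/stage names and the polling interval are all hypothetical, and the status getter is injected so the loop runs without AWS credentials. The terminal states assume SageMaker's `TrainingJobStatus` values.

```python
import time
import uuid
from typing import Callable

# Assumed terminal values of SageMaker's TrainingJobStatus field.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def make_dataset_prefix(bucket: str, stage: str) -> str:
    """Build a GUID-based S3 prefix so each ELT output gets a unique name.

    The s3://bucket/stage/guid/ layout is a hypothetical convention.
    """
    run_id = uuid.uuid4()
    return f"s3://{bucket}/{stage}/{run_id}/"


def poll_until_done(
    get_status: Callable[[], str],
    interval_s: float = 30.0,
    max_polls: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status() until it returns a terminal state, then return it."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        sleep(interval_s)
    raise TimeoutError(f"job still running after {max_polls} polls")


# With boto3, the status getter could plausibly be wired up as:
#   sm = boto3.client("sagemaker")
#   get_status = lambda: sm.describe_training_job(
#       TrainingJobName="my-job")["TrainingJobStatus"]
# ("my-job" is a placeholder name.)
```

Once `poll_until_done` returns, that's the point where you'd spin down the GPU and retire whatever else the job was holding, whether the status is `Completed` or `Failed`.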