| It depends on your starting point. A baseline level of ML knowledge is needed. Beyond that, ML platforms cover three basic functions: features/data, model training, and model hosting. So do an end-to-end project where you:

- start from a CSV dataset, with the goal of predicting some output column. A classic example is predicting whether a household's income is >$50K or not from census information.
- transform/clean the data in a Jupyter notebook and engineer features for input into a model. Export the features to disk in a format suitable for training.
- train a simple linear model using a chosen framework: a regressor if you're predicting a numerical field, a classifier if it's categorical.
- iterate on model evaluation metrics through more feature engineering, scoring the model on unseen data to see its actual performance.
- export the model in such a way that it can be loaded or hosted. The format largely depends on the framework.
- construct a Docker container that exposes the model over HTTP, with a handler that receives prediction requests and transforms them for input into the model, and a client that sends requests to that model.

That'll basically get you an end-to-end run through the entire MLE lifecycle. Every other part of development is a series of concentric loops between these steps, scaled out to ridiculous scale in several dimensions: number of features, size of dataset, steps in a data/feature processing pipeline to generate training datasets, model architecture and hyperparameters, latency/availability requirements for model servers...

For bonus points:

- track metrics and artifacts using a local MLflow deployment.
- compare performance for different models.
- examine feature importance to remove unnecessary (or net-negative) features.
- use a NN model and train on GPU. Use profiling tools (depends on the framework) and NVIDIA Nsight to examine performance. Optimize.
- host a big model on GPU. Profile and optimize.
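To make the steps above concrete, here is a minimal sketch of the same loop using only the Python standard library. Everything specific in it is an assumption for illustration: the CSV is synthetic (not real census data), the column names (`age`, `hours_per_week`, `education`, `high_income`), the labelling rule, and the `handle_request` function are all made up, and a hand-rolled logistic regression stands in for whatever framework you would actually pick. The Docker/HTTP layer is left out; `handle_request` is roughly what would sit behind the HTTP endpoint.

```python
import csv
import io
import json
import math
import random


def make_csv(n=400, seed=0):
    """Step 1: build a SYNTHETIC CSV (hypothetical columns, not real census data)."""
    rng = random.Random(seed)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["age", "hours_per_week", "education", "high_income"])
    for _ in range(n):
        age = rng.randint(18, 70)
        hours = rng.randint(10, 60)
        edu = rng.choice(["hs", "college"])
        # Made-up labelling rule plus noise, so there is a signal to learn.
        score = 0.04 * (age - 40) + 0.08 * (hours - 35)
        score += 1.0 if edu == "college" else -1.0
        label = 1 if score + rng.gauss(0, 0.5) > 0 else 0
        writer.writerow([age, hours, edu, label])
    return buf.getvalue()


def featurize(row):
    """Step 2: scale numeric columns and one-hot encode the categorical one."""
    return [
        (float(row["age"]) - 40.0) / 15.0,
        (float(row["hours_per_week"]) - 35.0) / 15.0,
        1.0 if row["education"] == "college" else 0.0,
    ]


def predict_proba(model, x):
    z = model["bias"] + sum(w * xi for w, xi in zip(model["weights"], x))
    z = max(-50.0, min(50.0, z))  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, epochs=300, lr=0.5):
    """Step 3: linear classifier (logistic regression, batch gradient descent)."""
    n_features = len(X[0])
    model = {"weights": [0.0] * n_features, "bias": 0.0}
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, target in zip(X, y):
            err = predict_proba(model, x) - target
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        model["bias"] -= lr * grad_b / len(X)
        model["weights"] = [
            w - lr * g / len(X) for w, g in zip(model["weights"], grad_w)
        ]
    return model


def accuracy(model, X, y):
    hits = sum((predict_proba(model, x) >= 0.5) == bool(t) for x, t in zip(X, y))
    return hits / len(X)


rows = list(csv.DictReader(io.StringIO(make_csv())))
X = [featurize(r) for r in rows]
y = [int(r["high_income"]) for r in rows]

# Step 4: hold out the last 20% of rows as unseen data for evaluation.
split = int(0.8 * len(rows))
model = train(X[:split], y[:split])
test_accuracy = accuracy(model, X[split:], y[split:])

# Step 5: export the model; JSON is just a convenient choice for this sketch.
exported = json.dumps(model)


def handle_request(body, model_json=exported):
    """Step 6: hypothetical handler: raw JSON request in, JSON prediction out."""
    loaded = json.loads(model_json)
    p = predict_proba(loaded, featurize(json.loads(body)))
    return json.dumps({"high_income_probability": p, "prediction": int(p >= 0.5)})


if __name__ == "__main__":
    print("held-out accuracy:", round(test_accuracy, 3))
    print(handle_request('{"age": 45, "hours_per_week": 50, "education": "college"}'))
```

The point to notice is that `featurize` is shared between training and the request handler. Keeping that transformation identical in both places is one of the things a real feature pipeline and model server have to guarantee at scale.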
IMO: the biggest missing piece for ML systems/platform engineers is knowing how to feed GPUs. If you can right-size MLE workloads and keep a GPU fed, you'll get hired. MLE workloads vary wildly (ratio of data volume in to compute; size of model; balance of CPU compute for feature processing against GPU compute for model training). We're all working under massive GPU scarcity. |
curious: which part of the pipeline does the majority of 'business' value come from?