by viig99
2116 days ago
ML engineer here. The team started as a research team; now that we have things in production and a lot of devops and engineering work, we've split into pods that each work on specific pieces, though there's still a lot of constant fire-fighting. Re-wrote the entire stack from Python to a C++ thread-pool async gRPC server (is Thrift the only good thread-pool server implementation available?), deployed on OpenShift. We use Vector + Influx + Grafana for dashboards and internal model monitors, Elasticsearch for logging, and a lot of other tools for validation, filtering potential training candidates, etc. Currently working on CI/CD for ML: if training produces a better model according to several validation sets, a one-click deployment is queued up for approval, etc.
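To make the CI/CD gate above concrete, here is a minimal sketch of one way such a check could look. It is an illustration only, not the setup described in the comment: the `Candidate` class, the `min_gain` threshold, the "higher is better" metric, and the `pending_approval` status are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Candidate:
    """Hypothetical record of a trained model and its validation scores."""
    name: str
    scores: Dict[str, float]  # validation set name -> metric, higher is better


def beats_on_all_sets(candidate: Candidate, production: Candidate,
                      min_gain: float = 0.0) -> bool:
    """True if the candidate beats production by more than min_gain on
    every validation set that production was scored on."""
    return all(
        set_name in candidate.scores
        and candidate.scores[set_name] > prod_score + min_gain
        for set_name, prod_score in production.scores.items()
    )


def propose_deployment(candidate: Candidate,
                       production: Candidate) -> Optional[dict]:
    """Return a deployment request awaiting human approval, or None.

    Nothing is deployed here; the request only marks the candidate as
    ready for a one-click approval step."""
    if beats_on_all_sets(candidate, production):
        return {"model": candidate.name, "status": "pending_approval"}
    return None
```

Requiring a win on every validation set is a conservative choice; a real pipeline might instead weight the sets or tolerate small regressions on some of them.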
> Re-wrote entire stack from python to C++ threadpool async grpc
Incredible. Presumably this is for latency/performance on the inference side?