

by asampat 3027 days ago

We ran into the problem of building CV models a lot as grad students, but back then reliability wasn't something we had to solve for. Once we had to implement them for industrial applications, we found we needed reproducibility and versioning throughout the process. Now that we have put a few computer vision algorithms into production, we architect our code in the following way.

1) Training code is written with standard packages (we tend to prefer Keras with a TensorFlow backend). Models are trained in bulk on our own GPUs, since this is often the cheapest option. Side note: TPUs/GPUs may one day be the better choice, but they are certainly too expensive currently.

2) Prediction / inference code is written with the same or similar packages, but is tied to the final weights files that we get from the trained model.

3) Deployment code: to get a reliable system, we use Celery for distributed queueing and convert our prediction functions into Celery tasks, which can then be passed to workers that process them and return the result to an API endpoint. This lets us scale our workloads to the throughput we need (depending on request requirements).
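
A rough sketch of points 2 and 3, not our actual code: a predictor pinned to one weights file, and a pool of workers that pick up prediction jobs. To keep it runnable with only the standard library, a `ThreadPoolExecutor` stands in for Celery and its broker. `Predictor`, `predict_task`, `handle_requests`, the weights path, and the toy brightness "model" are all made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Predictor:
    """Inference code pinned to one weights file (hypothetical name and path)."""
    weights_path: str
    threshold: float  # stands in for whatever the weights file would hold

    def predict(self, pixels):
        # Toy "model": mean brightness compared against the threshold.
        score = sum(pixels) / len(pixels)
        return {
            "weights": self.weights_path,
            "label": int(score >= self.threshold),
            "score": score,
        }


# Loaded once per worker process in a real deployment.
predictor = Predictor(weights_path="weights/model_v3.h5", threshold=0.5)


def predict_task(pixels):
    """The unit of work a queue worker runs (a Celery task in the setup above)."""
    return predictor.predict(pixels)


def handle_requests(batch, workers=4):
    """Fan requests out to a worker pool and collect results in request order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(predict_task, pixels) for pixels in batch]
        return [future.result(timeout=10) for future in futures]
```

Calling `handle_requests([[0.9, 0.8], [0.1, 0.2]])` returns one result dict per request, each tagged with the weights file that produced it. With Celery, the pool would be replaced by separate worker processes behind a broker, which is what lets the workers scale independently of the API.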

This architecture lets us test at training time using our training code and validation sets, while also letting us test different model versions through our prediction APIs. We often just write testing scripts that we then run via our CI/CD workflow.
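
For a flavor of what such a testing script might look like, here is a minimal sketch under made-up assumptions: two toy model versions, a four-item validation set, and a hypothetical `min_accuracy` gate. None of these names or numbers come from our real pipeline.

```python
def accuracy(predict_fn, validation_set):
    """Fraction of (input, expected_label) pairs the model gets right."""
    correct = sum(1 for x, expected in validation_set if predict_fn(x) == expected)
    return correct / len(validation_set)


def compare_versions(models, validation_set, min_accuracy):
    """Score every model version and list the ones below the gate."""
    report = {name: accuracy(fn, validation_set) for name, fn in models.items()}
    failures = sorted(name for name, acc in report.items() if acc < min_accuracy)
    return report, failures


# Toy validation data: (pixel values, expected label).
VALIDATION_SET = [
    ([0.9, 0.8], 1),
    ([0.1, 0.2], 0),
    ([0.6, 0.5], 1),
    ([0.4, 0.3], 0),
]

# Toy "versions": the same brightness rule with different thresholds.
MODELS = {
    "v1": lambda px: int(sum(px) / len(px) >= 0.7),
    "v2": lambda px: int(sum(px) / len(px) >= 0.5),
}

report, failures = compare_versions(MODELS, VALIDATION_SET, min_accuracy=0.9)
```

In a CI job, the script would exit with a non-zero status when `failures` is not empty, so a regressed version blocks the pipeline. In practice the lambdas would be calls to the prediction API for each deployed version.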

Tip: keep your code as simple as possible and don't rewrite it unless your throughput requirements mandate it. Good error handling will go a long way here, and it will likely be easier to write in the language you're most familiar with.
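
As one possible reading of "good error handling", here is a small sketch of a wrapper that validates input and turns any model failure into a structured error instead of crashing the worker. `safe_predict` and the response shape are assumptions for illustration only.

```python
import logging

logger = logging.getLogger("inference")


def safe_predict(predict_fn, pixels):
    """Run a prediction and always return a dict the API layer can serialize."""
    # Reject bad input before it reaches the model.
    if not isinstance(pixels, (list, tuple)) or len(pixels) == 0:
        return {"ok": False, "error": "input must be a non-empty list of pixel values"}
    try:
        return {"ok": True, "result": predict_fn(pixels)}
    except Exception as exc:
        # Log the full traceback for debugging, return a short message to the caller.
        logger.exception("prediction failed")
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
```

The caller can then map `ok: False` to an HTTP error code, and the traceback stays in the worker logs.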