| Disclosure: I work on Google Cloud (but my advice isn’t to come to us).
Sorry to hear that. I’m sure it’s super stressful, and I hope you pull through. If you can, I’d suggest giving a little more information about your costs / workload to get more help. But, in case you only see yet another guess, mine is below.
If your growth has accelerated and is yielding massive costs, I assume that means you’re doing inference to serve your models. As suggested by others, there are a few great options if you haven’t tried them already:
- Try spot instances: while you’ll get preempted, you do get a couple of minutes to shut down (so for model serving, you just stop accepting requests, finish the ones you’re handling and exit). This is worth a 60-90% reduction in compute cost.
- If you aren’t using T4 instances, try them: they’re probably the best price/performance for GPU inference. If you’re using a V100, by comparison, that’s up to 5-10x more expensive.
- However, your models should be taking advantage of int8 if possible. This alone may let you pack more requests per part. (Another 2x+)
- You could try model pruning. This is perhaps the most delicate option, but look at things like how people compress models for mobile. It has a similar-ish effect of packing more weights into smaller GPUs, or alternatively you can use a much simpler model (fewer weights and fewer connections also often mean a lot fewer flops).
- But just as much: why do you need a GPU for your models? (Usually it’s to serve a large-ish / expensive model quickly enough.) If the alternative is going out of business, try CPU inference, again on spot instances (like the c5 series). Vectorized inference isn’t bad at all!
If instead this is all about training / the volume of your input data: sample it, change your batch sizes, just don’t re-train, whatever you’ve gotta do.
Remember, your users / customers won’t somehow be happier when you’re out of business in a month. Making all requests suddenly take 3x as long on a CPU, or sometimes fail, is better than “always fail, we had to shut down the company”. They’ll understand! |
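To make the spot-instance point concrete, here is a rough sketch of the "stop accepting requests, finish the ones you're handling and exit" pattern in plain Python. It assumes the shutdown notice reaches your process as SIGTERM, which depends on your provider and on how the service is launched. `DrainingServer` and `fake_inference` are made-up names for illustration, not part of any serving framework.

```python
import signal
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class DrainingServer:
    """Toy request handler that refuses new work once shutdown is requested."""

    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._draining = threading.Event()
        self._futures = []

    def request_shutdown(self, signum=None, frame=None):
        # Signature matches what signal.signal() expects of a handler.
        self._draining.set()

    def submit(self, fn, *args):
        # Refuse new requests while draining so the caller can retry elsewhere.
        if self._draining.is_set():
            return None
        future = self._pool.submit(fn, *args)
        self._futures.append(future)
        return future

    def drain(self, timeout_s):
        # Wait for in-flight requests, bounded by the preemption grace period.
        deadline = time.monotonic() + timeout_s
        results = []
        for future in self._futures:
            remaining = max(0.0, deadline - time.monotonic())
            results.append(future.result(timeout=remaining))
        self._pool.shutdown(wait=False)
        return results


def fake_inference(x):
    # Hypothetical stand-in for a real model call.
    time.sleep(0.05)
    return x * 2


if __name__ == "__main__":
    server = DrainingServer()
    # Assumption: the preemption notice arrives as SIGTERM.
    signal.signal(signal.SIGTERM, server.request_shutdown)
    for i in range(3):
        server.submit(fake_inference, i)
    # Simulate the notice by calling the handler directly.
    server.request_shutdown()
    print("accepted after notice:", server.submit(fake_inference, 99) is not None)
    print("drained results:", server.drain(timeout_s=30.0))
```

The `timeout_s` you pass to `drain` should sit comfortably inside whatever grace period your provider actually gives you, so the process exits on its own terms instead of being killed mid-request.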
"Vectorized inference isn’t bad at all!" This, so much. I stopped using GPUs. I was blinded by GPU speed, but TensorFlow builds with AVX optimizations are actually pretty fast.
My findings:
+ Stopped using expensive GPUs for inference and switched to AVX-optimized TensorFlow builds.
+ Cleaned up the inference pipeline and reduced complexity.
+ Committing to a compute instance for a year or more gets you a discount.
- I never got pruning to work without a significant loss increase.
- Tried the cheaper spot instances with GPUs. Between the random kills and new instances taking too long to spin up and load my code, I couldn't keep the service up reliably, even though the discount is large. Users were getting more timeouts. I bailed and just used CPU inference. The GPU was being underutilized anyway, and going CPU-only only increased inference time to around 2-3 seconds. Given the price trade-off, it was the simpler, cheaper and easier solution.
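For anyone weighing the same trade-off, a back-of-the-envelope cost-per-request calculation makes the "underutilized GPU vs. slower CPU" comparison easier to reason about. This is only a sketch: the function name is made up, every price, latency and utilization figure below is a hypothetical placeholder rather than my real numbers, and it assumes each instance handles one request at a time.

```python
def cost_per_1k_requests(hourly_price_usd, seconds_per_request, utilization):
    """Cost of serving 1,000 requests on one instance, one request at a time.

    utilization is the fraction of each hour the instance spends doing
    useful inference work (0 < utilization <= 1).
    """
    if seconds_per_request <= 0:
        raise ValueError("seconds_per_request must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    requests_per_hour = 3600.0 / seconds_per_request * utilization
    return 1000.0 * hourly_price_usd / requests_per_hour


if __name__ == "__main__":
    # All figures are hypothetical placeholders; plug in your own bill.
    gpu = cost_per_1k_requests(hourly_price_usd=2.50, seconds_per_request=0.5, utilization=0.10)
    cpu = cost_per_1k_requests(hourly_price_usd=0.20, seconds_per_request=2.5, utilization=0.60)
    print(f"GPU: ${gpu:.2f} per 1k requests")
    print(f"CPU: ${cpu:.2f} per 1k requests")
```

The point is that a fast GPU sitting mostly idle can cost more per request than a slow CPU that stays busy, so measure your own utilization before assuming the GPU is worth it.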