Have you tried bigger models? Llama-65B can indeed compete with GPT-3 according to various benchmarks. The next thing would be to get the fine-tuning as good as OpenAI's.
I wonder how accurate those benchmarks are in terms of actual problem-solving capability. I think there's a major threshold past which an LLM becomes genuinely useful: it feels like you are speaking to something intelligent, and it actually helps with your productivity.