| We recently ran into something frustrating while training and fine-tuning open-weight TTS models. Instead of working on the model itself, we spent days dealing with:
- CUDA version mismatches
- Driver / PyTorch conflicts
- OOM crashes when scaling to multi-GPU
- Broken or outdated open-source training scripts
- Gluing together tracking + eval + deployment manually

It felt like we were rebuilding the same orchestration layer every team probably rebuilds.
- Cloud providers give raw GPUs.
- MLOps tools give experiment tracking.
- Open-source gives training scripts.

But the end-to-end workflow (dataset → fine-tune → monitor → evaluate → deploy → retrain) still feels stitched together.

We’re exploring building an opinionated platform that lets you:

1. Select a base model (e.g. Llama/Mistral-style open models)
2. Upload or connect datasets
3. Choose an infra tier
4. Launch LoRA or full fine-tuning
5. Monitor loss + cost in real time
6. Run built-in eval
7. Deploy with one click

Basically: abstract away the CUDA + orchestration layer.

Before we go too deep, I’d love honest feedback:
- Is this still a painful problem at your company?
- Would serious AI teams use this, or do larger companies just build infra in-house?
- Is this doomed to be a hobbyist tool?
- Where would the real wedge be: training, evaluation, or continuous retraining?

We’ve launched a simple landing page and started building, but we’re still early and trying to validate whether this is a real infra gap or just our own frustration. Would appreciate blunt feedback. |
This shouldn't take days, and CC can already set up all of this at whatever level of rigor you need.
Your business will get replaced with a prompt.