Show HN: TuFT – Open-source multi-tenant, Tinker-compatible fine-tuning platform

1 points by ekzhu 115 days ago

We've been building TuFT (Tenant-unified FineTuning), an open-source platform that lets multiple users fine-tune LLMs on shared GPU infrastructure through a unified API. It's MIT licensed.

*The problem we're solving:* If you have a team or org where multiple people need to fine-tune models, the typical setup is everyone gets their own GPU allocation and manages their own training stack. That's expensive and wasteful — GPUs sit idle between runs, and everyone is reinventing the same wheel.

TuFT provides a single server that manages base models, LoRA adapters, and checkpoint storage, so multiple users can share the same GPU(s) and run training and sampling jobs through a clean API.
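To make the sharing idea concrete, here is a toy sketch of why LoRA fits multi-tenancy well: the base weights are loaded once, and each user only contributes a small low-rank pair (A, B) applied on top. This is a conceptual illustration in plain Python, not TuFT's actual code; `SharedBase`, `register`, and `forward` are hypothetical names invented for this example.

```python
# Conceptual sketch only: NOT TuFT's implementation. All names are hypothetical.
from dataclasses import dataclass, field


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


@dataclass
class SharedBase:
    weight: list  # d_out x d_in base weights, loaded once for every tenant
    adapters: dict = field(default_factory=dict)  # tenant -> (A, B)

    def register(self, tenant, A, B):
        # A is r x d_in and B is d_out x r, with r much smaller than d_in/d_out.
        self.adapters[tenant] = (A, B)

    def forward(self, tenant, x):
        # x is a d_in x 1 column vector. Every tenant reuses the same base weights.
        out = matmul(self.weight, x)
        if tenant in self.adapters:
            A, B = self.adapters[tenant]
            # Tenant-specific low-rank update: W x + B (A x)
            out = add(out, matmul(B, matmul(A, x)))
        return out


base = SharedBase(weight=[[1.0, 0.0], [0.0, 1.0]])
base.register("alice", A=[[1.0, 0.0]], B=[[0.0], [1.0]])
x = [[1.0], [2.0]]
alice_out = base.forward("alice", x)  # base output plus alice's adapter
bob_out = base.forward("bob", x)      # no adapter registered: base output only
```

Only the adapter pair differs per user, so adding a tenant costs a few small matrices rather than another copy of the model.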

*Why Tinker compatibility matters:* We expose a native Tinker-compatible API, so if you're already using the Tinker SDK for fine-tuning, you can point it at a TuFT server and it just works — no code changes needed. This was a deliberate choice to lower the adoption barrier.

*What works today:*

- Single-machine setup with multi-GPU support
- LoRA fine-tuning (SFT and RL with GRPO-style training)
- Sampling/inference from fine-tuned models
- Checkpoint management (save/restore training state and sampler weights)
- Redis-based persistence for crash recovery
- OpenTelemetry integration for observability
- One-line install script, Docker image, or pip install
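For readers unfamiliar with "GRPO-style" training: the core idea is that several completions are sampled for the same prompt, and each one's reward is scored relative to the rest of its group instead of against a learned value function. A minimal sketch of that step follows, assuming the common mean/standard-deviation normalization. The function name and the `eps` parameter are illustrative choices, not TuFT's or Tinker's API.

```python
# Illustrative sketch of group-relative advantages; not taken from TuFT's code.
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sampled completions within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every completion got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, two rewarded and two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Completions that beat their group average get a positive advantage and the others a negative one, so the signal depends only on relative quality within the group.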

You can get it running with:

```
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/agentscope-ai/tuft/main/sc...)"
```

*Where we want to go (and where we'd love feedback):*

Our roadmap focuses on post-training for agentic models — the RL training loop where rollouts involve reasoning, multi-turn conversations, and tool use. Near-term priorities:

- Multi-machine distributed training (FSDP, DeepSpeed, etc.)
- Cloud-native deployment on AWS/GCP/Azure/Kubernetes
- Serverless GPU runtime with better multi-tenant resource sharing
- Longer term: standardized interfaces with agent training environments (WebShop, BrowserEnv, etc.) and automated training pipelines

*What we'd like to hear from you:*

- Does the multi-tenant framing match a real pain point you've experienced?
- If you've done RL-based fine-tuning for agents, what were the biggest infrastructure headaches?
- Are there integration points or features that would make this useful for your workflow?

We're early and actively iterating, so honest feedback — including "this doesn't solve my problem because X" — is exactly what we need.

Docs: https://agentscope-ai.github.io/TuFT

Discord: https://discord.gg/BCNCaQGxBH