| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by monatis 105 days ago

We kept running into the same exact bottleneck with fine-tuning and evals: You have the source documents, and you have the base model, but you usually don’t have the actual conversations.

If you’re working with internal docs, regulatory text, or technical manuals, there’s plenty of material but zero multi-turn chat logs. And flattening this into standard instruction/response pairs creates models that sound like templates, failing to capture how users actually ask for clarification or push back.

So we open-sourced a small, opinionated library called AfterImage.

It generates synthetic multi-turn conversations grounded in a corpus you provide. The architecture is straightforward: - A simulated user ("Correspondent") with optional persona variation - A simulated assistant ("Respondent") - Both strictly grounded via sampled source material - Outputs directly to JSONL for your SFT (Supervised Fine-Tuning) / eval pipelines

*Why build this?* The narrow bet here is that multi-turn dialogue is its own distinct data problem. There are already great general synthetic data tools (distilabel, synthetic-data-kit). We aren't competing with them. AfterImage prioritizes composable design where generation can be customized with callbacks. For example, you can connect it to various data sources such as local files or Qdrant collections, or you can choose retriever strategies for RAG or aggregation methods for composite evaluation.

*A few honest caveats:* - We don’t have a strong published benchmark yet (semantic similarity only so far). - Quality noticeably degrades/loops as conversations get too long (>5+ turns). Luckily, one-to-three turns is more than enough for most SFT cases.