Hacker News
by kaushik92 719 days ago
We identified and solved two key problems with generating data using GPT: 1. Duplicate or near-duplicate data points: we add a deduplication step to our pipeline. 2. Incorrect question-answer pairs: we check each pair for correctness and context relevance, and filter out the rows that fail.
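
To make the two steps concrete, here is a rough sketch of what such a cleanup pass could look like. It is not our actual pipeline: the row fields (`question`, `answer`, `context`), the 0.9 similarity threshold, and the `is_grounded` check are all assumptions made for illustration. A real correctness check would more likely use an LLM judge or embedding similarity than a substring match.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and drop punctuation so trivial variants compare equal."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def is_near_duplicate(a, b, threshold=0.9):
    # threshold is an illustrative value, not one taken from a real pipeline
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def deduplicate(rows, threshold=0.9):
    """Keep the first row of each group of near-identical questions.

    Pairwise comparison is O(n^2); fine for a sketch, slow for large sets.
    """
    kept = []
    for row in rows:
        if not any(
            is_near_duplicate(row["question"], k["question"], threshold)
            for k in kept
        ):
            kept.append(row)
    return kept


def is_grounded(row):
    # Placeholder check (assumption): the answer must appear in the context.
    answer = normalize(row["answer"])
    return bool(answer) and answer in normalize(row["context"])


def clean(rows):
    """Deduplicate first, then drop rows that fail the grounding check."""
    return [r for r in deduplicate(rows) if is_grounded(r)]


rows = [
    {"question": "What is the capital of France?", "answer": "Paris",
     "context": "Paris is the capital of France."},
    {"question": "What is the capital of France ?", "answer": "Paris",
     "context": "Paris is the capital of France."},
    {"question": "Who wrote Hamlet?", "answer": "Marlowe",
     "context": "Hamlet is a tragedy by William Shakespeare."},
]
cleaned = clean(rows)
```

In this toy input the second row is dropped as a near-duplicate of the first, and the third is dropped because its answer is not supported by its context.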

Apart from this, we generate a diverse set of questions, including ones that require complex reasoning and chain of thought.

We also generate domain-specific unsafe questions, i.e. questions that violate the terms and conditions (T&C) of the particular LLM, to test the model's guardrails.