Show HN: FiddleCube – Generate Q&A to test your LLM | HN Mirror


Show HN: FiddleCube – Generate Q&A to test your LLM (github.com)

78 points by kaushik92 727 days ago

Convert your vector embeddings into a set of questions and their ideal responses. Use this dataset to test your LLM and catch failures caused by prompt or RAG updates.

Get started in 3 lines of code:

```
pip3 install fiddlecube
```

```
from fiddlecube import FiddleCube

fc = FiddleCube(api_key="<api-key>")
dataset = fc.generate(
    [
        "The cat did not want to be petted.",
        "The cat was not happy with the owner's behavior.",
    ],
    10,
)
dataset
```

Generate your API key: https://dashboard.fiddlecube.ai/api-key

# Ideal QnA datasets for testing, eval and training LLMs

Testing, evaluating or training LLMs requires an ideal QnA dataset, also known as a golden dataset.

This dataset needs to be diverse, covering a wide range of queries with accurate responses.

Creating such a dataset takes significant manual effort.

As the prompt or RAG contexts are updated, which is nearly all the time for early applications, the dataset needs to be updated to match.

# FiddleCube generates ideal QnA from vector embeddings

- The questions cover the entire RAG knowledge corpus.

- Complex reasoning, safety alignment and 5 other question types are generated.

- Filtered for correctness, context relevance and style.

- Auto-updated with prompt and RAG updates.
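As a rough illustration of how a golden dataset can catch failures after a prompt or RAG update, here is a minimal sketch in plain Python. It is not FiddleCube's API: the row format (dicts with `question` and `answer` keys), the `ask` callable, the `token_f1` scoring and the `min_score` threshold are all assumptions made for the example.

```python
from typing import Callable, Dict, List


def token_f1(expected: str, actual: str) -> float:
    """Token-overlap F1 between the expected and the actual answer."""
    exp = expected.lower().split()
    act = actual.lower().split()
    common = sum(min(exp.count(tok), act.count(tok)) for tok in set(exp))
    if common == 0:
        return 0.0
    precision = common / len(act)
    recall = common / len(exp)
    return 2 * precision * recall / (precision + recall)


def find_regressions(
    golden: List[Dict[str, str]],
    ask: Callable[[str], str],
    min_score: float = 0.5,
) -> List[str]:
    """Return the questions whose answers score below min_score."""
    return [
        row["question"]
        for row in golden
        if token_f1(row["answer"], ask(row["question"])) < min_score
    ]


# Hypothetical golden rows; the real dataset format may differ.
golden = [
    {
        "question": "Did the cat want to be petted?",
        "answer": "No, the cat did not want to be petted.",
    },
    {
        "question": "Was the cat happy with the owner?",
        "answer": "No, the cat was not happy with the owner's behavior.",
    },
]


def stub_llm(question: str) -> str:
    """Stand-in for the application under test; one answer is wrong on purpose."""
    if "petted" in question:
        return "No, the cat did not want to be petted."
    return "Yes, the cat loved it."


regressions = find_regressions(golden, stub_llm)
print(regressions)
```

Token overlap is a crude score. In practice an eval tool or an LLM-based judge would likely replace `token_f1`, but the loop around it stays the same.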

7 comments

Loic 726 days ago

For the people wondering, the GitHub repo is only hosting a couple of lines of Python to connect to their API.

If you have your own LLM, you may have sensitive/private data "in" it from your training. You may not be allowed to use this service from a legal point of view.

kaushik92 726 days ago

We do care a lot about data privacy. While the data is sent to an API, we do not store anything on our servers or use our users' data in any way.

We are working on getting SOC2 certified. In the meantime, we sign a legally binding agreement with our users who have data privacy needs/concerns.

mistercow 726 days ago

The bulleted list of what constitutes “ideal” is missing one of the most important types of questions: questions that aren’t answered by the knowledge set, but which seem like they should/might be.

This is where RAG systems consistently fall down. The end user, by definition, doesn’t know what you’ve got in your data. They won’t ask questions carefully cherry-picked from it. They’ll ask questions they need to know the answer to, and more often than you think, those answers won’t be in your data. You absolutely must know how your system behaves when they do that.
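A minimal sketch of checking this behavior, assuming the system is expected to decline when the answer is not in its data. The refusal markers, the `ask` callable and the toy RAG stub are all hypothetical; a real check would probably need a more robust way to detect abstention than substring matching.

```python
from typing import Callable, Iterable, List

# Assumed phrases that signal the system declined to answer.
REFUSAL_MARKERS = ("i don't know", "not in the provided", "cannot find")


def abstains(response: str) -> bool:
    """True if the response contains one of the refusal markers."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def failed_abstentions(
    ask: Callable[[str], str], questions: Iterable[str]
) -> List[str]:
    """Return out-of-corpus questions that the system answered anyway."""
    return [q for q in questions if not abstains(ask(q))]


def toy_rag(question: str) -> str:
    """Stand-in for a RAG system with a one-entry knowledge base."""
    knowledge = {
        "is the cat happy?": "No, the cat is not happy with the owner's behavior.",
    }
    return knowledge.get(
        question.lower(), "I don't know based on the available documents."
    )


# Plausible questions whose answers are not in the toy knowledge base.
out_of_corpus = ["What breed is the cat?", "How old is the owner?"]
failures = failed_abstentions(toy_rag, out_of_corpus)
print(failures)
```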

kaushik92 726 days ago

This is great feedback! Will work on adding this soon.

johnsutor 727 days ago

How does this differ from Ragas? https://docs.ragas.io/en/latest/index.html

kaushik92 727 days ago

Ragas is an eval tool which needs ground truths and queries for evaluation. FiddleCube generates the queries and the ground truth needed for eval in Ragas, LangSmith or an eval tool of choice.

We incorporate user prompts to generate the outputs and provide diagnostics and feedback for improvement, rather than eval metrics. So you can plug in the low-scoring queries reported by Ragas, along with your prompt and context, and FiddleCube can provide the root cause and the ideal response.

This is an alternative to manual auditing and testing, where an auditor works on curating the ideal dataset.

michalf6 726 days ago

Ragas also has a feature to generate ground truths and queries: https://docs.ragas.io/en/latest/getstarted/testset_generatio... Although simply prompting an LLM with chunks of source documents might work better / cheaper - ragas tends to explode with retries in my experience.
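For reference, the "prompt an LLM with chunks" approach can be sketched roughly as below. Only the chunking and prompt construction are shown; the LLM call itself is omitted, and the prompt wording, `chunk_text` and `build_prompts` are illustrative assumptions rather than anything Ragas or FiddleCube does.

```python
from typing import List


def chunk_text(text: str, max_words: int = 50) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Assumed prompt wording; tune it for the model being used.
PROMPT_TEMPLATE = (
    "Using only the passage below, write {n} question-answer pairs.\n"
    'Return one JSON object per line with keys "question" and "answer".\n\n'
    "Passage:\n{chunk}"
)


def build_prompts(
    document: str, n_per_chunk: int = 2, max_words: int = 50
) -> List[str]:
    """Build one generation prompt per chunk of the document."""
    return [
        PROMPT_TEMPLATE.format(n=n_per_chunk, chunk=chunk)
        for chunk in chunk_text(document, max_words)
    ]


document = (
    "The cat did not want to be petted. "
    "The cat was not happy with the owner's behavior."
)
prompts = build_prompts(document, max_words=8)
print(len(prompts))
```

Each prompt would then be sent to the LLM of choice, and the returned pairs still need checking before they are treated as ground truth.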

kaushik92 724 days ago

I have seen this, but I find it fairly hard to use.

Our goal is to focus on datasets and make it very easy to create and manage data.

In our next release, we will be launching a way to do this using a UI.

mikeqq2024 725 days ago

Is this done by calling GPT-4o with the user query, prompt and context to generate a result as the ground truth, plus analysis? If so, what is the value added, except perhaps automation?

neha_n 724 days ago

Hi,

While we call LLMs (internal and external, based on instruction type), the output generated by LLMs can't be taken as ground truth unless we do rigorous evaluations. We have our own metrics for what can be called a ground truth, based on the user's seed information and business logic. Accuracy and precision needs also differ from use case to use case. Function calling adds another layer.

Another value add is the types of instructions that we can generate. We currently expose 7 and are working on exposing more instruction types. The challenge is to create ground truth for the wide variety of cases that a given user can ask about for a business, including guardrailing.

We have built internal tools and agents to solve for those, and are internally discussing the ideal way to expose it to users, and whether it would be beneficial for the community. Any thoughts on that would be appreciated.

Automation took a significant amount of time for us as well, so at scale, even a reliable automated CI/CD pipeline is indeed a value add in itself.

Lmk if I can add more details to answer the question.

kaushik92 724 days ago

We identified and solved 2 key problems with generating data using GPT:

1. Duplicate/similar data points: we solve this by adding deduplication to our pipeline.

2. Incorrect question-answer pairs: we check for correctness and context relevance, and filter out incorrect rows of data.
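To make the deduplication step concrete, here is a minimal sketch using only the Python standard library. It is not FiddleCube's pipeline: the row format, the `normalize` helper and the 0.9 similarity threshold are assumptions, and a production pipeline would more likely compare embeddings than raw strings.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def deduplicate(
    rows: List[Dict[str, str]], threshold: float = 0.9
) -> List[Dict[str, str]]:
    """Keep the first row of each group of near-identical questions."""
    kept: List[Dict[str, str]] = []
    for row in rows:
        question = normalize(row["question"])
        is_new = all(
            SequenceMatcher(None, question, normalize(k["question"])).ratio()
            < threshold
            for k in kept
        )
        if is_new:
            kept.append(row)
    return kept


# Hypothetical generated rows; the second differs from the first only in
# casing and whitespace.
rows = [
    {"question": "Why was the cat unhappy?", "answer": "The owner's behavior."},
    {"question": "why was the cat  unhappy?", "answer": "The owner's behavior."},
    {"question": "What did the cat not want?", "answer": "To be petted."},
]
unique_rows = deduplicate(rows)
print(len(unique_rows))
```

The pairwise comparison is quadratic in the number of rows, which is fine for small test sets but would need an index or clustering step at scale.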

Apart from this, we generate a diverse set of questions including complex reasoning and chain of thought.

We also generate domain specific unsafe questions - questions that violate TnC of the particular LLM to test the model guardrails.

cruxcode 727 days ago

Can it generate HTML as part of the prompt?

neha_n 727 days ago

Can you elaborate on the use case a bit? HTML as a part of the prompt for what kind of use case.

cruxcode 727 days ago

I am scraping some information from a list of company websites. I would like to create an evaluation set for my agent.

praveenkumarnew 727 days ago

Can I plug this into a Ragas pipeline?

neha_n 724 days ago

Yes. The output can be used as ground truth for Ragas and LangChain/LlamaIndex evals.

aditikothari 727 days ago

This is super cool!

arjun9642 727 days ago

I want to hack