Show HN: Klarity – OS tool to debug LLM reasoning patterns with entropy analysis (github.com)
3 points by mrciffa 492 days ago
After struggling to understand why our reasoning models would sometimes produce flawless reasoning and sometimes go completely off track, we updated Klarity to give instant insights into reasoning uncertainty and concrete suggestions for dataset and prompt optimization. Just point it at your model to save testing time.

Key new features:

- Identify where your model's reasoning goes off track with step-by-step entropy analysis
- Get actionable scores for coherence and confidence at each reasoning step
- Training data insights: Identify which reasoning data lead to high-quality outputs
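
For anyone wondering what "step-by-step entropy" means in practice, here is a minimal sketch, not Klarity's actual implementation. It assumes you already have raw next-token logits grouped by reasoning step, and it averages the Shannon entropy of each token's distribution within a step. The function names and the averaging choice are illustrative assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy, in nats, of one next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def step_entropies(steps: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Mean token entropy per reasoning step.

    `steps` is a hypothetical layout: one entry per reasoning step, each
    holding one logit vector per generated token. Empty steps score 0.0.
    """
    result = []
    for step in steps:
        if not step:
            result.append(0.0)
            continue
        result.append(sum(token_entropy(t) for t in step) / len(step))
    return result


# Toy logits: the first step is peaked (confident), the second is flat (uncertain).
toy_steps = [
    [[10.0, 0.0, 0.0], [9.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]],
]
scores = step_entropies(toy_steps)
```

A step whose mean entropy is much higher than its neighbours is a reasonable place to start looking for where the reasoning drifted.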

Structured JSON output with step-by-step analysis:

- steps: array of {step_number, content, entropy_score, semantic_score, top_tokens[]}
- quality_metrics: array of {step, coherence, relevance, confidence}
- reasoning_insights: array of {step, type, pattern, suggestions[]}
- training_targets: array of {aspect, current_issue, improvement}
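
A rough sketch of how the output could be consumed downstream, using only the field names listed above. The sample values are made up for illustration, and the value types (floats for scores, strings in `top_tokens`, `step` matching `step_number`) are assumptions rather than documented behavior.

```python
import json

# Made-up payload following the field names above; all values are illustrative.
sample = """
{
  "steps": [
    {"step_number": 1, "content": "Restate the problem.",
     "entropy_score": 0.12, "semantic_score": 0.91, "top_tokens": ["The", "We"]},
    {"step_number": 2, "content": "Guess an intermediate value.",
     "entropy_score": 0.87, "semantic_score": 0.40, "top_tokens": ["Maybe", "So"]}
  ],
  "quality_metrics": [
    {"step": 1, "coherence": 0.9, "relevance": 0.9, "confidence": 0.8},
    {"step": 2, "coherence": 0.5, "relevance": 0.6, "confidence": 0.3}
  ],
  "reasoning_insights": [],
  "training_targets": []
}
"""

report = json.loads(sample)


def highest_entropy_step(report: dict) -> dict:
    """Return the step entry with the largest entropy_score."""
    return max(report["steps"], key=lambda s: s["entropy_score"])


def metrics_for_step(report: dict, step_number: int) -> dict:
    """Look up quality metrics for a step, assuming `step` mirrors `step_number`."""
    for entry in report["quality_metrics"]:
        if entry["step"] == step_number:
            return entry
    return {}


worst = highest_entropy_step(report)
worst_metrics = metrics_for_step(report, worst["step_number"])
```

Joining the most uncertain step with its quality metrics like this gives a quick shortlist of steps worth reading by hand.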

Example use cases:

- Debug your model's reasoning edge cases
- Identify which types of reasoning steps contribute to better outcomes
- Optimize your RL datasets by focusing on high-quality reasoning patterns

Currently supports Hugging Face transformers and the Together AI API. We tested the library with the DeepSeek R1 distilled series (Qwen-1.5B, Qwen-7B, etc.).

Installation: `pip install git+https://github.com/klara-research/klarity.git`

We are building OS interpretability/explainability tools to debug the behavior of generative models. What insights would actually help you debug these black-box systems?

Links:

- Repo: https://github.com/klara-research/klarity
- Our website: https://klaralabs.com
- Discord: https://discord.gg/wCnTRzBE

2 comments

I'm curious: how does Klarity handle cases where reasoning errors are not just due to poor training data but also due to inherent limitations in the model architecture or prompt design? Are there specific suggestions for addressing those types of issues, or is the focus mainly on dataset optimization?
We currently give broad suggestions using an insight model that can be chosen during setup. We will try to update and improve the suggestion prompt/code to make the suggestions more granular in new releases.
How does Klarity scale with more complex models or larger datasets? Does it maintain the same level of insight and actionable suggestions as the model grows in size and complexity? Great release btw.
It should work with any type of model. Obviously, longer chains of thought will be harder for the evaluation model to analyse, because it will have many more reasoning steps to identify and separate. The quality of the outcome depends a lot on the model you choose to give you insights. We tested with Llama3-70B and it worked smoothly most of the time.