We get this question a lot! We work hand-in-hand with observability tools like Langfuse.
Langfuse is great for debugging technical issues on individual traces, like timing conditions that resulted in failed API calls.
Voker focuses on product, business, and user outcomes - like which unexpected intents users bring to your agent. We're built for the whole product team, whereas Langfuse focuses on engineers specifically.
One way to think about it would be: a PM notices in Voker that a new intent category is coming up frequently and the agent isn't handling it well. The PM can dig into the data with visualizations or our conversation reconstructions. Once they confirm it's a real issue worth addressing, they can link their investigation to the AI engineer - who can use Voker AND Langfuse to debug and implement a fix/improvement.
Maybe as a comment: you really put a lot of weight on intent classification. I am not sure why. For it to work, you are gonna need my expert domain input. And given that, I feel like the classification bit is basically solved. I wonder a bit why this is the feature you seem to put front and center (e.g. screenshots).
tl;dr: LangSmith + homegrown intents doesn't scale with contributors and agent usage as an analytics solution. Voker adds trend and usage insights on collaborative dashboards that work for the whole AI product team.
Nice, sounds like you've set up your own solution in house. We definitely see some teams do that, and for some it works perfectly. For others, it's too expensive to maintain: they get requests for new dashboards or different subcuts of data from product or design teams, or they run into an issue like far too many generated intents to be useful, and it's not worth the tradeoff of investing time in building internal tooling. But for some it makes sense to roll your own! It also really depends on how many people on the team are involved in building the agent products, and how much volume your agents have. If you have millions of conversations a month with thousands of unique intents, you have to set up data engineering pipelines just to process, categorize, and store all that data in a way that's usable for the whole team.
When it comes to LangSmith, we hear about it a lot from our customers. Pretty much all of them love it as an observability tool, but most say that only the engineers have access or spend time in it, and they've told us the strength of LangSmith is its technical tracing, not its visualizations, UI, or usability. They've told us any "insights" are very canned (because that's not LangSmith's key focus).
We add self-serve analytics - like how Google Analytics lets marketers see how their website is performing without needing to ask engineers to write SQL queries on CloudWatch logs.
Ex: A PM can self-serve: look at trends in what users are asking of agents, notice a problem, do a quick RCA, and check for reproducibility across other sessions - before deciding to assign it as an issue to an engineer. The old way would be: a PM hears a complaint from a customer, asks the engineer to "look into it", and the engineer spends 4 hours combing through LangSmith logs to hunt down one session without even knowing if it's actually a widespread issue.
Do you have experience as PMs? Looking at the website, it looks like you just use LLMs to guess what the categories are? Seems like a trap for garbage in, garbage out. Otherwise you would need someone technical to figure out how to set up the proper KPI monitoring things...
We do! We have combined experience as PMs, ML engineers, and data scientists across many verticals. We also have experience helping PM and AI engineering teams build agents across more than 100 customers from our first product.
You're totally right: the analytics annotation primitives we detect (intents, corrections, resolutions) are the cornerstone of all the other analysis in our platform. It's critical that we get those right, or all the data and insights in the world are useless.
LLMs are a core part of that detection, but we also do things like hierarchical classification (https://voker.ai/blog/hierarchical-text-classification-with-...) and will eventually add other ML methods where applicable. On top of our automated detections, we're building ways for the annotations to improve and adapt to your specific agent product, your data, and your feedback on our annotations.
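For anyone unfamiliar with the term, here is a toy sketch of hierarchical classification in general: pick a coarse category first, then choose a sub-intent only among that category's children. This is not Voker's actual pipeline. The taxonomy, the keywords, and the `classify` function are all made up for illustration, and a real system would likely replace the keyword scorer with an LLM or a trained model at each level.

```python
from typing import Dict, List, Tuple

# Hypothetical two-level taxonomy: category -> sub-intent -> trigger keywords.
# All names and keywords here are invented for the example.
TAXONOMY: Dict[str, Dict[str, List[str]]] = {
    "billing": {
        "refund_request": ["refund", "money back"],
        "invoice_question": ["invoice", "receipt"],
    },
    "account": {
        "password_reset": ["password", "reset"],
        "delete_account": ["delete", "close my account"],
    },
}


def _score(text: str, keywords: List[str]) -> int:
    """Count how many keywords appear in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for kw in keywords if kw in lowered)


def classify(text: str) -> Tuple[str, str]:
    """Return (category, sub_intent), or ("unknown", "unknown") if nothing matches."""
    # Stage 1: score each top-level category by the total hits of its children.
    cat_scores = {
        cat: sum(_score(text, kws) for kws in subs.values())
        for cat, subs in TAXONOMY.items()
    }
    best_cat = max(cat_scores, key=cat_scores.get)
    if cat_scores[best_cat] == 0:
        return ("unknown", "unknown")

    # Stage 2: only sub-intents under the chosen category are considered.
    sub_scores = {sub: _score(text, kws) for sub, kws in TAXONOMY[best_cat].items()}
    best_sub = max(sub_scores, key=sub_scores.get)
    return (best_cat, best_sub)


if __name__ == "__main__":
    print(classify("I want a refund for last month"))
    print(classify("How do I reset my password?"))
```

The point of the two stages is that the second decision is made over a handful of sibling intents instead of the full flat list, which is one common way to keep thousands of intents manageable.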
Our SDK is architected to eventually accept any type of event you want to send as additional information, like add-to-cart events or other conversion metrics that are valuable for analyzing agent performance.
You're definitely right: we don't expect a PM to instrument this all themselves. As with web analytics or product analytics tools, the engineering team instruments and maintains the integration, and then our app makes the insights and data accessible to the whole product team, not just the engineer.