| This post reminds me of a component of a platform I'm currently building. The real problem is not only finding good questions that do not discriminate against individual models, but also having a large enough sample (e.g. more than just 60) to get meaningful results. And even if you have both, the quality of responses drifts over time. I'm the founder of Pulze.ai, a B2B SaaS dynamic LLM automation platform for developers who are building AI functionality into their software. We aim to simplify LLM integration so developers can focus on their core products instead of diving deep into AI specifics. We've built a scoring system for leading models and benchmark them continually. Our platform uses these benchmarks to determine the most suitable LLM for a specific request. To demonstrate this, our playground has a compare feature that lets users share conversations with LLMs, both publicly and privately. As the context changes, we select different models to respond. These shared conversations can be forked and extended. Our API layer isn't restricted to these requests; it covers the essentials for building a successful LLM application. For instance, our logging feature lets users rate responses, which will soon let them fine-tune models and create personalized LLMs. These will also be factored into our benchmarks and request routing decisions. On the comment about LLM benchmarks, I completely agree. Traditional benchmarks or LLM tricks, like acing a particular test, may not be robust indicators, since the test could have been part of the LLM's training set. The real challenge is evaluating an LLM without leaking the test set, which means keeping the questions deliberately opaque. Trust issues indeed! Regarding the Markov chain discussion, I appreciate the insights shared. 
At Pulze, we recognize the complexity of LLMs: their foundation may resemble Markov chains, but the scale and depth at which they operate are far greater. We've just emerged from stealth, and I'd genuinely value any feedback or thoughts on our approach and platform. Thanks for taking the time! |
The playground and an account are free.
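
On the sample-size point above, here is a rough sketch of why 60 questions is thin. It uses the textbook normal-approximation interval for a pass rate; the 70% pass rate and the function name are assumptions for illustration, not anything from our benchmarks. With 60 questions and a 70% pass rate, the 95% margin comes out at roughly 12 percentage points either way, so two models a few points apart are indistinguishable.

```python
import math


def accuracy_margin(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation (Wald) interval for a pass rate.

    p_hat: observed fraction of questions answered correctly (assumed value)
    n:     number of benchmark questions
    z:     z-score for the confidence level (1.96 is about 95%)
    """
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


if __name__ == "__main__":
    # Hypothetical 70% pass rate at a few benchmark sizes.
    for n in (60, 600, 6000):
        margin = accuracy_margin(0.7, n)
        print(f"n={n}: +/- {margin * 100:.1f} percentage points")
```

The margin shrinks with the square root of the sample size, so ten times as many questions only narrows it by about a factor of three.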