In my experience, "single prompt classification" isn't as simple in practice as "type in a sentence and it works". But a few methods can massively improve its consistency and output quality.
I cannot recommend guidance enough. You can use shockingly small Llama models for some tasks with guidance while only actually generating a handful of tokens.
You should strongly consider some form of guidance or logit bias for classification, especially if you have a known set of classes. This ensures you get the output in the format you want, restricted to the classes you want.
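To make the known-set-of-classes idea concrete, here is a minimal sketch of what restricting the output to allowed labels amounts to. It does not use the guidance library or any real model API: `token_logits` is a made-up dictionary standing in for a model's next-token scores, and `constrained_choice` is a hypothetical helper name.

```python
import math

def constrained_choice(token_logits, allowed):
    """Pick the best label among `allowed`, ignoring every other token."""
    # Anything outside the allowed set is effectively given a score of -inf.
    masked = {label: token_logits.get(label, -math.inf) for label in allowed}
    best = max(masked, key=masked.get)
    top = masked[best]
    if top == -math.inf:
        raise ValueError("no allowed label has a score")
    # Softmax over the allowed labels only, so the probabilities sum to 1.
    exps = {label: math.exp(score - top) for label, score in masked.items()}
    total = sum(exps.values())
    return best, {label: e / total for label, e in exps.items()}

# Hypothetical scores: the model "prefers" an invalid answer, "maybe".
logits = {"maybe": 3.1, "spam": 2.0, "ham": 0.5}
label, probs = constrained_choice(logits, ["spam", "ham"])
print(label, probs)
```

Because the invalid token never enters the comparison, the result is always one of the listed classes, which is the property you want from guidance or logit bias.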
Keep in mind that LLMs perform much better with chain-of-thought (CoT) prompting. So have the model explain what the text/image is, then explain the possible classifications, then state its final decision. Again, guidance can ensure it follows the correct format to do this.
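The describe, discuss, decide structure could be sketched as a prompt template plus a parser that only trusts the final line. The field names (`Description`, `Candidates`, `Final`), the class list, and both helper functions are assumptions made up for illustration, not part of any library.

```python
import re

CLASSES = ["spam", "ham"]  # hypothetical label set

PROMPT_TEMPLATE = (
    "Text: {text}\n"
    "Description: <describe what the text is>\n"
    "Candidates: <discuss which of {classes} could apply>\n"
    "Final: <exactly one of {classes}>\n"
)

def build_prompt(text):
    """Fill the template so the model reasons before it commits to a label."""
    return PROMPT_TEMPLATE.format(text=text, classes=", ".join(CLASSES))

def parse_final(completion):
    """Return the label on the 'Final:' line, or None if missing or invalid."""
    match = re.search(r"^Final:\s*(\S+)\s*$", completion, flags=re.MULTILINE)
    if match is None:
        return None
    label = match.group(1).lower()
    return label if label in CLASSES else None

completion = "Description: an ad\nCandidates: spam fits best\nFinal: spam"
print(parse_final(completion))
```

Without constrained decoding the parser can still return `None`, so the caller needs a retry or fallback path; enforcing the format at generation time removes that case.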
LLMs still benefit massively from fine-tuning, especially if you want to classify into a particular format: notebook tags vs. SFW/NSFW vs. important subjects, etc. Existing alignment can sometimes interfere with some of these classifications too, which fine-tuning helps smooth out.
Yeah totally agree. We've found that a ton of OpenAI usage in practice is a variant of either classification or information extraction. This makes sense -- going from a human-native form of information (free text) to a computer-native form of information (structured data) is a key component of many pipelines!
Of course, GPT-4 is insanely expensive to use at scale, and still isn't a perfect classifier. So the next step is to take the outputs you get from GPT-4 and use them to fine-tune a smaller model that's really fast and good at your specific problem. In my experience, even without using any human annotations or online learning, a model fine-tuned just on GPT-4 outputs can actually outperform GPT-4 as a classifier! This seems really counterintuitive at first, but my guess is what's happening is that the training process is a kind of regularization, so the weird mistakes GPT-4 occasionally makes are overwhelmed in the training data by all the times when GPT-4 gets it right.
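As a toy illustration of that workflow (not OpenPipe's tooling and not a real fine-tune), the sketch below trains a tiny naive Bayes "student" on labels that stand in for teacher outputs. The examples are invented, one label is deliberately wrong, and `train` and `predict` are hypothetical helper names.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, label) pairs, labels produced by the teacher."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict(model, text):
    """Naive Bayes with add-one smoothing over the training vocabulary."""
    word_counts, label_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best, best_score = None, -math.inf
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

# Stand-in for teacher outputs; the last label is a deliberate teacher mistake.
teacher_labeled = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch tomorrow at noon", "ham"),
    ("free prize click now", "ham"),
]
student = train(teacher_labeled)
print(predict(student, "free prize now"))
```

In this toy case the single wrong label is outvoted by the consistent ones, which is the intuition behind the regularization guess above; it is not evidence that the effect holds in general.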
As a disclaimer, we're building open source tooling to ease the transition from prompt to cheaper fine-tuned model at my company OpenPipe.
Approximately nobody cares about the TOS of large language models sold for money that were trained by copying the content of everyone else with 0 compensation.
Between open source modeling tools being incredible, transfer learning allowing dirt cheap fine-tuning and now mega-models being able to instantly give you a "mostly right" data set, the cost of creating ML features has dropped to almost nothing.
Products that took quarters or years and required big budgets for labeling, ML specialists, GPUs, etc. just a few years ago can now be done in an hour or so for free (if you are scrappy). I imagine this is going to lead to a ton of great ML features that weren't worth funding in the past but are very valuable in aggregate. Similar to the mid-2000s, when the cost of web development came down enough that there was a lot more experimentation and fun to be had.
Agreed! I think it's going to be hard for software developers to adjust to the "data science/engineering" mindset of monitoring and iterating on the long tail of maintenance. A lot of teams already have this issue with deterministic code running in production. I think there's a big opportunity to help purely software teams to learn and adjust.
Honestly, this was one of the first things that excited me about ChatGPT. I'm really eager to see a high-performance inference engine that can keep up with my log data.
Being able to teach an AI assistant to look for specific (but not too specific) things with just a prompt would be incredibly helpful.
Another thing you can do with LLMs that I think is pretty interesting is use them to train a cheaper and faster model. Then use the faster model in your application.
We’re doing this pretty successfully to identify products from massive text content, and even more importantly, we then perform a second pass and let the models categorise the identified products, and then do a third pass to build a category hierarchy. This gets us a full product taxonomy with practically no sweat. It’s amazing, really.
Nice one. Have you thought of stripping the text of words which do not contribute much to the meaning of a sentence? This way you could squeeze the context window even more.
I have written some stories myself with the help of GPT, and I will try to parse them with your method. It is very interesting.
As a side note, GPT is definitely not a toy. I use it for coding, it is great! I use it to write command line apps, which do some simple data manipulation, some more complex than others, but in the order of hundreds of lines of code. They work flawlessly, without me writing even a single line of code.
I'm actually looking forward to this because the result is going to be hilarious-- the culmination of literally every slippery-slope argument ever as the models reinforce their own biases over time.
You can use Particlesy to create a custom GPT-4 bot trained just to classify CSV row data and integrate systems.
One interesting use case we see is a SaaS company using our REST API to access a Particle with custom instructions just for integration with other systems. They provide a CSV row, and the GPT-4 model classifies and maps the columns onto their key columns. In effect, they are able to integrate with almost any system in their vertical with an out-of-the-box integration. Admittedly, it is more expensive, but it is great for the initial trial phase, and the costs can be passed on to the customer. https://www.particlesy.com
I'm waiting for a well-optimized LLM-based system built into a local editor like Obsidian, so I can ask it to scan my entire local Documents folder and have it supercharge my reading/writing locally.
We've found the same: we see a lot of usage through our LLM Categorization endpoint. The toughest problem was actually constraining the model to output only valid categories and not hallucinate new ones, and to return exactly one category for single-classification (or several if that's the mode).
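One way to enforce that after the fact is to validate the raw model output against the allowed set and the requested mode. This is only a sketch under assumptions: the category names, the comma-separated output format, and the `parse_categories` helper are invented here and are not the endpoint's actual implementation.

```python
VALID_CATEGORIES = {"billing", "shipping", "returns"}  # hypothetical set

def parse_categories(raw, multi=False):
    """Validate a comma-separated model answer against VALID_CATEGORIES."""
    parts = [p.strip().lower() for p in raw.split(",") if p.strip()]
    unknown = [p for p in parts if p not in VALID_CATEGORIES]
    if unknown:
        # The model invented a category; reject instead of passing it on.
        raise ValueError(f"unknown categories: {unknown}")
    unique = list(dict.fromkeys(parts))  # drop repeats, keep order
    if not unique:
        raise ValueError("no category returned")
    if not multi and len(unique) != 1:
        raise ValueError("expected exactly one category")
    return unique

print(parse_categories("Billing"))
print(parse_categories("billing, returns", multi=True))
```

A rejected answer can then trigger a retry or a fallback label, so hallucinated categories never reach downstream code.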
With everyone talking about LLMs being glorified autocomplete, I actually would like to see how well they perform as autocomplete. Because most built-in ones are pretty bad.
The difference there is that you can probably look at (or design a test for) that hodgepodge of regexes and understand the range of outputs.
You can prompt GPT-4 and get something that looks plausible for a few test cases with very little effort, but can you get any guarantees that it will behave reasonably for most inputs? And if you can, will those guarantees last as the model is updated underneath you?
I would be very worried that the LLM would say something medically wrong, and we'd get sued for a lot of money. It seems to me that a better approach is to use the LLM to generate a lot of training data and then test your handwritten super-regex against it.
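A rough sketch of that setup: the deterministic regex is what ships, and the LLM-labeled examples are used only as a test set. The pattern, the labels, and the four example sentences are invented for illustration.

```python
import re

# Hypothetical handwritten rule; this is the part that runs in production.
PATTERN = re.compile(r"\b(chest pain|shortness of breath)\b", re.IGNORECASE)

def regex_classify(text):
    return "urgent" if PATTERN.search(text) else "routine"

def evaluate(labeled):
    """Compare the regex against LLM-provided labels and list disagreements."""
    mismatches = [
        (text, expected, regex_classify(text))
        for text, expected in labeled
        if regex_classify(text) != expected
    ]
    accuracy = 1 - len(mismatches) / len(labeled)
    return accuracy, mismatches

# Stand-in for examples that an LLM generated and labeled.
llm_labeled = [
    ("Patient reports chest pain since morning", "urgent"),
    ("Sudden shortness of breath after walking", "urgent"),
    ("Patient can't breathe well at night", "urgent"),
    ("Requesting a routine prescription refill", "routine"),
]
accuracy, mismatches = evaluate(llm_labeled)
print(accuracy, mismatches)
```

Each mismatch is either a gap in the regex or a bad LLM label, and a human can review both kinds before anything changes in production.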
Yes. You can also do retrieval on your set of classified examples so the prompt only contains the most similar ones, and as your set grows, the prompt becomes more useful. (Even if this is still not as good as fine-tuning would be.) Note that if you are willing to pay for even more accuracy, you can also run multiple different prompts and ensemble their answers.
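Both ideas can be sketched in a few lines. Word-overlap (Jaccard) similarity is used here purely as a stand-in for whatever retrieval you would really use (embeddings, most likely), and the labeled examples and helper names are made up.

```python
from collections import Counter

def jaccard(a, b):
    """Word-overlap similarity; a simple stand-in for embedding similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def top_k_examples(query, labeled, k=2):
    """Return the k labeled examples most similar to the query."""
    return sorted(labeled, key=lambda ex: jaccard(query, ex[0]), reverse=True)[:k]

def build_prompt(query, labeled, k=2):
    """Few-shot prompt containing only the retrieved examples."""
    shots = "\n".join(
        f"Text: {text}\nLabel: {label}"
        for text, label in top_k_examples(query, labeled, k)
    )
    return f"{shots}\nText: {query}\nLabel:"

def majority_vote(labels):
    """Ensemble step: combine the answers from several different prompts."""
    return Counter(labels).most_common(1)[0][0]

labeled = [
    ("refund my order", "billing"),
    ("where is my package", "shipping"),
    ("package arrived broken", "returns"),
    ("charge on my card", "billing"),
]
print(build_prompt("my package is late", labeled, k=1))
print(majority_vote(["shipping", "shipping", "billing"]))
```

As the labeled set grows, only the retrieval index grows; the prompt stays the same size because it always carries just the `k` nearest examples.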