by coder68 302 days ago
I have been working on text classification tasks at work, and I have found that for my particular use-case, LLMs are not performing well at all. I have spent a few thousand dollars trying, and I have tried everything from few-shot to asking simple binary yes/no questions, and I have had mixed success.

I have stopped trying to use LLMs for this project and switched to discriminative models (Logistic Regression with TF-IDF or embeddings), which are both more computationally efficient and more debuggable. I'm not entirely sure why, but for anything with many possible answers, or where there is some subjectivity, I have not had success with LLMs, simply due to inconsistency of responses.

For VERY obvious tasks like: "is this store a restaurant or not?" I have definitely had success, so YMMV.

6 comments

When you say LLMs, do you mean decoder-only models (GPT et al.) or encoder-only models (BERT et al.)?

I've found encoder-only models to be vastly better for anything that doesn't require natural language responses, and most of them are small enough that _pretraining_ a model for each task costs a few hundred dollars.

By LLMs I meant decoder only, e.g. Gemini, Claude, etc. Can you go into more detail on how you're using the encoder models? I'm curious. Typically I have used them for embedding text or for fine-tuning after attaching a classifier head. What are you pre-training on, and for what task?
> how you're using the encoder models?

In my original comment this is what I was referring to: using the embeddings produced by these models, not using something like GPT to classify text (that's wildly inefficient and in my experience gets subpar results).

To answer your question: you simply use the embedding vector as the features in whatever model you're trying to train. I've found this to get significantly better results with far fewer examples than any traditional NLP approach to vector representation.
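
To make "embedding vector as the features" concrete, here is a rough sketch using a nearest-centroid classifier, which I've swapped in only because it fits in a few lines of standard-library Python; any model that accepts a feature vector would slot in the same way. The 3-dimensional vectors and the labels are made-up placeholders standing in for real embeddings.

```python
import math
from collections import defaultdict

def fit_centroids(vectors, labels):
    """Average the embedding vectors of each class into one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def predict(centroids, vec):
    """Return the label whose centroid is most similar to the given vector."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))

# Placeholder "embeddings"; in practice these come from an encoder model.
train_vectors = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
train_labels = ["restaurant", "restaurant", "other", "other"]
centroids = fit_centroids(train_vectors, train_labels)
```

Calling `predict(centroids, vector)` on a new embedding returns the closest class; in a real setup you would likely replace this with logistic regression or similar once the features are in place.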

> What are you pre-training on, and for what task?

My experience has been that you don't need to pretrain at all. The embeddings are more information rich than anything you could attempt to achieve with other vector representations you might come up with using the set of data you have. This might not be true at extreme scales, but for nearly all traditional NLP classification tasks I've found this to be so much easier to implement, and so much better performing, that there's really not a good reason to start with a "simpler" approach.

Ah yes this does make sense. We are definitely in agreement on the point of "wildly inefficient and subpar". I'll try out decoder model embeddings soon, e.g. Qwen/Qwen3-Embedding-8B. I'm working with largish amounts of data (200M records), so I tried to pick a good balance between size:perf:cost, using BAAI/bge-base-en-v1.5 to start (384 dim).
If I have 1,000 labeled examples for a classification task, I’ll expand that into a training dataset using augmentation, and then finetune a small model like RoBERTa. It’s fast, cheap, accurate — and predictable.

Others have had success with SetFit as the training framework and Ettin as the base model.

oh this seems like an interesting idea, what tactics do you use for augmentation? For my own use-case, I think I could reorder semantic chunks, or maybe randomly delete pieces, but curious what tactics you use!

I have also considered training a small language model for synthetic data generation.

Yes, exactly. You want to randomize the parts that are irrelevant. For example, if you're classifying news articles, you may want to shorten them anyway. A human would be able to tell what category an article belongs to without reading the whole thing, so you may use a combination of URL, headline, beginning, middle, and/or end. And if you do that, it's easy to turn one training example into 10 or more. You just vary the length of the individual parts.
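
A minimal sketch of that idea, in case it helps. The function name, the inclusion probabilities, and the `max_words` cap are all my own assumptions for illustration, not tuned values:

```python
import random

def make_variants(url, headline, body, n_variants=10, max_words=40, seed=0):
    """Turn one labeled article into several shorter training examples.

    Each variant randomly keeps or drops the URL, headline, and slices from
    the beginning, middle, and end of the body, with random slice lengths.
    """
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    words = body.split()
    mid = len(words) // 2
    variants = []
    for _ in range(n_variants):
        pieces = []
        if rng.random() < 0.5:
            pieces.append(url)
        if rng.random() < 0.8:
            pieces.append(headline)
        k = rng.randint(1, max_words)
        beginning = words[:k]
        k = rng.randint(1, max_words)
        middle = words[max(0, mid - k // 2): mid + (k + 1) // 2]
        k = rng.randint(1, max_words)
        end = words[-k:]
        for chunk in (beginning, middle, end):
            if rng.random() < 0.7:
                pieces.append(" ".join(chunk))
        if not pieces:  # never emit an empty example
            pieces.append(headline)
        variants.append("\n".join(pieces))
    return variants
```

Every variant inherits the label of the original article, so 1,000 labeled examples become roughly 10,000 training rows.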
It depends on a lot of things, but to add to your possible setups: you can potentially improve results by using simpler systems for first answers and falling back to more complex ones afterwards.

For example:

If contains cafe and not internet/cyber/etc -> restaurant

No -> (tfidf) -> yes, no, unsure

unsure -> embeddings -> yes, no, unsure

unsure -> llm -> yes, no, unsure

unsure -> human queue ->...
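
Roughly, in code, the chain looks like the sketch below. Each stage returns yes, no, or unsure, and the first confident answer wins. The `toy_score` function is a hypothetical stand-in for a trained TF-IDF model's probability, and the 0.2/0.8 thresholds are arbitrary; embedding and LLM stages would be appended to the list in the same way.

```python
from typing import Callable, List, Optional, Tuple

# A stage returns True (yes), False (no), or None (unsure).
Stage = Callable[[str], Optional[bool]]

def keyword_stage(text: str) -> Optional[bool]:
    """Cheap rule: 'cafe' without internet/cyber means restaurant."""
    lowered = text.lower()
    if "cafe" in lowered and not any(w in lowered for w in ("internet", "cyber")):
        return True
    return None  # the rule only ever says yes; everything else falls through

def threshold_stage(score_fn: Callable[[str], float],
                    low: float = 0.2, high: float = 0.8) -> Stage:
    """Wrap a probability-like scorer so mid-range scores become 'unsure'."""
    def stage(text: str) -> Optional[bool]:
        score = score_fn(text)
        if score >= high:
            return True
        if score <= low:
            return False
        return None
    return stage

def toy_score(text: str) -> float:
    """Hypothetical stand-in for a trained model's predicted probability."""
    lowered = text.lower()
    if "pizza" in lowered:
        return 0.9
    if "hardware" in lowered:
        return 0.1
    return 0.5

def classify(text: str, stages: List[Tuple[str, Stage]]) -> Tuple[Optional[bool], str]:
    """Run stages in order; return the first confident verdict and its source."""
    for name, stage in stages:
        verdict = stage(text)
        if verdict is not None:
            return verdict, name
    return None, "human_queue"  # nothing was confident: hand off to a person

stages = [("keywords", keyword_stage), ("tfidf", threshold_stage(toy_score))]
```

Returning the stage name alongside the verdict makes it easy to see which layer answered each record, which is where most of the debuggability comes from.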

I think the idea of backoff by ratcheting up complexity here is a very good idea, thanks for your suggestions.
Happy to help - this is a thing I’ve employed multiple times for real cases.

One big benefit is that it uses the cheapest and most understandable approaches for the majority of cases, and scales up quite nicely. It has a neat place for very custom issues to be fixed too.

There will always be some things that simple approaches think are clear but aren’t, which is awkward but then all pipelines end up with that somewhere.

Edit - you can also deploy things earlier if you start from the beginning of the chain. Moving from one big deploy to iterating on the remaining issues is often a win for deployment alone.

To chime in about where I'm at -- one problem was solved with a statistical classifier, but to bootstrap another, I ended up using keywords. It took a few hours to get a reasonable solution, and it leans more towards precision than recall, but it worked quickly!
At my work, we still prefer to use DistilBERT for text classification. It almost always does well with a little bit of fine-tuning. In very rare cases, we use an LLM/agentic setup when the task involves referring to both images and text at the same time.
I can confirm that DistilBERT has worked well when I have used it for classification, especially on shortish sequences. I'm really interested in trying out ModernBERT, or a smaller variant, due to the larger context window (8192 tokens).
I was thinking of trying ModernBERT for one of my projects. But I can only conclude after seeing the performance for my use case. Do you think ModernBERT will be capable of expanding abbreviated sentences?
Doesn't that mean having to go back to manually labeling examples? That can be a big hurdle compared to just zero-few shotting some stuff into the LLM prompt. Unless there's something I'm misunderstanding about your approach. Or maybe it's possible to do an unsupervised clustering step on the vectors to get the labeled categories that you can then pass to the supervised classification model. Though I guess that would depend on how strictly defined the target categories are for the use case in question.
To some degree manual labeling has to be done anyway, just to validate that any approach works at all. You'll always need ground truth from somewhere. What I suggested is that zero/few-shotting might not be good enough, depending on the problem. Labeling ~1000 samples isn't too bad; I've done it by hand a few times now. If you can source a high-quality positive signal from somewhere (e.g. user-behavioral data), even better.
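
For the validation part, a small sketch of scoring any approach (zero-shot, keywords, a fine-tuned model) against hand-labeled ground truth. The example labels below are made up purely to show the shape of the inputs:

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels given as booleans."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up example: hand labels vs. predictions from some classifier.
hand_labels = [True, True, True, False, False]
predictions = [True, False, False, False, False]
precision, recall = precision_recall(hand_labels, predictions)
```

Here the classifier is never wrong when it says yes but misses two of the three positives, which is the precision-over-recall profile you tend to get from conservative rules.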
Are your categories fixed? If so you could constrain the output using enums in structured outputs.
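
Something like the sketch below is what I have in mind. The category list is hypothetical, and the exact way you pass the schema depends on the provider, so this only shows the schema shape plus a defensive check on the reply:

```python
import json
from typing import Optional

CATEGORIES = ["restaurant", "retail", "other"]  # hypothetical label set

# JSON Schema with an enum, so the only valid replies are the known labels.
LABEL_SCHEMA = {
    "type": "object",
    "properties": {"label": {"type": "string", "enum": CATEGORIES}},
    "required": ["label"],
    "additionalProperties": False,
}

def parse_label(raw: str) -> Optional[str]:
    """Return the label if the reply matches the schema, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != {"label"}:
        return None
    label = data["label"]
    return label if label in CATEGORIES else None
```

This only guarantees the answer is one of the allowed labels; it does nothing for whether the model picks the same label twice for the same input.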

re: inconsistencies in output, OpenAI provides a seed parameter and a system_fingerprint response field to help produce (mostly) deterministic output.

The outputs are working correctly in terms of formatting, but the answers themselves may be inconsistent. I have experimented with varying the prompt and the answers can change dramatically. I could experiment with lowering temperature, but I just don't think generative models were a good fit for the problem. The appeal is the speed of prototyping and no need for training data, but it honestly didn't take much for my problem: one afternoon and ~1000 samples labeled got me to a good baseline.