| HN Mirror



	by noosphr 302 days ago
	When you say LLMs, do you mean decoder-only models (GPT et al.) or encoder-only models (BERT et al.)? I've found encoder-only models to be vastly better for anything that doesn't require natural-language responses, and most of them are small enough that _pretraining_ a model for each task costs a few hundred dollars.


coder68 302 days ago

By LLMs I meant decoder-only models such as Gemini and Claude. Can you go into more detail on how you're using the encoder models? I'm curious. Typically I have used them for embedding text, or for fine-tuning after attaching a classifier head. What are you pre-training on, and for what task?


roadside_picnic 302 days ago

> how you're using the encoder models?

In my original comment this is what I was referring to: using the embeddings produced by these models, not using something like GPT to classify text (that's wildly inefficient and in my experience gets subpar results).

To answer your question: you simply use the embedding vector as the features in whatever model you're trying to train. I've found this gets significantly better results with far fewer examples than any traditional NLP approach to vector representation.
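A minimal sketch of what "embedding vector as the features" can look like, under loud assumptions: `embed` below is a hypothetical stand-in (a hashed bag of tokens) so the snippet runs offline with only the standard library, and the tiny logistic regression is just one possible downstream model. In practice you would replace `embed` with the output of a real encoder model and probably use an off-the-shelf classifier. The example texts and labels are made up.

```python
import math
import zlib

DIM = 256  # toy vector size, chosen only for this sketch


def embed(text):
    """HYPOTHETICAL stand-in for an encoder model's embedding call.

    Hashes each token into a fixed-size vector and L2-normalises it so the
    sketch runs offline. Swap in real model embeddings in practice.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logreg(X, y, lr=0.5, epochs=200):
    """Plain SGD logistic regression that takes embedding vectors as features."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            grad = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * grad * xj for wj, xj in zip(w, xi)]
            b -= lr * grad
    return w, b


def predict(w, b, text):
    """Embed the text, then apply the learned linear decision rule."""
    x = embed(text)
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


# Made-up labelled examples (1 = positive, 0 = negative).
texts = [
    "great product loved it",
    "terrible quality broke fast",
    "excellent service really happy",
    "awful support painfully slow",
]
labels = [1, 0, 1, 0]

w, b = train_logreg([embed(t) for t in texts], labels)
train_preds = [predict(w, b, t) for t in texts]
print(train_preds)
```

The point of the design is that the classifier never sees raw text: all task-specific training happens on top of fixed vectors, so the downstream model can stay small and cheap.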

> What are you pre-training on, and for what task?

My experience has been that you don't need to pretrain at all. The embeddings are more information-rich than any other vector representation you could build from the data you have. This might not be true at extreme scales, but for nearly all traditional NLP classification tasks I've found this so much easier to implement, and so much better performing, that there's really no good reason to start with a "simpler" approach.


coder68 302 days ago

Ah yes, this does make sense. We are definitely in agreement on the point of "wildly inefficient and subpar". I'll try out decoder model embeddings soon, e.g. Qwen/Qwen3-Embedding-8B. I'm working with largish amounts of data (200M records), so I tried to pick a good balance between size:perf:cost, using BAAI/bge-base-en-v1.5 to start (768 dim).
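For a rough sense of the size side of that tradeoff, here is a back-of-the-envelope sketch of raw vector storage at 200M records. It assumes uncompressed float32 values (4 bytes each) and ignores index overhead; the dimensions listed are illustrative values, and `raw_vector_storage_gb` is a hypothetical helper name.

```python
def raw_vector_storage_gb(n_vectors, dim, bytes_per_value=4):
    """Raw storage in GB (10^9 bytes) for n_vectors dense vectors of size dim.

    Assumes uncompressed values (float32 by default) and no index overhead.
    """
    return n_vectors * dim * bytes_per_value / 1e9


N_RECORDS = 200_000_000  # record count mentioned in the thread

# Illustrative embedding dimensions, not measurements of any specific setup.
for dim in (384, 768, 4096):
    print(f"{dim:>5} dims: {raw_vector_storage_gb(N_RECORDS, dim):,.1f} GB")
```

Storage scales linearly with dimension, so at this record count the choice of embedding model directly sets how much disk or memory the vectors need before any quantisation or compression.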
