Hacker News
by azeusCC 556 days ago
I’ve been sticking with PhraseMatcher because it’s simple, fast, and predictable—but your suggestion about using smaller BERT-based models or embeddings like SBERT (sentence-transformers) is intriguing. I’ve avoided LLMs so far because of the computational overhead, but it sounds like even lightweight models can provide significant value.

Out of curiosity, when training models like SBERT or even smaller BERT versions, do you see diminishing returns when working with smaller training sets (e.g., a few thousand annotated job descriptions)? My current dataset isn’t huge yet (10k), so I wonder where that line starts to appear.

I’ll definitely look more into SBERT and segmentation approaches—thanks for sharing those!

1 comment

I have a content-based recommender built on SBERT + SVM. It starts to learn with around 500 examples, and I don't think it benefits from having more than about 10,000.
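
A minimal sketch of what an SBERT + SVM setup of this kind can look like, assuming frozen sentence embeddings fed to a scikit-learn SVM. This is not the commenter's actual code: the function names, the model name "all-MiniLM-L6-v2", and the RBF kernel are all assumptions chosen for illustration.

```python
from typing import Callable, Sequence

import numpy as np
from sklearn.svm import SVC

Embedder = Callable[[Sequence[str]], np.ndarray]


def sbert_embedder(model_name: str = "all-MiniLM-L6-v2") -> Embedder:
    """Return a function mapping texts to SBERT embeddings.

    The model name is an assumption; any sentence-transformers model could be used.
    """
    # Imported lazily so the SVM part can be tried without downloading a model.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda texts: np.asarray(model.encode(list(texts)))


def train_recommender(embed: Embedder, texts: Sequence[str], labels: Sequence[int]) -> SVC:
    """Fit an SVM on top of the embeddings; the embedding model itself is not trained."""
    clf = SVC(kernel="rbf")  # kernel choice is an assumption, not from the thread
    clf.fit(embed(texts), labels)
    return clf


def score(clf: SVC, embed: Embedder, texts: Sequence[str]) -> np.ndarray:
    """Signed distance from the decision boundary; higher means more like label 1."""
    return clf.decision_function(embed(texts))
```

Usage would be along the lines of `embed = sbert_embedder()`, then `clf = train_recommender(embed, texts, labels)` with 0/1 labels, then ranking new documents by `score(clf, embed, new_texts)`. Because only the SVM is fit, retraining is cheap, which is what makes model selection practical here.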

I have also tried fine-tuning BERT models to do the same. It takes at least 30 minutes to train one model (without any of the model selection I do with the scikit-learn-based models), and I never developed a training protocol that reliably did better than my SVM-based model. My impression was that the small BERT models don't have much learning capacity and don't really benefit from 5,000+ documents. That said, really high accuracy isn't possible with my problem: I'm predicting my own fickle judgments, and I feel like I am doing great with an AUC-ROC of 0.78 or so.
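
For reference on reading a number like 0.78: AUC-ROC is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. A small pure-Python sketch of that definition follows; the function name `auc_roc` and the toy data are made up for illustration, and for real datasets a library routine such as scikit-learn's `roc_auc_score` would be the usual choice.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 3 positives, 2 negatives, 5 of the 6 pairs are ordered correctly.
toy_labels = [1, 1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
toy_auc = auc_roc(toy_labels, toy_scores)  # 5/6, roughly 0.83
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect one, so 0.78 means roughly four out of five positive/negative pairs come out in the right order.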

> Do you think SBERT + SVM is a good fit for handling ambiguous or less common phrases, or do you still end up needing some post-processing rules for edge cases?

I haven't tried classifying anything as small as a phrase (assuming you've already extracted it) with SBERT+SVM, so I really don't know.

Another thing to consider is a T5 model. A T5 model maps strings to strings so it can be trained to take an input like

"Extract the skills from this resume: ..."

with an output like

"Excel, Pandas, Python, Cold Fusion, C#, ..."

and it will try to do the same. You'll probably still find it makes some mistakes that drive you up the wall and that need some pre- or post-processing.
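
A rough sketch of the pre- and post-processing side of that string-to-string setup, leaving the T5 model itself out. The helper names `make_example` and `parse_skills` are hypothetical, and the cleanup rules (trim whitespace, drop empty items, de-duplicate case-insensitively) are just one plausible starting point.

```python
from typing import List, Sequence, Tuple

PROMPT_PREFIX = "Extract the skills from this resume: "


def make_example(resume_text: str, skills: Sequence[str]) -> Tuple[str, str]:
    """Build one (input, target) string pair for seq2seq fine-tuning."""
    return PROMPT_PREFIX + resume_text.strip(), ", ".join(skills)


def parse_skills(generated: str) -> List[str]:
    """Turn the model's comma-separated output back into a clean list.

    Keeps the first spelling seen for each skill and skips empty items.
    """
    seen = set()
    skills = []
    for raw in generated.split(","):
        skill = raw.strip()
        if not skill:
            continue
        key = skill.casefold()
        if key in seen:
            continue
        seen.add(key)
        skills.append(skill)
    return skills
```

For instance, `parse_skills("Excel, Pandas, python, Python, , C#")` gives `["Excel", "Pandas", "python", "C#"]`. Anything beyond this, such as mapping variants to a canonical skill name, would be extra rules layered on top.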