by jeremy_k 1155 days ago
I can't say I'm very well versed in all of this, but I was asking my coworkers today whether embeddings were the way forward or whether doing your own training would be more beneficial. Or better yet, could you take an open source model and train it specifically on just your content? Would that yield better results?

Expanding context seems like an approach, but if you're trying to get an answer about your company's documentation, why would you need the entirety of GPT-X?

2 comments

Every time I've asked this question the answer has been that injecting relevant content into the prompt provides much better results than attempting to fine-tune a model on your own content.

Here's a relevant quote: https://simonwillison.net/2023/Apr/15/ted-sanders-openai/
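
For anyone wondering what "injecting relevant content into the prompt" looks like in practice, here's a rough sketch. The function name `build_prompt`, the `max_chars` budget, and the prompt wording are all made up for illustration; a real setup would count tokens rather than characters and would get `passages` from some search step.

```python
def build_prompt(question, passages, max_chars=2000):
    """Assemble a prompt that places retrieved passages ahead of the question.

    `passages` is assumed to be sorted most-relevant first. `max_chars` is a
    crude, hypothetical stand-in for the model's context limit.
    """
    selected = []
    used = 0
    for passage in passages:
        # Stop adding notes once the next one would exceed the budget.
        if used + len(passage) > max_chars:
            break
        selected.append(passage)
        used += len(passage)
    context = "\n\n".join(selected)
    return (
        "Answer using only the notes below.\n\n"
        f"Notes:\n{context}\n\n"
        f"Question: {question}\n"
    )


prompt = build_prompt(
    "How do I reset my password?",
    ["Passwords are reset from the account page.", "x" * 5000],
)
```

The model never gets retrained here; it just gets the "open notes" alongside the question, and anything that doesn't fit the budget is left out.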

Thanks for that. The analogy of taking a test with open notes makes a lot of sense.

Given that knowledge, as an end user it seems I would want to spend my time ensuring that the embedding data being selected is as good as possible.

The broad general training of GPT-X (and fine-tuning on your content) provides context and (loosely speaking, at least) “analytical” ability; search via embeddings, which injects material into the prompt, provides exact recall of specific material, with capacity greater than the context limit.

Analogous, more or less, to a human with general experience (base training), experience with your code base (fine-tuning), and the ability to reference the current code base directly (embedding-based search/recall). All three have a role; they are complementary rather than mutually exclusive.
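
To make the embedding-based search/recall part concrete, here's a minimal sketch of the ranking step. Big assumption: `toy_embed` is a bag-of-words stand-in I made up so the example runs without any dependencies; a real system would call an actual embedding model there. `cosine` and `top_k` are likewise hypothetical helper names, not any library's API.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, documents, k=2):
    """Return the k documents whose vectors are closest to the query's."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(d)), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]


docs = [
    "reset your password from the account page",
    "invoices are emailed monthly",
    "the api rate limit is 100 requests per minute",
]
best = top_k("how do i reset my password", docs, k=1)
```

The document store can be far larger than the context limit because only the top few matches ever reach the prompt.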

Thanks for the explanation. Do you think that because GPT-X will likely have more base training than an open source model someone attempts to train themselves, the outcomes may end up being better if say the fine tuning and embedding were the same for both options?