Hacker News
by tra3 1229 days ago
This is fascinating.

Can I train it on 5 years of stream of consciousness morning brain dumps and then say "write blah as me"?

Before I do that, I'd love to know if training data becomes part of the global knowledge base available to everyone.

3 comments

This is not a fine-tuning example. It's an embedding search example. You use the embeddings to search for relevant knowledge base chunks and then include them in the prompt, which goes to the original model, not a model that you have trained further.
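
To make the flow concrete, here is a minimal sketch of embedding search feeding a prompt. Everything in it is an assumption for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, and `KB`, `top_chunks` and `build_prompt` are made-up names, not part of any OpenAI API.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks, k=2):
    """Paste the retrieved chunks into the prompt sent to the unmodified model."""
    context = "\n".join("- " + c for c in top_chunks(question, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical knowledge base chunks.
KB = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping to Canada takes 5 to 7 business days.",
]

prompt = build_prompt("How long does shipping to Canada take?", KB, k=1)
print(prompt)
```

The model itself is never retrained here; only the prompt changes from one question to the next.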

This is popular because it's much much easier to do effectively than fine-tuning, and the OpenAI model is very capable of integrating kb snippets into a response. What I have heard is that it's easy to overdo fine-tuning with OpenAI's model, and that fine-tuning makes more sense when you want a different format of response rather than just pulling in some content.

Having said all of that, they do have a fine-tuning endpoint and I am guessing if you find the right parameters and give it a lot of properly formatted training data then it will be able to do an okay job. I have the impression it is not easy to do either of those things quite right though.
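
As a rough idea of what "properly formatted training data" could look like, here is a sketch that writes prompt/completion pairs as JSONL, one JSON object per line. The `prompt` and `completion` keys and the `to_jsonl` helper are assumptions for illustration; check the current fine-tuning documentation for the exact schema it expects.

```python
import json


def to_jsonl(pairs):
    """Serialize (prompt, completion) pairs as one JSON object per line."""
    lines = []
    for prompt, completion in pairs:
        # Key names are an assumption; the real endpoint may want another schema.
        record = {"prompt": prompt.strip(), "completion": completion.strip()}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


# Hypothetical examples: a topic, and a passage written in the target voice.
EXAMPLES = [
    ("Write about mornings as me.", "Coffee first, then the brain dump."),
    ("Write about deadlines as me.", "They arrive early and leave late."),
]

jsonl_text = to_jsonl(EXAMPLES)
print(jsonl_text, end="")
```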

As far as privacy goes, no, they will not share your data when you use the API. ChatGPT is different: they ARE using the inputs to train the model.

> Having said all of that, they do have a fine-tuning endpoint and I am guessing if you find the right parameters and give it a lot of properly formatted training data then it will be able to do an okay job.

Unfortunately, the fine-tuning API cannot be used to add knowledge to the model. It only helps condition the model to a certain response pattern using the knowledge it already has.

Would you be able to do something similar with non-text data (e.g. tabular data)? For example, could you give it a bunch of Excel files and ask it to give you the total units sold for an e-commerce site?

What I think would make sense would be to do it in two stages, and embeddings probably aren't really what you want. You would want to parse the Excel files into a certain data structure, and quite possibly text completions could help with that.

Put that in a database, or in some files that you could use Python data science tools with.

Then use text completions to translate the natural language query into some short Python program or SQL query etc.
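
A minimal sketch of those stages, under stated assumptions: `ROWS` stands in for whatever a spreadsheet parser would produce, the `sales` table layout is made up, and `question_to_sql` is a placeholder that returns a canned query where the text-completion call would go.

```python
import sqlite3

# Hypothetical rows, standing in for data parsed out of the Excel files.
ROWS = [
    ("2023-01-05", "widget", 3),
    ("2023-01-06", "gadget", 2),
    ("2023-01-07", "widget", 5),
]


def load_rows(rows):
    """Stage 1: put the parsed rows into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (sold_on TEXT, product TEXT, units INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return conn


def question_to_sql(question):
    """Stage 2 placeholder: a real version would ask a model to write the SQL."""
    return "SELECT SUM(units) FROM sales"


def answer(conn, question):
    """Translate the question to SQL, then run it against the database."""
    sql = question_to_sql(question)
    # Crude guard: generated SQL is untrusted, so only run SELECT statements.
    if not sql.lstrip().upper().startswith("SELECT"):
        raise ValueError("refusing to run a non-SELECT statement")
    return conn.execute(sql).fetchone()[0]


conn = load_rows(ROWS)
total = answer(conn, "What were the total units sold?")
print(total)
```

The arithmetic is done by the database rather than by the model, which is the point of translating the question into a query first.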

There are already data-focused tools that use OpenAI's newest models for this. Search for 'ChatGPT/GPT/OpenAI' data query, SQL, datatable, etc. See #api-projects on the OpenAI Discord, I have seen one or two like that there.

thanks for the response. unfortunately, the discord is full :sad:

These privacy considerations are highest-priority for any extended roll-out of LLM-based products.

Privacy on the side of model servers would be good. Open source models that can be run locally would be better.

I personally think anything server-side is unacceptable. Only open source and local will fly.

I have considered training a model on about a year’s conversations from my little community’s Discord server and asking it to synthesise sentences as if I were writing them.