by threeseed 747 days ago
But we know from Google that unless you can definitively solve the "is this sentence real or a joke" problem, datasets like Twitter, Reddit, etc. are going to be more trouble than they are worth.

And Elon's recent polarising behaviour and the callous way in which he disbanded the Tesla Supercharger team mean that truly talented people aren't going to be as attracted to him as in his early days. They are only going to be there for the money.

3 comments

The datasets should not be used for knowledge but to train a language model.

Using it for knowledge is bonkers.

Why not buy some educational textbook company and use 99.9% correct data? Oh and use RAG while you are at it so you can point to the origin of the information.

The real evolution still has to come, though: we need to build a reasoning engine (Q*?) that uses RAG for knowledge and language models to convert its thoughts into human language.
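
To make the "use RAG so you can point to the origin" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the two-passage `CORPUS`, the source labels, and the word-overlap scoring are stand-ins, and a real system would presumably use embeddings and a proper index. The point is only that each retrieved passage carries its source along with it.

```python
import re

# Hypothetical corpus: passages paired with the source they came from.
CORPUS = [
    {"source": "Physics textbook, ch. 2 (hypothetical)",
     "text": "Force equals mass times acceleration."},
    {"source": "Chemistry textbook, ch. 5 (hypothetical)",
     "text": "Water boils at 100 degrees Celsius at sea level."},
]

def tokens(text):
    """Lowercase word set; a crude stand-in for a real embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, corpus, k=1):
    """Rank passages by word overlap with the question, dropping non-matches."""
    q = tokens(question)
    scored = [(len(q & tokens(p["text"])), p) for p in corpus]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [p for _, p in scored[:k]]

def answer_with_source(question, corpus=CORPUS):
    """Return the best passage together with where it came from."""
    hits = retrieve(question, corpus)
    if not hits:
        return "No supporting passage found."
    top = hits[0]
    return f'{top["text"]} [source: {top["source"]}]'

print(answer_with_source("At what temperature does water boil?"))
```

In this sketch the knowledge lives entirely in the corpus, so a wrong answer can be traced back to a specific passage rather than to the model's weights.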

How does one differentiate knowledge from the language model in an LLM? At least in a way that would provide a benefit?
You use formal verification for logic and RAG for source data.

In other words - say you have a model that is semi-smart, often makes mistakes in logic, but sometimes gives valid answers. You use it to “brainstorm” physical equations and then use formal provers to weed out the correct answer.

Even if the LLM is correct 0.001% of the time, it’s still better than the current algorithms which are essentially brute forcing.
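
A toy sketch of that generate-and-verify loop, under heavy assumptions: `propose` is a hypothetical stand-in for the unreliable model (here just a seeded random guesser), and `verify` is an exact arithmetic check standing in for a formal prover. The task, finding integers a and b so that x**2 + a*x + b has the given roots, is made up for the example.

```python
import random

def propose(rng):
    """Hypothetical unreliable 'model': guesses coefficients a, b at random."""
    return rng.randint(-10, 10), rng.randint(-10, 10)

def verify(a, b, roots):
    """Exact check: x**2 + a*x + b must be zero at every required root."""
    return all(x * x + a * x + b == 0 for x in roots)

def search(roots, max_tries=100_000, seed=0):
    """Keep sampling candidates until one passes verification."""
    rng = random.Random(seed)
    for tries in range(1, max_tries + 1):
        a, b = propose(rng)
        if verify(a, b, roots):
            return (a, b), tries
    return None, max_tries

result, tries = search([2, 3])
print(result, "found after", tries, "candidates")
```

The proposer is wrong almost every time, but because the verifier never accepts a wrong candidate, whatever comes out of the loop is correct. A better proposer only reduces the number of tries.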

I’m still confused as to the value of training on tweets though in that scenario?

If you need to effectively provide this whole secondary dataset to have better answers, what value do the tweets add to training other than perhaps sentiment analysis or response stylization?

I still fondly remember the story an OpenAI rep told about fine-tuning with a company's Slack history. Given a question like "Can you do this and that please?", the system answered (after being fine-tuned on said history) "Sure, I'll do it tomorrow." Teaches you to carefully select your training data.
>Twitter Supercharger team

interesting.