Hacker News new | ask | show | jobs
by rkwz 924 days ago
> I was particularly interested in testing models’ ability to reason (i.e., perform a somewhat complex task that requires high context understanding) about out-of-distribution (i.e., unseen) data.

I was under the assumption that finetuneing LLMs was useful only when you need to change the model's tone (speak like a pirate, voldemort etc).

Are there other examples where LLMs were trained to reason a particular way?

6 comments

Check our Orca. IIRC, it's a technique that aims to encode additional logical capabilities into smaller models by having larger models generate step-by-step solutions to various problems. This doesn't just make them speak more like GPT-4/3.5, but is supposedly making them think more like it as well.
You can get a standard LLM to change tone just by giving it a system prompt/instruction to follow a certain tone.

The only issue there is that sometimes the RLHF seeps through, which can be solved by system prompting even harder.

> I was under the assumption that finetuneing LLMs was useful only when you need to change the model's tone (speak like a pirate, voldemort etc).

A lot of why I tried this out was to test the limits of this belief, you see a lot of talk like this out there and it sounded like nonsense to me.

Finetuning is fundamentally not much different than continued pretraining; if you feed the model high-quality and high-volume data I think it's reasonable to expect it to acquire new skills

In order to speak like a pirate, it has to be able to reason :) I've done some fine tunes as well similar to the MTG example, in mine I was fine tuning it to speak JSON and reason about some input- and yes, you can indeed get these models to perform on novel tasks.
Finetuning is a useful workaround for cases when the context size is unsuitable for the task at hand. Anybody knows whether it was ever considered to finetune an LLM on the Linux kernel sources' history and its associated mailing lists?
Aren't a lot of base models fine-tuned with (Q)Lora on instruct-based datasets with good results? I thought this was a very common practice?