| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by Grimblewald 793 days ago
	Even llama3 has its issues. Ive been quite impressed so far but if the context gets a little long it freaks out, gets stuck repeating the same token or just fails to finish an answer. This is for the full f16 8B model, so it cant be put down to quantization. It also doesnt quite handle complex instructions as well as the benchmarks would imply should.

2 comments

andai 793 days ago

Supposedly LLMs (especially smaller ones) are best suited to tasks where the answer is in the text, i.e. summarization, translation, and answering questions.

Asking it to answer questions on its own is much more prone to hallucination.

To that end I've been using Llama 3 for summarizing transcripts of YouTube videos. It does a decent job, but... every single time (literally 100% of the time), it will hallucinate a random name for the speaker.* Every time! I thought it might be the system prompt, but there isn't one.

My own prompt is just "{text}\n\n###\n\nPlease summarize the text above."

If I ask it to summarize in bullet points, it doesn't do that.

I'm assuming there was something in the (instruct) training data that strongly encourages that, i.e. a format of summaries beginning with the author's name? Seems sensible enough, but obviously backfires when there's literally no data and it just makes something up...

*In videos where the speaker's name isn't in the transcript. If it's a popular field, it will often come up with something plausible (e.g. Andrew Ng for an AI talk.) If it's something more obscure, it'll dream up something completely random.

link

kiratp 793 days ago

The technique to use is to give the model an “out” for the missing/negative case.

"{text}\n\n###\n\nPlease summarize the text above. The text is a video transcript. It may not have the names of the speakers in it. If you need to refer to an unnamed speaker, call them Speaker_1, Speaker_2 and so on."

link

woodson 793 days ago

Especially for small models I had very bad results for use in translation. Even trying all kinds of tricks didn’t help (apparently prompting in the target language helps for some). Encoder-decoder models such as FLAN-T5 or MADLAD-400 seemed far superior at equal or even smaller model size.

link

andai 793 days ago

I forget which model (LLaMA 3?) but I heard 95% of the training data was English.

link

Grimblewald 792 days ago

for sure, so my use case for example is

"using the following documentation to guide you {api documentation}, edit this code {relevant code}, with the following objective: Replace uses of {old API calls} in {some function} with with relevant functions from the supplied documentation"

It mostly works, but if the context is a little to long, sometimes it will just spam the same umlaut or number (always umlaut's or numbers) over and over for example. Perhaps some fine-tuning of parameters like temp. or repetition penalty might fix it, time will tell.

link

andai 792 days ago

Are you using ollama? Devs said there was a bug that occurs when context is full, they're working on it.

link

Grimblewald 780 days ago

That would do it, I am indeed.

link

behohippy 793 days ago

I had this same issue with incomplete answers on longer summarization tasks. If you ask it to "go on" it will produce a better completion, but I haven't seen this behaviour in any other model.

link

Grimblewald 792 days ago

Neither, still, the answers it does provide - despite a few hiccups - are truly outstanding. I am really impressed with this model, even with its issues. Though, I am sure the issues such that they are, are a month or two away from being fixed. For what its worth, I haven't played as much with the bigger model, but it seems to not struggle with the same, though take that with a grain of salt, it runs too slow on my hardware for me to rapidly test things.

link