

by gardnr 1023 days ago

Two important takeaways on the base model:

* scored 18.9 on HumanEval (coding) where Llama2 7B scored 12.2

* was trained from the beginning with a 16k context using a modified RoPE, whereas many models are simply fine-tuned using RoPE to gain longer context windows after the base model has been trained at 4k.

Can anyone share ideas on how important the second one is? Do LLMs benefit from large context windows using RoPE during pretraining?
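For anyone unfamiliar with RoPE, here is a minimal sketch of the standard formulation, not the modified variant mentioned above (the details of that modification are not given here). It assumes the usual base of 10000 and rotates consecutive pairs of dimensions; the function names `rope` and `dot` and the sample vectors are purely illustrative. The point it shows is that the query-key score depends only on the relative offset between positions, which is why context length interacts with how the rotation frequencies are chosen.

```python
import math


def rope(vec, pos, base=10000.0):
    """Apply standard rotary position embedding to one vector at position `pos`.

    Assumption: plain RoPE with pairwise rotation and base 10000.
    """
    d = len(vec)
    assert d % 2 == 0, "RoPE rotates dimensions in pairs"
    out = []
    for i in range(0, d, 2):
        # Each pair rotates at its own frequency: base^(-i/d) for i = 0, 2, 4, ...
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Illustrative query and key vectors (made up for this example).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3) at very different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 9005), rope(k, 9002))
```

Here `near` and `far` agree up to floating-point error, since only the offset between the two positions enters the score.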

3 comments

sbierwagen 1023 days ago

phi-1 supposedly does 50.6 on HumanEval with 1.3B parameters. (Python only) https://arxiv.org/abs/2306.11644

Weights haven't been released, though.


euclaise 1022 days ago

phi-1 is a code-specific base model, with further finetuning on top of that. This is a general language base model, not really comparable.


imjonse 1023 days ago

No code or dataset for phi-1 either.


swyx 1022 days ago

It's not so much about benefit as it is a design goal to want large context windows.

https://twitter.com/suchenzang/status/1699926157028897078?s=... notes some issues with directly comparing the 16k context number. The odd choice of tokenizer means it's effectively like a 10-12k model (? ballpark, not calculated)


euclaise 1021 days ago

That tweet had it backwards: more tokens in the tokenizer's vocabulary means that the 16k-token context window typically allows for even longer passages than if LLaMA were 16k.


craigacp 1022 days ago

There's a correction to that tweet: a larger vocab means fewer tokens for any given sequence (usually, assuming the extra vocab isn't there to add other languages or character sets).
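A toy illustration of that point, assuming a greedy longest-match tokenizer with a single-character fallback (real tokenizers such as BPE work differently, and both vocabularies here are made up): adding longer entries to the vocabulary cuts the number of tokens needed for the same text, so a fixed token budget covers more text.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization; falls back to single characters.

    A simplified stand-in for a real subword tokenizer, for illustration only.
    """
    max_len = max(len(t) for t in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in vocab:
                tokens.append(piece)
                i += n
                break
    return tokens


# Hypothetical vocabularies: the large one is a superset of the small one.
small_vocab = {"th", "e ", "in"}
large_vocab = small_vocab | {"the ", "context ", "window "}

text = "the context window in the model"
small_count = len(tokenize(text, small_vocab))
large_count = len(tokenize(text, large_vocab))

print(f"small vocab: {small_count} tokens, large vocab: {large_count} tokens")
```

The larger vocabulary yields the lower count for this text, which is the direction of the correction: at 16k tokens, the model with the bigger vocab usually sees more characters, not fewer.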


coder543 1022 days ago

> scored 18.9 on HumanEval (coding) where Llama2 7B scored 12.2

The article claims 18.9 for the base model, but also claims 20.7 for the fine-tuned model.
