|
Summary from https://arxiv.org/pdf/2309.16609.pdf
---
(q: how does one format lists on HN?)
* qwen-{1.8B,7B,14B}:
* 3 trillion tokens; start with BPE tiktoken, cl100k base vocab, augmented with Chinese, numbers split into digits, final vocab 152k.
* RoPE - rotary positional embedding
* context length 2048
* qwen-14b perf percentages: 66.3 MMLU(5), 72.1 CEval(5), 61.3 GSM8K(8), 24.8 MATH(4), 32.3 HumanEval(0), 40.8 MBPP(3), 53.4 BBH(3); beats LLaMA2-13B on all, but behind LLaMA2-70B on all except CEval, MATH and HumanEval (somewhat surprising)
* code-qwen-{7B,14B}
* additional 90B code tokens over base
* context length 8192, flash attention
* 14B perf: humaneval 66.4, mbpp 52.4; ok, but not stellar (similar numbers to OSS wizardcoder-py, and lower than gpt-3.5)
* math-qwen-{7B,14B}-chat
* math instructional dataset
* context length 1024
* 14B perf: gsm8k 69.8, MATH 24.2, Math401 85.0, Math23K 78.4; substantially better than OSS in the same weight class (WizardMath and GAIRMath-Abel) on MATH but same ballpark on GSM8k -- surprising. Math23K is Chinese grade school math, and Math401 tests arithmetic ability.
* comprehensive automatic evaluation in Appendix A.2.1 pg 36 (based on OpenCompass'23)
* chat format:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
Hello! How can I assist you today?<|im_end|> |
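
For anyone who wants to build prompts in the chat format above by hand, here is a minimal sketch in Python. The function name `format_chatml` and the `add_generation_prompt` flag are my own (hypothetical, not from the paper), and the single newline between turns is an assumption read off the example. With a real tokenizer, `<|im_start|>` and `<|im_end|>` would need to be encoded as special tokens rather than plain text.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render [{"role": ..., "content": ...}, ...] into the chat format shown above.

    Hypothetical helper: the layout is copied from the example in this comment,
    not taken from any official implementation.
    """
    # One block per turn: <|im_start|>role, newline, content, <|im_end|>
    turns = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    text = "\n".join(turns)
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        text += "\n<|im_start|>assistant\n"
    return text


if __name__ == "__main__":
    example = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]
    print(format_chatml(example))
```

Passing `add_generation_prompt=False` with a full conversation (including the assistant reply) should reproduce the example text as written, which is the shape you would want for fine-tuning data rather than inference.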