Excellent, it appears Amazon has introduced two important things here:

- A RoPE theta of 100,000, likely from the Llama 2 Long paper, which found that a large theta helped regulate attention between distant tokens [0]
- A 16k (effective 32k) context window, improving upon Mistral's 4k (effective 8k) context window

In the Llama 2 Long paper, they saw improvement on short-context benchmarks as a result of long-context fine-tuning.

I can't find any of the expected MMLU / HellaSwag / etc. benchmarks yet. Benchmarks haven't been submitted to MTEB yet.

Some users anecdotally seem to be having trouble generating quality responses [2][3][4]. I can't find any examples of users getting good results from the model outside of using exact examples from the documentation.

[0] https://arxiv.org/pdf/2309.16039.pdf

[2] https://old.reddit.com/r/LocalLLaMA/comments/17jd00g/mistral...

[3] https://old.reddit.com/r/LocalLLaMA/comments/17kzlbl/anyone_...

[4] https://old.reddit.com/r/LocalLLaMA/comments/17b0n8t/llama_2...
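
For anyone wondering why the theta value matters, here is a rough sketch of my own (not taken from the paper or from the model's code). In standard RoPE, frequency pair i rotates by position * theta^(-2i/d), so a larger theta stretches the wavelengths and the rotation angle between distant tokens changes more slowly. The helper name `rope_wavelengths` is made up for this example, head_dim=128 is an assumption for illustration, and 10,000 is the commonly used default base that I'm comparing against.

```python
import math

def rope_wavelengths(theta: float, head_dim: int = 128) -> list[float]:
    """Wavelength, in tokens, of each rotary frequency pair for a base theta.

    Illustrative helper only; head_dim=128 is an assumed value.
    """
    # Standard RoPE: pair i rotates by angle = position * theta ** (-2i / head_dim),
    # so one full rotation takes 2*pi * theta ** (2i / head_dim) tokens.
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]

# Compare the common default base with the larger value mentioned above.
for theta in (10_000.0, 100_000.0):
    longest = rope_wavelengths(theta)[-1]
    print(f"theta={theta:>9,.0f}  longest wavelength ~ {longest:,.0f} tokens")
```

The fastest pair has the same wavelength regardless of theta. Only the slower pairs get stretched, which is where long-range position information lives.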