by typpo
777 days ago
Paul's benchmarks are excellent and they're the first thing I look for to get a sense of a new model's performance :)

For those looking to create their own benchmarks, promptfoo[0] is one way to do this locally:

  prompts:
    - "Write this in Python 3: {{ask}}"
  providers:
    - ollama:chat:llama3:8b
    - ollama:chat:phi3
    - ollama:chat:qwen:7b
  tests:
    - vars:
        ask: a function to determine if a number is prime
    - vars:
        ask: a function to split a restaurant bill given individual contributions and shared items

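To make the shape of that config concrete, here is a rough Python sketch of the test matrix it describes: every prompt template is filled in with each test's vars and paired with each provider. This is only an illustration, not how promptfoo is implemented; the `render` and `expand` helpers are hypothetical names, and the naive `{{ask}}` string replacement is an assumption standing in for real templating.

```python
from itertools import product

# Values copied from the config above.
prompts = ["Write this in Python 3: {{ask}}"]
providers = [
    "ollama:chat:llama3:8b",
    "ollama:chat:phi3",
    "ollama:chat:qwen:7b",
]
tests = [
    {"vars": {"ask": "a function to determine if a number is prime"}},
    {"vars": {"ask": "a function to split a restaurant bill given "
                     "individual contributions and shared items"}},
]


def render(template, variables):
    """Hypothetical helper: naive {{name}} substitution (an assumption)."""
    for name, value in variables.items():
        template = template.replace("{{" + name + "}}", value)
    return template


def expand(prompts, providers, tests):
    """Hypothetical helper: one (provider, rendered prompt) pair per combination."""
    return [
        (provider, render(prompt, test["vars"]))
        for prompt, provider, test in product(prompts, providers, tests)
    ]


cases = expand(prompts, providers, tests)
for provider, prompt in cases:
    print(f"{provider}: {prompt}")
```

With 1 prompt, 3 providers, and 2 tests, this yields 6 cases, which is the grid you would then compare side by side.
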
Jumping in because I'm a big believer in (1) local LLMs, and (2) evals specific to individual use cases.

[0] https://github.com/typpo/promptfoo