by typpo 777 days ago
Paul's benchmarks are excellent, and they're the first thing I look for to get a sense of a new model's performance :)

For those looking to create their own benchmarks, promptfoo[0] is one way to do this locally:

  prompts:
    - "Write this in Python 3: {{ask}}"
  
  providers:
    - ollama:chat:llama3:8b
    - ollama:chat:phi3
    - ollama:chat:qwen:7b
    
  tests:
    - vars:
        ask: a function to determine if a number is prime
    - vars:
        ask: a function to split a restaurant bill given individual contributions and shared items

Jumping in because I'm a big believer in (1) local LLMs, and (2) evals specific to individual use cases.
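
The config above only collects outputs side by side; to score them automatically, each test case needs a check. A minimal sketch of one for the prime-number case, in plain Python: it pulls the code out of a model response, runs it, and compares against known answers. The names `grade_prime_output`, `extract_code`, and `EXPECTED` are made up for illustration and are not part of promptfoo. It also assumes it is acceptable to exec() model output, which is only safe in a sandbox.

```python
import inspect
import re

# Built without a literal triple backtick so this snippet can itself be fenced.
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"(?:python3?|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)

# Known answers for the "is prime" test case (illustrative sample).
EXPECTED = {0: False, 1: False, 2: True, 3: True, 4: False, 17: True, 25: False, 97: True}


def extract_code(output: str) -> str:
    """Return the first fenced code block, or the raw output if nothing is fenced."""
    match = CODE_BLOCK.search(output)
    return match.group(1) if match else output


def agrees_with_expected(func) -> bool:
    """True if func(n) is truthy exactly for the primes in EXPECTED."""
    try:
        return all(bool(func(n)) == want for n, want in EXPECTED.items())
    except Exception:
        # Wrong signature, runtime error, etc. count as a failure.
        return False


def grade_prime_output(output: str) -> bool:
    """True if the response defines any function that agrees with EXPECTED.

    WARNING: exec() runs untrusted model output; only do this in a sandbox.
    """
    namespace = {}
    try:
        exec(extract_code(output), namespace)
    except Exception:
        # Refusals, prose-only answers, and syntax errors all land here.
        return False
    # Only consider plain Python functions defined by the snippet,
    # which skips imported builtins such as math.sqrt.
    funcs = [v for v in namespace.values() if inspect.isfunction(v)]
    return any(agrees_with_expected(f) for f in funcs)
```

A check like this could be called on each provider's output for the first test case. promptfoo documents assertion types that can call out to custom code, but the exact hook name and signature should be taken from its docs rather than from this sketch.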

[0] https://github.com/typpo/promptfoo