
  "google/gemini-2.5-pro-preview-03-25" => 67.65
  "anthropic/claude-3.7-sonnet:thinking" => 66.76
  "anthropic/claude-3.7-sonnet" => 66.23
  "deepseek/deepseek-r1:free" => 54.38
  "google/gemini-2.0-flash-001" => 52.03
  "openai/o3-mini" => 47.82
  "qwen/qwen2.5-32b-instruct" => 44.78
  "meta-llama/llama-4-maverick:free" => 42.87
  "openrouter/quasar-alpha" => 40.27
  "openai/chatgpt-4o-latest" => 37.94
  "meta-llama/llama-3.3-70b-instruct:free" => 34.40

https://gist.github.com/fpgaminer/8782dd205216ea2afcd3dda29d...

That's the model automation. To evaluate the prompts it suggests, I use a 128-example sample of my dataset. For this particular run, all I cared about was optimizing a prompt for Llama 3.1 that would get it to write responses like those I'm finetuning for. That way the finetuning has a better starting point.

So to evaluate how effective a given prompt is, I go through each example, run <user>prompt</user><assistant>responses</assistant> (in the proper format, of course) through Llama 3.1, and measure the NLL (negative log-likelihood) on the assistant portion. I then use a simple linear formula to convert the NLL to a score between 0 and 100, scaled based on typical NLL values. It should _probably_ be a non-linear formula, but I'm lazy.
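A minimal sketch of that scoring step, for illustration only. It assumes you already have per-token log-probabilities from the model plus a mask marking which tokens belong to the assistant portion. The function names, the use of a per-token mean, and the `nll_best` / `nll_worst` bounds are all hypothetical; the actual formula and the "typical NLL values" it is scaled to aren't given above.

```python
from typing import Sequence


def assistant_nll(token_logprobs: Sequence[float], is_assistant: Sequence[bool]) -> float:
    """Mean negative log-likelihood over the assistant tokens only.

    Assumption: the NLL is averaged per token; a sum would work the same way.
    """
    if len(token_logprobs) != len(is_assistant):
        raise ValueError("token_logprobs and is_assistant must have the same length")
    picked = [lp for lp, keep in zip(token_logprobs, is_assistant) if keep]
    if not picked:
        raise ValueError("no assistant tokens to score")
    return -sum(picked) / len(picked)


def nll_to_score(nll: float, nll_best: float = 1.0, nll_worst: float = 3.0) -> float:
    """Linearly map an NLL to a 0-100 score, clamped at both ends.

    nll_best and nll_worst are hypothetical stand-ins for "typical NLL values":
    an NLL at or below nll_best scores 100, at or above nll_worst scores 0.
    """
    frac = (nll_worst - nll) / (nll_worst - nll_best)
    return 100.0 * min(1.0, max(0.0, frac))


def prompt_score(per_example_nlls: Sequence[float]) -> float:
    """Score a prompt from the NLLs of all examples (assumed: average, then convert)."""
    mean_nll = sum(per_example_nlls) / len(per_example_nlls)
    return nll_to_score(mean_nll)
```

Lower NLL means the model finds the target responses more likely under that prompt, so the mapping is inverted: the score goes up as the NLL goes down.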

Another approach to prompt optimization is to give the model something like:

  I have some texts along with their corresponding scores. The texts are arranged in ascending order based on their scores from worst (low score) to best (higher score).
  
  Text: {text0}
  Score: {score0}
  Text: {text1}
  Score: {score1}
  ...
  
  Thoroughly read all of the texts and their corresponding scores.
  Analyze the texts and their scores to understand what leads to a high score. Don't just look for literal patterns of words/tokens. Extensively research the data until you understand the underlying mechanisms that lead to high scores. The underlying, internal relationships. Much like how an LLM is able to predict the token not just from the literal text but also by understanding very complex relationships of the "tokens" between the tokens.
  Take all of the texts into consideration, not just the best.
  Solidify your understanding of how to optimize for a high score.
  Demonstrate your deep and complete understanding by writing a new text that maximizes the score and is better than all of the provided texts.
  Ideally the new text should be under 20 words.

Or some variation thereof. That's the "one off" approach where you don't keep a conversation with the model and instead just call it again with the updated scores. Supposedly that's "better" since the texts are in ascending order, letting the model easily track improvements, but I've had far better luck with the iterative, conversational approach.
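A rough sketch of how the "one off" prompt could be rebuilt on every call, assuming the candidates are kept as (text, score) pairs. `build_one_off_prompt`, the two-decimal score formatting, and the shortened instruction text are illustrative, not the exact code or wording used.

```python
from typing import Iterable, Tuple

HEADER = (
    "I have some texts along with their corresponding scores. "
    "The texts are arranged in ascending order based on their scores "
    "from worst (low score) to best (higher score)."
)

# Abbreviated stand-in for the full instruction paragraph quoted above.
INSTRUCTIONS = (
    "Thoroughly read all of the texts and their corresponding scores.\n"
    "Analyze the texts and their scores to understand what leads to a high score.\n"
    "Write a new text that maximizes the score and is better than all of the provided texts."
)


def build_one_off_prompt(scored: Iterable[Tuple[str, float]], max_words: int = 20) -> str:
    """Build a fresh prompt with all candidates sorted from worst to best."""
    ordered = sorted(scored, key=lambda pair: pair[1])  # ascending by score
    lines = [HEADER, ""]
    for text, score in ordered:
        lines.append(f"Text: {text}")
        lines.append(f"Score: {score:.2f}")
    lines.append("")
    lines.append(INSTRUCTIONS)
    lines.append(f"Ideally the new text should be under {max_words} words.")
    return "\n".join(lines)
```

Each round you would score the model's new text, append it to the list, and call `build_one_off_prompt` again, so no conversation history is carried over between calls.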

Also, the constraint on how long the "new text" can be is important, as all models tend to write longer and longer prompts with each iteration.