Show HN: Which LLM Finds Obscure Knife-Brand URLs Cheapest? (8-Model Benchmark)

2 points by p-s-v 375 days ago

Hi HN,

I’m building *new.knife.day* (https://new.knife.day), a crowd-sourced database of every cutlery maker—from Al Mar to brands so small they barely show up on Google. That means I need an automated way to fetch each brand’s official website, even for fringe names like “Actilam” or “Aiorosu Knives”.

So I threw the task at eight web-enabled LLMs via OpenRouter:

  • gpt-4o and gpt-4o-mini
  • claude-sonnet-4
  • gemini-2.5-pro and gemini-2.0-flash
  • llama-3.1-70b
  • qwen-2.5-72b
  • perplexity sonar-deep-research

Prompt: Return *only* JSON { brand, official_url, confidence }
Data set: 10 obscure knife brands
Scoring: exact domain = correct; “no official site” (with reason) = correct
Costs: OpenRouter prices on 31 May 2025 (Perplexity billed separately)
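
For anyone wanting to reproduce the scoring rule, here is a minimal Python sketch of "exact domain = correct". It is not the benchmark's actual code: the `www.` stripping, the `reason` field used for the "no official site" case, and the names `domain_of` and `is_correct` are all assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Lower-cased host without a leading 'www.' (assumed normalization)."""
    if "://" not in url:
        url = "https://" + url  # let urlparse see a scheme so it finds the host
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_correct(answer: dict, expected_url: Optional[str]) -> bool:
    """Score one model answer against the hand-checked expected URL.

    expected_url is None when the brand has no official site.
    """
    got = answer.get("official_url")
    if expected_url is None:
        # "no official site" only counts if the model also says why.
        # The 'reason' key is hypothetical; the post does not name the field.
        return got is None and bool(answer.get("reason"))
    return got is not None and domain_of(got) == domain_of(expected_url)
```

With placeholder domains, `is_correct({"official_url": "http://example.com"}, "https://www.example.com/")` passes, while an answer pointing at `example.org` fails. Subdomains other than `www.` are treated as different domains here, which may be stricter or looser than what the benchmark did.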

Highlights
----------

  • Perplexity hit 10/10 but cost $9.42 (860 k tokens!).
  • GPT-4o-mini & Llama-3.1-70B got 9/10 for ~2 ¢ per correct URL.
  • Gemini Flash managed 7/10 for $0.001 total—great if you can QA the misses.
  • Half of Gemini 2.5 Pro’s replies were HTML tables my parser rejected.
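
To put those numbers on one scale, the sketch below computes cost per correct URL from the figures quoted above. The 2-cent value is the rounded figure from the list, not a measured total, so the resulting ratio (roughly 47x) is only approximate; `cost_per_correct` is a name chosen for this example.

```python
def cost_per_correct(total_usd: float, correct: int) -> float:
    """Total spend divided by the number of correct answers."""
    if correct <= 0:
        raise ValueError("need at least one correct answer")
    return total_usd / correct


# Figures quoted in the highlights above.
perplexity = cost_per_correct(9.42, 10)    # 10/10 for $9.42
gemini_flash = cost_per_correct(0.001, 7)  # 7/10 for $0.001 total
cheap_tier = 0.02                          # "~2 cents per correct URL", rounded

ratio = perplexity / cheap_tier

print(f"Perplexity:   ${perplexity:.3f} per correct URL")
print(f"Gemini Flash: ${gemini_flash:.5f} per correct URL")
print(f"Perplexity vs. the 2-cent tier: about {ratio:.0f}x")
```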

Full table, code, and raw logs are in the post (and on GitHub).

Take-aways
----------

  1. 90 % accuracy + quick human review often beats 100 % accuracy that costs
     45× more.
  2. Structured output is part of model quality—validate JSON on arrival.
  3. Promo pricing moves fast; always ping the price API before large runs.
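
As an illustration of take-away 2, here is one way to validate a reply on arrival. This is a sketch, not the parser used in the benchmark: it assumes `official_url` is a string or null and that `confidence` is numeric, neither of which the post specifies, and `parse_reply` is a hypothetical name.

```python
import json

REQUIRED_KEYS = {"brand", "official_url", "confidence"}


def parse_reply(raw: str) -> dict:
    """Return the parsed object, or raise ValueError if the reply is malformed."""
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError as exc:
        # Covers prose, Markdown, or HTML tables returned instead of JSON.
        raise ValueError(f"reply is not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    url = data["official_url"]
    if url is not None and not isinstance(url, str):
        raise ValueError("official_url must be a string or null")
    conf = data["confidence"]
    # Assumption: confidence is a number; bool is excluded because it is an int subclass.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        raise ValueError("confidence must be a number")
    return data
```

Rejected replies can be logged and counted as misses, or retried once with a stricter prompt; which of those is fairer to the model is a judgment call.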

Next step: wire GPT-4o-mini into *new.knife.day* so visitors get verified manufacturer links. Crawling ~250 brands now costs under $5.

Curious what you’d improve, and which model you’d bet on for similar “find the canonical URL” tasks. AMA on the setup, prompts, or results!