by tipoffdosage904 22 days ago
I gave four local models a production A/B test to analyze — connect to Supabase, pull live experiment data, run Welch's t-test + chi-square, build charts, output a structured summary, and make a grounded ship/don't-ship recommendation.
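For anyone unfamiliar with the two tests named above, here is a minimal standard-library sketch of what they compute. It is not the benchmark's code. The function names and inputs are made up for illustration, and the Supabase query and charting steps are left out. The chi-square p-value uses the closed form that holds only for a 2x2 table (1 degree of freedom).

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard error of each group mean (sample variance / n).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def chi_square_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square (no continuity correction) for a 2x2 conversion table."""
    observed = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the upper tail has a closed form via erfc.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Turning the Welch statistic into a p-value needs the t distribution's CDF, which the standard library does not provide. In practice something like `scipy.stats.ttest_ind(a, b, equal_var=False)` covers that step.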

Caveat: I don't really NEED an LLM to automate experiment analysis, nor do I think it's a good real-world LLM use case, but this was a very interesting test of complex multi-step tool calling and hallucination resistance over a long procedural task. In short, these small models (35B parameters and under) are capable enough for such narrow agentic tasks.

Results on my M4 Pro MacBook (48GB):
- Qwen 3.6 35B A3B (MoE): 100/100 (perfect)
- Qwen 3.6 27B MTP: 90/100 (wrong completion rates)
- Qwen 3.5 9B: 90/100 (same error as the 27B)
- Qwopus 3.5 9B Coder (fine-tune of Qwen 3.5 9B): 60/100 first run, 80/100 on rerun (same prompt, different mistakes)

Some interesting learnings: even good models make the same mistakes a junior DS would make. This is very specific to clickstream metric definitions, where you have to decide whether a metric should be computed on session-level or user-level data. And of course, the LLM experience is and has always been one of non-determinism, so you can give them the same task multiple times and get different results.
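To make the session-level vs user-level distinction concrete, here is a toy sketch. The rows, column layout, and function names are hypothetical and not taken from the benchmark, and it assumes one row per session. The point is only that the same data gives different "completion rates" depending on the unit you divide by.

```python
# Hypothetical clickstream rows: (user_id, session_id, completed)
rows = [
    ("u1", "s1", False),
    ("u1", "s2", True),
    ("u2", "s3", False),
    ("u2", "s4", False),
    ("u3", "s5", True),
]


def session_level_rate(rows):
    """Share of sessions that ended in a completion."""
    return sum(done for _, _, done in rows) / len(rows)


def user_level_rate(rows):
    """Share of users with at least one completed session."""
    completed_by_user = {}
    for user, _, done in rows:
        completed_by_user[user] = completed_by_user.get(user, False) or done
    return sum(completed_by_user.values()) / len(completed_by_user)


print(session_level_rate(rows))  # 2 of 5 sessions completed
print(user_level_rate(rows))     # 2 of 3 users completed at least once
```

Which definition is right depends on the experiment's randomization unit and on what the metric is supposed to mean, so the definition has to be pinned down before any test is run.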

The post includes the full benchmark prompt, scoring methodology, and a link to the live workbench.