| HN Mirror



	by mitjam 346 days ago
	Really love this. Would love to see an actual end-to-end example video of you creating, planning, and implementing a task using your preferred models and apps.


mrlesk 346 days ago

Will definitely do. I am also planning to run a benchmark with various models to see which one is most effective at building a full product, starting from a PRD and using backlog to manage tasks.

bazooka5798 346 days ago

I'd love to see OpenRouter connectivity, so I can try non-Claude models for some of the planning parts of the cycle.

westurner 346 days ago

Is there an established benchmark for building a full product?

- SWE-bench leaderboard: https://www.swebench.com/

- Which metrics for e.g. "SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork"? https://news.ycombinator.com/item?id=43101314

- MetaGPT, MGX: https://github.com/FoundationAgents/MetaGPT :

> Software Company as Multi-Agent System

> MetaGPT takes a one line requirement as input and outputs user stories / competitive analysis / requirements / data structures / APIs / documents, etc. Internally, MetaGPT includes product managers / architects / project managers / engineers. It provides the entire process of a software company along with carefully orchestrated SOPs.

- Mutation-Guided LLM-based Test Generation: https://news.ycombinator.com/item?id=42953885

- https://news.ycombinator.com/item?id=41333249 :

- codefuse-ai/Awesome-Code-LLM > Analysis of AI-Generated Code, Benchmarks: https://github.com/codefuse-ai/Awesome-Code-LLM :

> 8.2 Benchmarks: Integrated Benchmarks, Evaluation Metrics, Program Synthesis, Visually Grounded Program Synthesis, Code Reasoning and QA, Text-to-SQL, Code Translation, Program Repair, Code Summarization, Defect/Vulnerability Detection, Code Retrieval, Type Inference, Commit Message Generation, Repo-Level Coding

- underlines/awesome-ml/tools.md > Benchmarking: https://github.com/underlines/awesome-ml/blob/master/llm-too...

- formal methods workflows, coverage-guided fuzzing: https://news.ycombinator.com/item?id=40884466

- "Large Language Models Based Fuzzing Techniques: A Survey" (2024) https://arxiv.org/abs/2402.00350

Leave_OAI_Alone 346 days ago

You have compiled an interesting list of benchmarks and adjacent research. The implicit question is whether an established benchmark for building a full product exists.

After reviewing all this, what is your actual conclusion, or are you asking? Is the takeaway that a comprehensive benchmark exists and we should be using it, or is the takeaway that the problem space is too multifaceted for any single benchmark to be meaningful?

westurner 345 days ago

The market - actual customers - is probably the best benchmark for a product.

But then outstanding liabilities due to code quality and technical debt aren't costed in by the market.

There are already code quality metrics.

SAST and DAST tools can score or fix code as part of an LLM-driven development loop.
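
As a rough illustration of that kind of loop, here is a minimal sketch. It assumes a toy AST check standing in for a real SAST tool and a canned `fake_model` standing in for an LLM call; both are hypothetical placeholders, not any real tool's API.

```python
import ast

# Toy stand-in for a SAST rule set: builtin calls treated as risky (assumption).
RISKY_CALLS = {"eval", "exec"}


def static_findings(source: str) -> list[str]:
    """Return one finding per direct call to a risky builtin."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in RISKY_CALLS:
                findings.append(f"line {node.lineno}: call to {node.func.id}()")
    return findings


def fake_model(prompt: str, findings: list[str]) -> str:
    """Hypothetical placeholder for an LLM call; returns canned code."""
    if findings:
        # Pretend the model reacted to the scanner feedback.
        return "def parse(s):\n    return int(s)\n"
    return "def parse(s):\n    return eval(s)\n"


def develop(prompt: str, max_rounds: int = 3) -> tuple[str, list[str]]:
    """Generate code, scan it, and feed findings back until the scan is clean."""
    findings: list[str] = []
    code = ""
    for _ in range(max_rounds):
        code = fake_model(prompt, findings)
        findings = static_findings(code)
        if not findings:
            break
    return code, findings


if __name__ == "__main__":
    code, findings = develop("parse an integer from a string")
    print(code)
    print("remaining findings:", findings)
```

The number of remaining findings per round is the "score" here; a real setup would swap in an actual scanner and model, and would probably cap rounds the same way.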

Formal verification is maybe the best code quality metric.

Is there more than Product-Market fit and infosec liabilities?
