by mmooss 32 days ago
The OP evaluates what it has developed with great rigor and describes the evaluation in detail. What do you feel is missing?
2 comments

It actually does not -- and that is part of the issue. Consumers just see "oh gosh this looks very detailed" and superficially think someone must have spent quite a bit of time on this and that it works well.

Skills are just prompts -- and most of what I am seeing are people using AI to write the (quite verbose) prompts. There should be a test, somewhere, that shows "my prompt does better than XYZ other prompt" for some model and some specific inputs. This is what is called a benchmark.
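As a rough illustration of the kind of test meant here, the sketch below scores two prompts on the same inputs and reports a mean score for each. Everything in it is an assumption made for illustration: `toy_model` is a deterministic stand-in rather than a real LLM call, `exact_match` is the simplest possible scorer, and the prompts and cases are made up.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical interfaces: a model maps (prompt, input_text) to output text,
# and a scorer maps (output, expected) to a number where higher is better.
Model = Callable[[str, str], str]
Scorer = Callable[[str, str], float]


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the output equals the expected text, ignoring outer whitespace."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def benchmark(
    model: Model,
    prompts: Dict[str, str],
    cases: Sequence[Tuple[str, str]],
    score: Scorer,
) -> Dict[str, float]:
    """Run every prompt over the same cases and return the mean score per prompt."""
    if not cases:
        raise ValueError("need at least one test case")
    results = {}
    for name, prompt in prompts.items():
        total = sum(score(model(prompt, text), expected) for text, expected in cases)
        results[name] = total / len(cases)
    return results


def toy_model(prompt: str, text: str) -> str:
    """Stand-in for a real model call (assumption): uppercases only when asked to."""
    return text.upper() if "uppercase" in prompt.lower() else text


# Made-up prompts and cases, only to show the shape of the comparison.
PROMPTS = {
    "baseline": "Repeat the input.",
    "candidate": "Return the input in uppercase.",
}
CASES = [("abc", "ABC"), ("Hi", "HI"), ("OK", "OK")]

if __name__ == "__main__":
    for name, mean in benchmark(toy_model, PROMPTS, CASES, exact_match).items():
        print(f"{name}: {mean:.2f}")
```

With a real model, `toy_model` would be replaced by an actual API call and each case would likely need several samples, since outputs are not deterministic.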

It may work well, I don't know. Just asking Claude "hey help me iterate on a paper" works pretty well out of the box too. Call me skeptical that this works in any substantive way until I see some evidence that it does.

I agree writing a good benchmark takes time. How do people know if all these prompts they are writing are any good though? An edit could cause an overall regression. Adding too much info could just waste space in the context window, or cause the model to go in loops between the different skills, or cause plenty of other errors.
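One way to check whether an edit caused a regression is a paired comparison: score the old and new prompt on the same cases and ask whether the losses outnumber the wins by more than chance would explain. The sketch below uses an exact two-sided sign test. The function name and the example scores are hypothetical, and this is one possible check, not a method anyone in this thread describes using.

```python
from math import comb
from typing import Sequence, Tuple


def sign_test(old: Sequence[float], new: Sequence[float]) -> Tuple[int, int, float]:
    """Paired sign test on per-case scores.

    Returns (wins, losses, p_value), where a win is a case on which the new
    prompt scored higher. Ties are dropped. The p-value is the exact two-sided
    binomial probability under the null that wins and losses are equally likely.
    """
    if len(old) != len(new):
        raise ValueError("old and new must be scored on the same cases")
    wins = sum(1 for o, n in zip(old, new) if n > o)
    losses = sum(1 for o, n in zip(old, new) if n < o)
    trials = wins + losses
    if trials == 0:
        return wins, losses, 1.0  # no case changed, so no evidence either way
    k = min(wins, losses)
    tail = sum(comb(trials, i) for i in range(k + 1)) / 2 ** trials
    return wins, losses, min(1.0, 2 * tail)


if __name__ == "__main__":
    # Made-up per-case scores for the prompt before and after an edit.
    old_scores = [1, 1, 1, 1, 1, 1, 0, 1]
    new_scores = [0, 0, 0, 0, 0, 0, 0, 1]
    wins, losses, p = sign_test(old_scores, new_scores)
    print(f"wins={wins} losses={losses} p={p:.4f}")
```

A small p-value with more losses than wins suggests the edit made things worse on this case set. It says nothing about inputs the cases do not cover.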

I really do run a/b tests. I really do test, and validate.

I do not believe giving you that information would be honest. If I did, I would be pretending that you will get the same experience.

Maybe you're using a different model. Maybe you have stuff in your CLAUDE.md that will break it.

To me, it is not honest to give you confidence in it when no one can be confident in it.

> It actually does not

I read it, right there in the OP. Tests and test results, including discussions of flaws in earlier designs and how they are improved here. What are you talking about?

I seriously doubt any human has ever read the full readme for the project.