It actually does not -- and that is part of the issue. Consumers just see "oh gosh, this looks very detailed" and superficially assume someone must have spent quite a bit of time on it and that it works well.
Skills are just prompts -- and most of what I am seeing is people using AI to write the (quite verbose) prompts. There should be a test, somewhere, that shows "my prompt does better than XYZ other prompt" for some model and some specific inputs. That is what a benchmark is.
It may work well, I don't know. Just asking Claude "hey help me iterate on a paper" works pretty well out of the box too. Call me skeptical that this works in any substantive way until I see some evidence that it does.
I agree writing a good benchmark takes time. How do people know if all these prompts they are writing are any good though? You could make an edit that causes a regression overall. Or you could add so much info that it just wastes space in the context window, or causes the model to loop between the different skills, or introduces plenty of other errors.
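To make the "my prompt does better than XYZ other prompt" test concrete, here is a rough sketch of what I mean. Everything in it is made up for illustration: `toy_model` is a stand-in so it runs offline (swap in a real model call), the substring check is the crudest possible scorer, and two cases are nowhere near enough to conclude anything.

```python
from typing import Callable, Sequence

# Assumption: a "model" is any function from a full prompt string to an output string.
Model = Callable[[str], str]
# Assumption: a test case is (input text, substring a correct answer must contain).
Case = tuple[str, str]


def score(prompt: str, model: Model, cases: Sequence[Case]) -> float:
    """Fraction of cases whose model output contains the expected substring."""
    hits = sum(expected in model(f"{prompt}\n\n{text}") for text, expected in cases)
    return hits / len(cases)


def compare(baseline: str, candidate: str, model: Model, cases: Sequence[Case]) -> dict[str, float]:
    """Score both prompts on the same cases; a negative delta is a regression."""
    base = score(baseline, model, cases)
    cand = score(candidate, model, cases)
    return {"baseline": base, "candidate": cand, "delta": cand - base}


def toy_model(full_prompt: str) -> str:
    """Hypothetical stand-in: only answers correctly when the prompt says 'concise'."""
    question = full_prompt.rsplit("\n\n", 1)[-1]
    answers = {"2+2": "4", "capital of France": "Paris"}
    if "concise" in full_prompt:
        return answers.get(question, "")
    return "I'm not sure"


cases = [("2+2", "4"), ("capital of France", "Paris")]
result = compare(
    "Answer the question.",
    "Answer the question. Be concise.",
    toy_model,
    cases,
)
print(result)
```

Rerun the same comparison after every edit to the prompt and you at least find out when a change made things worse, instead of guessing.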
I read it, right there on the OP. Tests and test results, including discussions of flaws with earlier designs and how they are improved here. What are you talking about?