> limited to a single markdown file of instructions
A single file of instructions is common in most benchmark papers, e.g. Terminal Bench. We also have very complicated prompts, such as this one: https://www.skillsbench.ai/tasks/shock-analysis-supply
> opaque verifier
Could you specify which tasks' verifiers are unclear or defective for benchmarking purposes?