I've worked with LLMs for the better part of the last couple of years, including on evals, but I still don't understand a lot of what's being suggested. What exactly is a "custom annotation tool", and what is it annotating?
Concrete example from my own workflows: in my IDE, whenever I accept or reject a FIM (fill-in-the-middle) completion, I capture that data (the prefix, the suffix, the completion, and the thumbs up/down signal) and put it in a database. The resulting dataset is annotated such that I can use it for analysis, debugging, finetuning, prompt management, etc. The "custom" tooling part in this case is that I'm using a forked version of Zed that I've customized in part for this purpose.
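To make the capture step concrete, here is a minimal sketch of what storing that feedback could look like. It is not the actual code in my fork: the table name `fim_feedback`, the column names, and the helper functions are all hypothetical, and it assumes a local SQLite database is good enough.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the feedback store. The schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS fim_feedback (
            id         INTEGER PRIMARY KEY,
            prefix     TEXT NOT NULL,     -- code before the cursor
            suffix     TEXT NOT NULL,     -- code after the cursor
            completion TEXT NOT NULL,     -- what the model proposed
            accepted   INTEGER NOT NULL CHECK (accepted IN (0, 1))
        )
        """
    )
    return conn


def record_feedback(conn, prefix, suffix, completion, accepted):
    """Store one accept/reject event; the editor would call this on each decision."""
    conn.execute(
        "INSERT INTO fim_feedback (prefix, suffix, completion, accepted) "
        "VALUES (?, ?, ?, ?)",
        (prefix, suffix, completion, int(accepted)),
    )
    conn.commit()


def acceptance_rate(conn):
    """Fraction of completions that were accepted, or None if nothing is stored."""
    row = conn.execute("SELECT AVG(accepted) FROM fim_feedback").fetchone()
    return row[0]
```

Once the rows exist, the same table can feed analysis queries or be exported as finetuning examples, for instance by selecting only the accepted rows.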
Typically, you would collect a ton of execution traces from your production app. Annotating them can mean a lot of different things, but often it means some mixture of automated scoring and manual review. At the earliest stages, you're usually annotating common modes of failure, so you can say things like "In 30% of failures, the retrieval component of our RAG app is grabbing irrelevant context" or "In 15% of cases, our chat agent misunderstood the user's query and did not ask clarifying questions."
You can then create datasets out of these traces, and use them to benchmark improvements you make to your application.
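As a rough sketch of how annotated traces turn into those percentages and into a benchmark set: the `AnnotatedTrace` type, its field names, and the failure-mode labels below are all made up for illustration, and real traces would carry much more (retrieved context, tool calls, timings).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotatedTrace:
    """One production trace plus a reviewer's label (hypothetical shape)."""
    trace_id: str
    user_input: str
    output: str
    failure_mode: Optional[str] = None  # None means the trace was judged fine


def failure_breakdown(traces):
    """Share of each failure mode among the failed traces only."""
    modes = [t.failure_mode for t in traces if t.failure_mode is not None]
    total = len(modes)
    if total == 0:
        return {}
    return {mode: count / total for mode, count in Counter(modes).items()}


def build_benchmark(traces, failure_mode):
    """Collect the inputs that triggered one failure mode, to re-run after a fix."""
    return [
        {"trace_id": t.trace_id, "input": t.user_input}
        for t in traces
        if t.failure_mode == failure_mode
    ]
```

The breakdown tells you which failure mode to work on first, and the benchmark for that mode gives you a fixed set of inputs to re-run before and after each change.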