Hacker News
by barfbagginus 772 days ago
The limitations section mentions that the study omitted RAG and focused on base performance as a key bottleneck. But given the usefulness of RAG and the weakness of base LLMs for this kind of task, base recall performance is not necessarily relevant, nor is it necessarily a key bottleneck preventing accurate coding.

Adding even some slapdash RAG attempts would have provided a more realistic and still disappointing result, since assisted LLMs are still only around 75% accurate (see the RAG paper another author shares in their comment). I suppose the space of possible RAG solutions makes it hard to represent fairly, so it is reasonably left to further research.

I appreciate testing base performance, with a STRONG proviso that a relevant conclusion requires more work, along the lines of RAG and other tools. I wish this were communicated more clearly in the intro and abstract, and I wonder if the authors had some unstated reasons for not being more blatant about that.

The study does provide real value. Its benchmark is open source and extensive, so it should be easy to adapt and replicate in other systems. It could become a target benchmark for tool- and retrieval-enhanced medical coding LLMs.