Hacker News new | ask | show | jobs
by garrinm 1036 days ago
Hi, thanks for the comment.

I put this project pretty quickly and I don't want to pretend there is tremendous depth behind any of the decisions I made :/.

For now the only source is the Stats Pearl book published on ncbi.nlm.nih.gov (the only place this is mentioned is here: https://github.com/clint-llm/clint-cli/blob/main/README.md#u...). It contains about 11,000 peer reviewed articles about anatomy and conditions: https://www.ncbi.nlm.nih.gov/books/NBK430685/. The copyright terms are CC BY-NC-ND 4.0. I might add some Wikipedia articles to this in the future.

I chunk the documents by section, and embed only the first 2048 tokens that fit in the OpenAI embeddings. I'm using OpenAI for embedding as opposed to something like all-minilm-l6-v2 because I don't want to have to ship a model to the clients (transfer times could be large and supporting this would increase the complexity of the library).

I didn't experiment with different chunk sizes, and I suspect something smaller would be more beneficial as you point out. But it would also complicate the logic, and most choices I made in this project were to remove complexity and get this done quickly. If I revisit this I might chunk by paragraph on your advice :).

RAG is indeed what is being used. But it a few different ways. The diagnoses are refined using a pretty straightforward RAG prompt: consider these notes ... consider this diagnosis ... can you improve on it etc.

But in a way the entire program is RAG-based. In most prompts some documents are added to the system message for context. It's not clear that the information in the documents is always used, but based on a bit of experimentation it seems to improve various responses.

I have no plans to fine tune. I'm not sure how beneficial would be fine tuning here. The model needs a fair bit of general knowledge to reason about descriptions of symptoms. Fine tuning could over-specialize it. And hallucinations could come up even with fine-tuning, so you would probably want a RAG-like prompt to get it to focus on real details.

This this is very much a hobby, so I haven't dug deep enough to look into other models. But I'd be _very_ curious to see how GPT 3.5 with RAG compares to vanilla MedPALM. In my experience GPT 3.5 can reason quite well about with the right documents in the context.

1 comments

Thanks for the details. We're also trying to compare GPT 4 vs GPT 3.5 vs LLAMA2 and we've been putting together "exams" though one other thing on our mind is what happens when the next foundational model comes and scrapes our exams and the exam questions make their way into the training set.

This really does make me wonder about all the attempts to administer the USMLE to LLMs -- what are the chances the USMLE administered was uniquely created vs just put together from exam questions online? Ultimately no LLM is required if these exam questions are in the training set...just need a k-v lookup :-)