Hacker News new | ask | show | jobs
by TuringNYC 1023 days ago
Dear @garrinm Firstly, thank you for sharing this! We built something like this at work, and i'd love if you could share details on your effort.

I tried to follow: https://github.com/clint-llm/clint-cli/blob/main/clint/scrip... but it wasnt clear -- are you indexing the medical literature in entirety or just abstracts?

We tried to do this with arXiv and others, but getting commercial rights was difficult and we got stuck on that, could you share which medical literature source you used.

I tried to follow the code and it looks like you embed, so i'm assuming you're using RAG, is that it, or are you trying to fine-tune also? I didnt see any fine-tune code. (We didnt fine tune due to cost)

Did you benchmark different embedding chunk sizes, etc? (Yes for us! We've tried a matrix search of chunk sizes, including sliding window and found the sweet spot for different types of media, usually a single paragraph)

Did you manage to get access to a fine-tuned model like MedPALM and benchmark that? (we are still awaiting access)

1 comments

Hi, thanks for the comment.

I put this project pretty quickly and I don't want to pretend there is tremendous depth behind any of the decisions I made :/.

For now the only source is the Stats Pearl book published on ncbi.nlm.nih.gov (the only place this is mentioned is here: https://github.com/clint-llm/clint-cli/blob/main/README.md#u...). It contains about 11,000 peer reviewed articles about anatomy and conditions: https://www.ncbi.nlm.nih.gov/books/NBK430685/. The copyright terms are CC BY-NC-ND 4.0. I might add some Wikipedia articles to this in the future.

I chunk the documents by section, and embed only the first 2048 tokens that fit in the OpenAI embeddings. I'm using OpenAI for embedding as opposed to something like all-minilm-l6-v2 because I don't want to have to ship a model to the clients (transfer times could be large and supporting this would increase the complexity of the library).

I didn't experiment with different chunk sizes, and I suspect something smaller would be more beneficial as you point out. But it would also complicate the logic, and most choices I made in this project were to remove complexity and get this done quickly. If I revisit this I might chunk by paragraph on your advice :).

RAG is indeed what is being used. But it a few different ways. The diagnoses are refined using a pretty straightforward RAG prompt: consider these notes ... consider this diagnosis ... can you improve on it etc.

But in a way the entire program is RAG-based. In most prompts some documents are added to the system message for context. It's not clear that the information in the documents is always used, but based on a bit of experimentation it seems to improve various responses.

I have no plans to fine tune. I'm not sure how beneficial would be fine tuning here. The model needs a fair bit of general knowledge to reason about descriptions of symptoms. Fine tuning could over-specialize it. And hallucinations could come up even with fine-tuning, so you would probably want a RAG-like prompt to get it to focus on real details.

This this is very much a hobby, so I haven't dug deep enough to look into other models. But I'd be _very_ curious to see how GPT 3.5 with RAG compares to vanilla MedPALM. In my experience GPT 3.5 can reason quite well about with the right documents in the context.

Thanks for the details. We're also trying to compare GPT 4 vs GPT 3.5 vs LLAMA2 and we've been putting together "exams" though one other thing on our mind is what happens when the next foundational model comes and scrapes our exams and the exam questions make their way into the training set.

This really does make me wonder about all the attempts to administer the USMLE to LLMs -- what are the chances the USMLE administered was uniquely created vs just put together from exam questions online? Ultimately no LLM is required if these exam questions are in the training set...just need a k-v lookup :-)