Hacker News
by panabee 328 days ago
This is long overdue for biomedicine.

Even Google DeepMind's relabeled MedQA dataset, created for MedGemini in 2024, has flaws.

Many healthcare datasets/benchmarks contain dirty data because accuracy incentives are absent and few annotators are qualified.

We had to pay Stanford MDs to annotate 900 new questions to evaluate frontier models and will release these as open source on Hugging Face for anyone to use. They cover VQA and specialties like neurology, pediatrics, and psychiatry.

If labs want early access, please reach out. (Info in profile.) We are finalizing the dataset format.

Unlike general LLMs, where noise is tolerable and sometimes even desirable, training on incorrect/outdated information may cause clinical errors, misfolded proteins, or drugs with off-target effects.

Complicating matters, shifting medical facts may invalidate training data and model knowledge. What was true last year may be false today. For instance, in April 2024 the U.S. Preventive Services Task Force reversed its longstanding advice and now urges biennial mammograms starting at age 40 -- down from the previous benchmark of 50 -- for average-risk women, citing rising breast-cancer incidence in younger patients.
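
A minimal sketch of how a benchmark could record this kind of shift, assuming each answer carries a validity window. The class, field, and function names are hypothetical and are not the format of the dataset mentioned above; the day-of-month values and the start of the earlier window are placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DatedAnswer:
    """One answer to a benchmark question, valid for a bounded period."""
    answer: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still current"

    def covers(self, day: date) -> bool:
        """True if this answer was in force on `day` (end date is exclusive)."""
        return self.valid_from <= day and (self.valid_to is None or day < self.valid_to)


def answer_as_of(history: list[DatedAnswer], day: date) -> Optional[str]:
    """Return the answer in force on `day`, or None if no window covers it."""
    for item in history:
        if item.covers(day):
            return item.answer
    return None


# Question: "At what age should average-risk women start biennial mammograms?"
# Assumption: exact cut-over dates are placeholders; only "April 2024" is from the text.
START_AGE = [
    DatedAnswer("50", date(2016, 1, 1), date(2024, 4, 1)),
    DatedAnswer("40", date(2024, 4, 1)),
]

print(answer_as_of(START_AGE, date(2023, 6, 1)))
print(answer_as_of(START_AGE, date(2025, 1, 1)))
```

Scoring a model against the answer in force at a chosen date, rather than against a single frozen label, is one way to keep an older benchmark usable after guidance changes.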

8 comments

This is true for every subfield I have been working on for the past 10 years. The dirty secret of ML research is that Sturgeon's law applies to datasets as well - 90% of data out there is crap. I have seen NLP datasets with hundreds of citations that were obviously worthless as soon as you put the "effort" in and actually looked at the samples.
100% agreed. I also advise you not to read many cancer papers, particularly ones investigating viruses and cancer. You would be horrified.

(To clarify: this is not the fault of scientists. This is a byproduct of a severely broken system with the wrong incentives, which encourages publication of papers and not discovery of truth. Hug cancer researchers. They have accomplished an incredible amount while being handcuffed and tasked with decoding the most complex operating system ever designed.)

> this is not the fault of scientists. This is a byproduct of a severely broken system with the wrong incentives, which encourages publication of papers and not discovery of truth

Are scientists not writing those papers? There may be bad incentives, but scientists are responding to those incentives.

That is axiomatically true, but both harsh and useless, given that (as I understand from HN articles and comments) the choice is "play the publishing game as it is" vs "don't be a scientist anymore".
I agree, but there is an important side-effect of this statement: it's possible to criticize science, without criticizing scientists. Or at least without criticizing rank and file scientists.

There are many political issues where activists claim "the science has spoken." When critics respond by saying, "the science system is broken and is spitting out garbage", we have to take those claims very seriously.

That doesn't mean the science is wrong. Even though the climate science system is far from perfect, climate change is real and human made.

On the other hand, some of the science on gender medicine is not as established as medical associations would have us believe (yet; this might change in a few years). But that doesn't stop reputable science groups from making false claims.

If we’re not going to hold any other sector of the economy personally responsible for responding to incentives, I don’t know why we’d start with scientists. We’ve excused folks working for Palantir around here - is it that the scientists aren’t getting paid enough for selling out, or are we just throwing rocks in glass houses now?
Valid critique, but one addressing a problem above the ML layer at the human layer. :)

That said, your comment has an implication: in which fields can we trust data if incentives are poor?

For instance, many Alzheimer's papers were undermined after journalists unmasked foundational research as academic fraud. Which conclusions are reliable and which are questionable? Who should decide? Can we design model architectures and training to grapple with this messy reality?

These are hard questions.

ML/AI should help shield future generations of scientists from poor incentives by maximizing experimental transparency and reproducibility.

Apt quote from Supreme Court Justice Louis Brandeis: "Sunlight is the best disinfectant."

Not an answer, but a contributory idea: meta-analysis. There are plenty of strong meta-analyses out there, and one of the things they tend to end up doing is weighing the methodological rigour of the papers along with the overlap they have with the combined question being analyzed. Could we use this weighting explicitly in the training process?
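
One hedged reading of "use this weighting in training" is a per-example loss weight. The sketch below assumes every training example already carries a non-negative rigour score for its source paper; how to obtain that score is the hard part and is not addressed here. The function and parameter names are hypothetical.

```python
import math


def weighted_log_loss(probs, labels, rigor):
    """Binary cross-entropy where each example counts in proportion to `rigor`.

    probs  : predicted probability that the claim is true, one per example
    labels : 1 if the source paper asserts the claim, else 0
    rigor  : non-negative methodological-quality score of the source paper
    """
    total_weight = sum(rigor)
    if total_weight <= 0:
        raise ValueError("at least one example needs a positive rigor score")
    eps = 1e-12  # keeps log() finite when a probability is exactly 0 or 1
    total = 0.0
    for p, y, w in zip(probs, labels, rigor):
        p = min(max(p, eps), 1.0 - eps)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / total_weight


# Two papers disagree about one claim; the model sides with the rigorous one.
probs, labels = [0.9, 0.9], [1, 0]
print(weighted_log_loss(probs, labels, rigor=[0.9, 0.1]))  # rigour-weighted
print(weighted_log_loss(probs, labels, rigor=[1.0, 1.0]))  # every paper equal
```

With the rigour weights, siding with the stronger paper is penalized less than it would be if both papers counted equally. The choice of weights still encodes someone's judgement about which methods are sound.
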
Thanks. This is helpful. Looking forward to more of your thoughts.

Some nuance:

What happens when the methods are outdated/biased? We highlight a potential case in breast cancer in one of our papers.

Worse, who decides?

To reiterate, this isn’t to discourage the idea. The idea is good and should be considered, but doesn’t escape (yet) the core issue of when something becomes a “fact.”

Scientists are responding to the incentives of a) wanting to do science, b) for the public benefit. There was one game in town to do this: the American public grant scheme.

This game is being undermined and destroyed by infamous anti-vaxxer, non-medical expert, non-public-policy expert RFK Jr.[1] The disastrous cuts to the NIH's public grant scheme are likely to amount to $8,200,000,000,000 ($8.2 trillion USD) in terms of years of life lost.[2]

So, should scientists not write those papers? Should they not do science for public benefit? These are the only ways to not respond to the structure of the American public grant scheme. It seems to me that, if we want better outcomes, then we should make incremental improvements to the institutions surrounding the public grant scheme. This seems far more sensible than installing Bobby Brainworms to burn it all down.

[1] https://youtu.be/HqI_z1OcenQ?si=ZtlffV6N1NuH5PYQ

[2] https://jamanetwork.com/journals/jama-health-forum/fullartic...

> This is true for every subfield I have been working on for the past 10 years

Hasn’t data labelling being the bulk of the work been true for every research endeavour since forever?

If you download data sets for classification from Kaggle or CIFAR or search ranking from TREC it is the same. Typically 1-2% of judgements in that kind of dataset are just wrong so if you are aiming for the last few points of AUC you have to confront that.
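
A quick simulation of why that matters, as a sketch: it uses plain accuracy on binary labels as a simpler stand-in for AUC, and assumes errors are random, independent flips. A hypothetical model that is never wrong is scored against labels corrupted at a given rate.

```python
import random


def measured_accuracy(true_labels, predictions, flip_rate, seed=0):
    """Accuracy of `predictions` scored against labels corrupted at `flip_rate`."""
    rng = random.Random(seed)
    # Each binary label is flipped independently with probability `flip_rate`.
    noisy = [1 - y if rng.random() < flip_rate else y for y in true_labels]
    hits = sum(p == y for p, y in zip(predictions, noisy))
    return hits / len(noisy)


rng = random.Random(42)
truth = [rng.randint(0, 1) for _ in range(100_000)]
perfect = list(truth)  # hypothetical model that always matches the true label

# The measured score lands near 1 - flip_rate even though the model is perfect.
print(measured_accuracy(truth, perfect, flip_rate=0.02))
```

Under these assumptions, a 2% error rate in the judgements caps the measurable score at roughly 98%, so differences between models above that level are mostly noise in the labels.
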
I still want to jump off a bridge whenever someone thinks they can use the twitter post and movie review datasets to train sentiment models for use in completely different contexts.
To elaborate, errors go beyond data and reach into model design. Two simple examples:

1. Nucleotides are a form of tokenization and encode bias. They're not as raw as people assume. For example, classic FASTA treats modified and canonical C as identical. Differences may alter gene expression -- akin to "polish" vs. "Polish".

2. Sickle-cell anemia and other diseases are linked to nucleotide differences. These single nucleotide polymorphisms (SNPs) mean hard attention for DNA matters and single-base resolution is non-negotiable for certain healthcare applications. Latent models have thrived in text-to-image and language, but researchers cannot blindly carry these assumptions into healthcare.
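
Both points can be sketched in a few lines. Assumptions: the lowercase `m` symbol for a modified cytosine is invented for this sketch and is not part of FASTA, the nine-base fragment is a short illustrative window around the sickle-cell codon (GAG to GTG), and non-overlapping k-mers stand in for any coarser-than-base tokenizer.

```python
from typing import Sequence

METHYL_C = "m"  # hypothetical symbol for a modified C; plain FASTA has none


def to_fasta_alphabet(seq: str) -> str:
    """Collapse the modified base onto canonical C, as an A/C/G/T string must."""
    return seq.replace(METHYL_C, "C")


def kmers(seq: str, k: int) -> list[str]:
    """Non-overlapping k-mer tokens; a trailing partial k-mer is kept."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def differing_positions(a: Sequence, b: Sequence) -> list[int]:
    """Indices at which two equal-length sequences (or token lists) differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# 1. Two different molecules become the same string once written as plain bases.
print(to_fasta_alphabet("ACmGT") == to_fasta_alphabet("ACCGT"))

# 2. One substituted base: visible per base, coarser per token.
normal, sickle = "CCTGAGGAG", "CCTGTGGAG"
print(differing_positions(normal, sickle))                      # base index
print(differing_positions(kmers(normal, 3), kmers(sickle, 3)))  # token index
```

Per-base comparison pinpoints the substituted nucleotide, while the k-mer view only reports that one whole token changed. That is the sense in which single-base resolution is lost as tokens get coarser.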

There are so many open questions in biomedical AI. In our experience, confronting them has prompted (pun intended) better inductive biases when designing other types of models.

We need way more people thinking about biomedical AI.

> What was true last year may be false today. For instance, ...

Good example of a medical QA dataset shifting but not a good example of a medical "fact" since it is an opinion. Another way to think about shifting medical targets over time would be things like environmental or behavioral risk factors changing.

Anyways, thank you for putting this dataset together, certainly we need more third-party benchmarks with careful annotations done. I think it would be wise if you segregate tasks between factual observations of data, population-scale opinions (guidelines/recommendations), and individual-scale opinions (prognosis/diagnosis). Ideally there would be some formal taxonomy for this eventually like OMOP CDM, maybe there is already in some dusty corner of pubmed.
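
A rough sketch of what that segregation could look like in a benchmark's schema, so scores can be reported per category. The three categories come from the paragraph above; the class names, field names, and placeholder items are hypothetical and not tied to OMOP CDM or any existing standard.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class ClaimKind(Enum):
    OBSERVATION = "factual observation of data"
    GUIDELINE = "population-scale opinion (guideline/recommendation)"
    CLINICAL_JUDGEMENT = "individual-scale opinion (prognosis/diagnosis)"


@dataclass(frozen=True)
class BenchmarkItem:
    question: str
    answer: str
    kind: ClaimKind


def accuracy_by_kind(items, predictions):
    """Exact-match accuracy reported separately for each ClaimKind present."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item, pred in zip(items, predictions):
        totals[item.kind] += 1
        hits[item.kind] += int(pred == item.answer)
    return {kind: hits[kind] / totals[kind] for kind in totals}


# Placeholder items; real questions and answers would come from annotators.
items = [
    BenchmarkItem("q1", "A", ClaimKind.OBSERVATION),
    BenchmarkItem("q2", "B", ClaimKind.GUIDELINE),
    BenchmarkItem("q3", "C", ClaimKind.GUIDELINE),
    BenchmarkItem("q4", "D", ClaimKind.CLINICAL_JUDGEMENT),
]
for kind, acc in accuracy_by_kind(items, ["A", "B", "X", "D"]).items():
    print(f"{kind.name}: {acc:.2f}")
```

Reporting the categories separately means a guideline change only disturbs the guideline score, not the score on factual observations.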

What if there is significant disagreement within the medical profession itself? For example, isotretinoin is prescribed for acne in many countries, but in other countries the drug is banned or access restricted due to adverse side effects.
Would not one approach be to just ensure the system has all the data? Relevance to address systems, side effects, and legal constraints. Then when making a recommendation it can account for all factors, not just prior use cases.
If you agree that ML starts with philosophy, not statistics, this is but one example highlighting how biomedicine helps model development, LLMs included.

Every fact is born an opinion.

This challenge exists in most, if not all, spheres of life.

I think an often overlooked aspect of training data curation is the value of accurate but oblique data. Much of the “emergent capabilities” of LLMs comes from data embedded in the data, implied or inferred semantic information that is not readily obvious. Extraction of this highly useful information, in contrast to specific factoids, requires a lot of off axis images of the problem space, like a CT scan of the field of interest. The value of adjacent oblique datasets should not be underestimated.
I noticed this when adding citations to wikipedia.

You may find a definition of what a "skyscraper" is, by some hyperfocused association, but you'll get a bias towards a definite measurement like "skyscrapers are buildings between 700m and 3500m tall", which might be useful for some data mining project, but not at all what people mean by it.

The actual definition is not in a specific source but in the way it is used in other sources like "the Manhattan skyscraper is one of the most iconic skyscrapers", on the aggregate you learn what it is, but it isn't very citable on its own, which gives WP that pedantic bias.

Synthetic data generation techniques are increasingly being paired with expert validation to scale high-quality biomedical datasets while reducing annotation burden - especially useful for rare conditions where real-world examples are limited.
Centaur Labs does medical data labeling https://centaur.ai/
Isn't labelling medical data for ai illegal as unlicensed medical practice?

Same thing with law data

Paralegals and medical assistants don’t need licenses
I think their question is a good one, and not being taken charitably.

Let's take the medical assistant example.

> Medical assistants are unlicensed, and may only perform basic administrative, clerical and technical supportive services as permitted by law.

If they're labelling data that's "tumor" or "not tumor", with any agency in the process, does that fit within their unlicensed scope? Or, would that labelling be closer to a diagnosis?

What if the AI is eventually used to diagnose, based on data that was labeled by someone unlicensed? Should there need to be a "chain of trust" of some sort?

I think the answer to liability will be all on the doctor agreeing/disagreeing with the AI...for now.

To answer this, I would think we should consider other cases where someone could practice medicine without legally doing so. For example, could they tutor a student and help them? Go through unknown cases and make judgement, explaining their reasoning? As long as they don't oversell their experience in a way that might be considered fraud, I don't think this would be practicing medicine.

It does open something of a loophole. Oh, I wasn't diagnosing a friend, I was helping him label a case just like his as an educational experience. My completely IANAL guess would be that judges would look on it based on how the person is doing it, primarily if they are receiving any compensation or running it like a business.

But wait... the example the OP was talking about is doing it like a business and likely doesn't have any disclaimers properly sent to the AI, so maybe that doesn't help us decide.

A bit simpler, but if they are training the AI to answer law questions or medical questions (specific to a case, and not general), then that's what I would argue is unlicensed practice.

Of course it's the org and not the individual who would be practicing, as labelling itself is not practicing.

No.
Illegal?