Hacker News
by ogarten 710 days ago
How does this surprise anyone?

Medical data for AI training is almost always sourced in some more or less shady country because they lack any privacy regulations. It's then annotated by a horde of cheap workers who may or may not have advanced medical training.

Even "normal medicine" is heavily biased towards men who fit the norm, which is why a lot of conditions are not detected early enough in women or in people who do not match that norm.

Next thing: Doctors often think that their annotations are the absolute gold standard, but they don't necessarily recognize everything that is in an X-ray or an MRI.

A few years ago we tried to build synthetic data for this exact purpose by simulating medical images for 3D body models with different diseases, and nobody we talked to cared about it, because "we have good data".

3 comments

Yep, you nailed it. You really don't have to think hard about why AI which only learns from what we feed it and what it can access has gaps and biases more pronounced than the real world. AI lives in the internet world, it's trained on horrible cesspools of anonymous text like 4chan and Reddit. No wonder it will be biased. If you only tried to feed it sanitized data, you wouldn't have enough to get the results we get now.
> You really don't have to think hard about why AI which only learns from what we feed it

Sadly I'd say that people are no different.

> it's trained on horrible cesspools of ....

So it's really not the future of AI we should be worrying about...

People are different because they have human social interaction offline, where terminally online stereotypes and biases often fall apart.
I have been quite anti-HIPPA since realizing how 'privacy' was the excuse to stunt science.

My conspiracy: With massive medical data, ML/AI would have been 'discovered'/built sooner. Limiting the data makes it so only a few people can be specialists under the supervision of medical cartels.

You have misunderstood. The HIPAA (not "HIPPA") Privacy Rule doesn't stunt science. It's easy to request patient consent for using PHI, and properly de-identified data isn't even considered to be PHI.

https://privacyruleandresearch.nih.gov/pr_08.asp

https://www.hhs.gov/hipaa/for-professionals/privacy/index.ht...

Great, where can I find your medical data on the web? Care to give a URL? Would be perfect to include your salary.
Not OP. If my doctors and hospitals would give it to me in a good and easily collatable format, then I and who knows how many other people would gladly donate it to science or for research purposes. Heck, some people donate actual body parts and their entire bodies to science, so it's not a big stretch to say some would donate this sort of personal information for a purpose of their own choosing.

This isn't a fully settled debate (including your salary example), so you can't just assume one side is right and argue it as if it's some unquestionable human right.

Many healthcare providers now have APIs for patients to download their data in standard HL7 FHIR and/or CDA format. You should ask about this, and if they don't have it then consider switching providers. All modern EHRs have that functionality built in so for providers it's simply a matter of switching it on.

However, most of that data is useless for research purposes. Even if the format complies with industry standards, the quality is often bad, with many data elements lacking consistent coding. You can't just feed clinical data from a bunch of different random sources into a research project and expect to get accurate results: it's a garbage in / garbage out issue. That's why most clinical research studies involve just a few provider organizations, so that the researchers can properly configure the systems and train the clinicians on consistent data entry.
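
To make the "inconsistent coding" point a bit more concrete, here is a rough sketch of the kind of check a researcher might run on a downloaded FHIR export before trusting it. It only assumes the standard FHIR JSON layout (a Bundle with `entry[].resource`, and `Observation.code.coding[].system`). The `coding_report` function, the sample bundle, and the local code system URI are all made up for illustration, and real exports are far messier than this.

```python
from collections import Counter

# Canonical FHIR system URI for LOINC codes.
LOINC = "http://loinc.org"


def coding_report(bundle):
    """Count how the Observation resources in a FHIR Bundle are coded.

    Hypothetical helper, not part of any FHIR library.
    """
    counts = Counter()
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "Observation":
            continue  # skip Patient, Encounter, etc.
        codings = resource.get("code", {}).get("coding", [])
        if any(c.get("system") == LOINC and c.get("code") for c in codings):
            counts["loinc"] += 1         # standardized, comparable across sites
        elif codings:
            counts["other_system"] += 1  # coded, but with a local/other system
        else:
            counts["text_only"] += 1     # free text only, no code at all
    return counts


# Made-up sample: the same lab test recorded three different ways.
sample_bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "example"}},
        {"resource": {"resourceType": "Observation",
                      "code": {"coding": [{"system": LOINC, "code": "718-7"}]}}},
        {"resource": {"resourceType": "Observation",
                      "code": {"coding": [{"system": "urn:example:local-lab-codes",
                                           "code": "HGB"}]}}},
        {"resource": {"resourceType": "Observation",
                      "code": {"text": "hemoglobin"}}},
    ],
}

report = coding_report(sample_bundle)
print(dict(report))
```

If a large share of observations lands in the `other_system` or `text_only` buckets, pooling data across providers means mapping every local convention by hand, which is roughly why studies stick to a few well-configured sites.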

Deanonymized? For sale?
I know GPT-4o can diagnose medical images. Is their model likely to be using the same kind of datasets as these models for medical systems?