by ravenstine 1589 days ago
Hmmm... I'm skeptical. Not necessarily because I don't think that pneumonia could be detected by testing VOCs in breath, but because I'm currently working on a project that uses sensors to do breath analysis and my amateur research has informed me that it's fairly hard to get right (which is why my primary goal is to identify deltas rather than achieve numerical accuracy).

For one, VOCs can be present in breath for reasons other than an infection in the lung, and VOCs are incredibly hard to differentiate with just a sensor. The fact that they tend to be faint in human breath even at their highest (in contrast to O2 and CO2) doesn't help. Even the most expensive PID sensors for VOCs (they get up into the several hundreds a pop) can't really tell you whether the predominant gas is acetone or alcohol or acetaldehyde or hydrogen sulfide. So you've got to figure out whether the presence of VOCs is truly an anomaly and not just a part of ketosis, in which case you will also need to measure at least VeO2 to see whether the VOCs correspond with the Respiratory Quotient.

The "e-nose" project, as described in the MakeZine article, doesn't appear to do that. It does have an alcohol sensor. But these sensors aren't particularly sophisticated. They use semiconductors with heating elements to detect the presence of gases, and there is almost certainly some overlap between the alcohol and VOC sensors.

If VOCs are produced by pneumonia, then yes, it's conceivable that even just the VOC sensor alone would detect this. But can this group of sensors used in the e-nose differentiate pneumonia from catabolism?

Maybe? ¯\_(ツ)_/¯

After all, this thing uses AI. And maybe AI can recognize something that a human can't by simply looking at a line graph. I dunno... Such things should be tested against known inputs before being suggested to diagnose anything.


In my experience, "AI can extract more information from sensors" is mostly a myth.

An example is the SCIO sensor ( https://nocamels.com/2019/03/scio-kickstarter-darling-promis... ) which was a cheap handheld spectrometer that claimed to accurately determine the nutritional information of any food you pointed it at.

One good way to debunk this is to measure raw sensor output and compute Mutual Information (which incorporates sensor noise/variability). If the sensor only produces X bits of information, no algorithm will be able to extract more classes than that. In the SCIO case it was just under 8 bits total of information. So something like a poor color sensor. You could train on apples and oranges and maybe do an investor demo, but it's not actually going to do anything useful (as the Kickstarter crowd soon learned).
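
For anyone who wants to try that check, here is a minimal sketch, assuming you have logged raw readings together with the true class of each sample. The function name, the toy data and the bin width are all made up for illustration, and this is a simple plug-in (histogram) estimator, which tends to overestimate when samples are few.

```python
import math
from collections import Counter

def mutual_information_bits(labels, readings, bin_width):
    """Plug-in estimate of I(label; reading) in bits.

    Readings are quantized into bins of `bin_width`. The width should be
    no finer than the sensor's noise floor, otherwise noise gets counted
    as information.
    """
    bins = [math.floor(r / bin_width) for r in readings]
    n = len(labels)
    label_counts = Counter(labels)
    bin_counts = Counter(bins)
    joint_counts = Counter(zip(labels, bins))
    mi = 0.0
    for (lab, b), count in joint_counts.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (count / n) * math.log2(
            count * n / (label_counts[lab] * bin_counts[b])
        )
    return mi

# Hypothetical toy data: two foods whose readings overlap heavily.
labels = ["apple"] * 4 + ["orange"] * 4
readings = [0.1, 0.2, 0.3, 1.1, 0.2, 1.2, 1.3, 1.4]
mi = mutual_information_bits(labels, readings, bin_width=1.0)
print(f"{mi:.3f} bits")
```

In this toy case the estimate comes out well under 1 bit, which is less than what you need to tell even two equally likely classes apart reliably.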

True, but there are things where AI can help. For example, in the domain of electronic gas sensors, AI can be used to disentangle confounding variables like gas, humidity and temperature. All three affect the sensor output in a nonlinear fashion, and an ANN can learn the transfer function that extracts the (almost) pure gas response.
Yes, combining relatively independent sensors will increase the MI.
The sensors are not independent.

Gas sensing is really tricky. Metal oxide gas sensors respond nonlinearly to all three of gas, temperature, and humidity. Plus they drift. AI can help with the nonlinear response. Drift hasn't been solved yet, as far as I know.

Understood. The point was that if the sensors are identical, they don't give any more information; some independence is needed.
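
A toy illustration of that point, computed exactly rather than estimated. Everything here is an assumption for the sake of the example: a binary "gas present" state, sensors that report the wrong state 20% of the time, and the two idealized extremes of fully shared errors versus fully independent errors. Real gas sensors with partly shared errors are not covered by it.

```python
import math

def mi_bits(joint):
    """Exact I(X;Y) in bits from a dict {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

FLIP = 0.2  # hypothetical chance that a sensor reports the wrong state

def p_read(reading, gas):
    return 1 - FLIP if reading == gas else FLIP

one, identical, independent = {}, {}, {}
for gas in (0, 1):  # gas absent / present, equally likely
    for a in (0, 1):
        p = 0.5 * p_read(a, gas)
        one[(gas, a)] = p
        # A second sensor making exactly the same errors just repeats `a`.
        identical[(gas, (a, a))] = p
        for b in (0, 1):
            # A second sensor whose errors are independent of the first.
            independent[(gas, (a, b))] = p * p_read(b, gas)

mi_one = mi_bits(one)
mi_identical = mi_bits(identical)
mi_independent = mi_bits(independent)
print(mi_one, mi_identical, mi_independent)
```

The duplicated sensor carries exactly the same information about the gas as a single one, while the sensor with independent errors adds to it.
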
Is the limit: A) sensor resolution, B) NN architecture and/or algorithm, C) training sample size, D) training data (labeling, segmentation) quality, or E) it doesn't sufficiently predict the variance with low enough error?

New NN models are able to do more with the exact same sensor data.

You cannot conjure information out of thin air. Even with infinite data and a hypothetical wormhole CPU that runs everything in O(1) and solves the halting problem, you still couldn't do this. So to answer your question, the reason is effectively (A). Sensor resolution might be the wrong term but it's the general idea.
How much information content is there in DNA (and RNA)? How do creatures know or learn what not to eat given limited available sensor data?
How much information content is there in DNA? 2 bits per base, before compression. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3220916/

How do creatures know what to eat? Evolution solved that for most creatures, so their sensors don't have to work as hard at runtime. And in other cases, some number of members of a population of creatures will die before the population learns the food is poisonous. Our sensors, and the information processing systems that manage their outputs, are remarkably efficient data processing engines that do the equivalent of approximating and predicting, often well beyond what the most advanced deep learning systems are capable of doing now.
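
The 2 bits per base figure above is just log2 of a four-letter alphabet. As a back-of-the-envelope sketch, here is what that gives as an upper bound before compression. The genome length is my assumption (a commonly quoted rough figure of 3.1 billion bases for one copy of the human genome), not something from the thread.

```python
import math

BASES = "ACGT"
bits_per_base = math.log2(len(BASES))  # 4 equally likely symbols -> 2 bits

def raw_capacity_megabytes(num_bases):
    """Upper bound on uncompressed information, in megabytes (10**6 bytes)."""
    return num_bases * bits_per_base / 8 / 1e6

# Assumed length: roughly 3.1 billion bases (not taken from the thread).
human_mb = raw_capacity_megabytes(3.1e9)
print(f"{human_mb:.0f} MB")
```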

So, sensor resolution is higher, there are multiple fields being integrated, in a massively-parallel spreading-activation Biological Neural Network, and that's how blank-slate creatures just know?

Is there enough information content - per the Shannon entropy definition or otherwise - in DNA and/or RNA to code for the survival-selected traits that

I'm not sure that the (Shannon entropy, MIC, Kolmogorov) information content of the samples is the limit of any given network trained therefrom. Is there anything to be gained from upsampling and adding e.g. Gaussian blur (noise)? Maybe it's feature engineering, maybe it's expert methods bias, maybe it's just sensor fusion; that's the magic noise.

Because they receive additional information from the environment through highly sensitive sensors producing massive amounts of information. Whereas the information you get from a cheap sensor effectively discretizes to a few bits.
You can, but it's called making stuff up.
Having designed sensor systems, I've lost more than a few hours of my life having to explain "why do we need that big expensive sensor when you can do everything with machine learning?"

The idea that a magic math technique can replace expensive sensors predates NNs by a few decades. Dozens of start-ups have gone bankrupt trying to do non-invasive blood glucose with portable sensors.

This is a very crude but at least conceptually useful rule of thumb: it's all of the above, but ultimately the analysis result is a mathematical function of an array of values produced by the sensor. Very few math functions lack the property that variation in the output increases with the level of variation in the input.
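
One way to make that rule of thumb concrete is first-order error propagation: the spread in the output is roughly the slope of the function times the spread in the input. A minimal sketch follows; the calibration curve and the noise levels are invented for illustration and don't describe any real sensor.

```python
import math

def propagated_sigma(f, x, sigma_x, h=1e-6):
    """First-order estimate of output std dev: |f'(x)| * sigma_x."""
    slope = (f(x + h) - f(x - h)) / (2 * h)  # central difference
    return abs(slope) * sigma_x

# Hypothetical calibration curve mapping sensor volts to ppm.
def volts_to_ppm(v):
    return 10.0 * math.exp(2.0 * v)

noisy = propagated_sigma(volts_to_ppm, 1.0, sigma_x=0.05)    # cheap sensor
quiet = propagated_sigma(volts_to_ppm, 1.0, sigma_x=0.005)   # expensive sensor
print(noisy, quiet)
```

Ten times the input noise gives ten times the output noise here, and no choice of `volts_to_ppm` downstream undoes that.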

AI can extract information from a sensor that is 'obvious' when you look at it by eye, yet that no easy combination of frequency filters and a carefully tuned threshold can extract reliably.
AI can detect more information in the whole dataset because, for example, it has the whole breathe-in, breathe-out cycle in view. Fungi residing in the mouth would be present as background noise throughout breathing in and out. But fungal products present at the end of an exhalation are most likely to originate from the lungs, because the mouth contamination has been "flushed" out by the breath itself.
Priors can make sensor information more useful maybe, but that is just knowledge that helps first limit possibilities before taking a measurement. Priors also work against you when you are trying to sense something novel that might indicate a thing you don't expect.

An aside on sparsity priors (which that article uses): reality is actually a lot less sparse than the researchers' models would have you believe. If most dimensions are not truly zero (e.g., have some small noise present) these sparsity methods fall apart. That's why you (never?) see the methods deployed in actual products.

Specifically, the support determination step usually breaks down in the epsilon-sparse case, and you also get "noise folding".

It looks like the principle is that a machine learning model trained on the combined output of four different kinds of gas sensors can discover correlations between unintentional characteristics of the sensors. For example, the manufacturer of an ethanol or nitrogen dioxide sensor is not going to specify anything about how it responds to vanillin, but it seems plausible to me that the relationship between their responses contains some hidden information that could help to discriminate between vanillin and eugenol. With enough different sensors, there's quite a bit of information to be found in mining their undefined behavior.

That is to say, you can treat the sensor reading as being completely meaningless and skip interpreting it as indicating VOC levels. You're just using the sensors as black boxes that produce arbitrary values with the property that exposure to organic vapor changes the output "somehow", and letting model training find some meaning in it.
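
A minimal sketch of that black-box idea, assuming four sensors and using a nearest-centroid classifier as a stand-in. I don't know what model the e-nose project actually uses, and every number below is made up.

```python
import math

def centroid(rows):
    """Mean of each column across a list of equal-length readings."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def train(samples):
    """samples: {label: [reading_vector, ...]} -> {label: centroid}"""
    return {label: centroid(rows) for label, rows in samples.items()}

def classify(model, reading):
    """Return the label whose centroid is closest to the reading."""
    return min(model, key=lambda label: math.dist(model[label], reading))

# Made-up raw outputs from four unrelated sensors (units are irrelevant,
# the values are never interpreted as concentrations).
training = {
    "vanillin": [[0.82, 0.10, 0.40, 0.33], [0.80, 0.12, 0.43, 0.30]],
    "eugenol":  [[0.79, 0.11, 0.61, 0.18], [0.83, 0.09, 0.58, 0.21]],
}
model = train(training)
guess = classify(model, [0.81, 0.10, 0.57, 0.20])
print(guess)
```

Note that the first two sensors barely move between the two vapors in this toy data; the separation comes entirely from the relationship between the last two, which is the kind of unspecified behavior being mined.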

> With enough different sensors, there's quite a bit of information to be found in mining their undefined behavior.

It sounds like you would need to be exceptionally careful that your meta-process didn't "find" some signal in pure noise (via re-using test sets and so on).
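
Here is a small simulation of how that failure mode shows up, as a sketch under deliberately artificial assumptions: the labels are coin flips, every candidate "model" is just a fixed stream of random guesses, and the "meta-process" keeps whichever candidate scores best on one small, reused test set. The seeds and set sizes are arbitrary.

```python
import random

def accuracy(model_seed, labels):
    """Score a skill-free 'model': a fixed stream of coin-flip guesses."""
    guesser = random.Random(model_seed)
    hits = sum(guesser.randint(0, 1) == y for y in labels)
    return hits / len(labels)

rng = random.Random(12345)
reused_test = [rng.randint(0, 1) for _ in range(40)]   # labels are pure noise
fresh_test = [rng.randint(0, 1) for _ in range(400)]   # held back, never reused

# Try 2000 candidates and keep the one that looks best on the reused set.
best_seed = max(range(2000), key=lambda s: accuracy(s, reused_test))
reused_acc = accuracy(best_seed, reused_test)
fresh_acc = accuracy(best_seed, fresh_test)
print(f"reused test set: {reused_acc:.2f}, fresh test set: {fresh_acc:.2f}")
```

The winning candidate looks clearly better than chance on the set it was selected on and falls back to roughly a coin flip on data it has never seen, even though nothing here can predict anything.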

> It sounds like you would need to be exceptionally careful that your meta-process didn't "find" some signal in pure noise (via re-using test sets and so on).

It sounds like you’re actually talking about ordinary levels of carefulness in this (ML) context.

That would be great. I'm no ML expert, but my impression was that standards varied widely from team to team.
Does this mean that each sensor cluster has to be trained independently?
When this technique is performing at its best, I would expect so. The old story of the evolved FPGA comes to mind: https://www.damninteresting.com/on-the-origin-of-circuits/

You're intentionally depending on the "personality" of each gas sensor to get data measuring unknown features, so you can't expect consistency from sample to sample. Anything that was completely portable between different sensors would inherently be less powerful.

Most high-accuracy systems incorporate an onboard calibration target of some kind. Could be a gas cell (either sealed or consumable) or a special lamp etc. Or you buy an instrument that comes with calibration coefficients from the manufacturer. For example if you sell spectrometers, you put in the grating and manually adjust it for the desired range. This is the case for cheaper instruments (eg Ocean Optics) as well as expensive bespoke systems which are all hand built. Even if the grating and mirror mounts are fixed, the tolerance in manufacturing is rarely good enough that calibration isn't required. It's way cheaper to do some relatively low accuracy machining and then just epoxy all the screws down.

In this case you'd probably calibrate each sensor to a standard chemical sample and then use the calibration output. You could train on that, not the raw samples and then you have a model that works on all devices.
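
As a sketch of what that could look like, here is a two-point calibration per device: one reading in clean air, one against a standard of known concentration. The linear response, the raw counts and the 100 ppm standard are all assumptions for illustration; real metal-oxide sensors are only roughly linear at best.

```python
def two_point_calibration(raw_zero, raw_span, span_ppm):
    """Return a function mapping this device's raw counts to ppm.

    Assumes the device responds linearly between the two reference
    points, which is an idealization.
    """
    gain = span_ppm / (raw_span - raw_zero)
    return lambda raw: (raw - raw_zero) * gain

# Two hypothetical devices exposed to clean air and a 100 ppm standard.
device_a = two_point_calibration(raw_zero=120, raw_span=620, span_ppm=100.0)
device_b = two_point_calibration(raw_zero=95, raw_span=845, span_ppm=100.0)

# Different raw counts, same calibrated value for the same sample.
ppm_a = device_a(370)   # halfway between A's reference points
ppm_b = device_b(470)   # halfway between B's reference points
print(ppm_a, ppm_b)
```

A model trained on the calibrated values sees the same input from either device, at the cost of throwing away whatever per-device quirks the raw counts contained.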

He was specifically looking to identify fungal pneumonia, not just any old kind of pneumonia.

The linked Wikipedia article indicates mortality in immunocompromised patients can be as high as 90 percent. That sentence fits with my general impression that fungal pneumonia is both real serious shit and also typically found in people with advanced cases of other serious medical problems, like AIDS or cystic fibrosis.

It sounds reasonably plausible to me that it's feasible to detect fungal pneumonia specifically this way with some reasonable confidence level.

From working on the environmental sensor side of things, I'd concur. The VOCs will be able to be picked up, but the cross talk will be huge across other VOCs that don't themselves indicate pneumonia. There isn't one VOC, there's thousands. False positives are written all over this. This is the very same approach Theranos took. On a science level, sure, technically possible maybe. You'll even get boolean outputs. But on an engineering and regulatory level, you're in for a world of pain without the spectral tech, which is still 2-5 years away from being worth basing human lives on.
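
To put a rough shape on the false-positive worry, here is the standard Bayes' rule calculation for how often a positive result is a true one. The sensitivity, specificity and prevalence are hypothetical numbers picked only to show the effect, not measurements of any e-nose.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test) via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# All three numbers are hypothetical.
ppv = positive_predictive_value(sensitivity=0.90, specificity=0.90,
                                prevalence=0.01)
print(f"{ppv:.1%} of positives are real")
```

With those made-up inputs, roughly 11 out of 12 positive readings would be false alarms, even though the device is "90% accurate" in both directions.
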
Well, one thing is that the teen in question probably has little to no exposure to a cohort of humans who have fungal pneumonia to test this on.
This is what I was wondering too. To train a model you need lots of data. How do you get it for such a project?
You would partner with doctors at a research institution that had lots of patients of this type, and the doctors would need to know how to run a clinical trial. But realistically, you would do this in any number of ways using existing samples before running a trial. Tissue samples are fairly easy to get.
I have a mid-price gadget for measuring indoor air quality - it detects VOC and formaldehyde, along with PM2.5 / PM10.

It also detects alcohol from drinking a couple of beers as a dangerous increase in formaldehyde...