by klenwell 2002 days ago
The problem had to do not just with data analysis per se, but with what database researchers call “provenance” — broadly, where did data arise, what inferences were drawn from the data, and how relevant are those inferences to the present situation? While a trained human might be able to work all of this out on a case-by-case basis, the issue was that of designing a planetary-scale medical system that could do this without the need for such detailed human oversight.

I'm not a data scientist and I've never encountered the term "provenance" before, but I've run into the problem he talks about in the wild here and there and have searched for a good way to describe it. His ultrasound example is a great, chilling example of it.

I also like the term "Intelligence Augmentation" (IA). I've worked for a couple companies who liberally sprinkled the term AI in their marketing content. I always rolled my eyes when I came across it or it came up in say a job interview. What we were really doing, more practically and valuably, was this: IA through II (Intelligent Infrastructure), where the Intelligent Infrastructure was little more than a web view on a database that was previously obscured or somewhat arbitrarily constrained to one or two users.

8 comments

The IA terminology brings to mind the classic "Augmenting Human Intellect"[1] essay by Doug Engelbart (famous for giving "The Mother of all Demos"[2])

[1] https://www.dougengelbart.org/content/view/138

[2] https://en.wikipedia.org/wiki/The_Mother_of_All_Demos

Provenance is an idea that shows up in multiple fields. I first encountered it in discussions of archeology. But then it showed up in, for example, https://www.ralfj.de/blog/2020/12/14/provenance.html discussing how improper handling of pointer provenance can cause code to get miscompiled.

https://en.wikipedia.org/wiki/Provenance gives more on the term and the way it shows up.

You'll hear the term provenance used quite a bit on PBS's long running Antiques Roadshow.
Provenance is also used in wine and art where a chain of custody, which the value largely hinges on, must be through trustworthy people or institutions.

More interestingly, both wine and art have had this reliance on provenance widely exploited for massive profit while posh people think they're enjoying something exclusive.

This is how I came to know the term. I was a fan of the PBS weekend lineups during the 90s.
Data provenance is a standard term of art in machine learning and data science, a “data 101” kind of thing, with many OSS and vendor tools built up to solve provenance problems, like DVC, Pachyderm, kubeflow, mlflow, neptune, etc.
worked with stats, machine learning and data science for 10+ years now. never heard the term used until now. (that's not to say I'm not familiar with the things the term refers to, indeed, most of the intellectual frameworks I've worked with break each of the things that make up provenance into far more fine grained concepts).

course, I've also never heard of or touched the software you listed there either, but that may be because I don't view the data science and machine learning I'm interested in as being about specific software or vendor software...

sounds more like database lingo to me...

It's a common term used in data governance. It's found less in the academic literature, and more in software demos and vendor brochures. You'll also hear "data lineage", which is the context in which the term arises.

"Provenance" just means where the data came from. [1]

It's one of those shibboleths and terms of art used by people in industry. If you go to trade-shows you'll hear it being used -- it's worth knowing if nothing else but for its sociological value among the data software tools crowd.

Side: it's a little like the word "inference" being used as a verb by folks in AI (example usage: we use GPUs to speed up model "inferencing") -- in AI, to inference means to "predict". It's a term of art. If someone with a traditional statistics background went to a deep learning conference, they are likely to be very confused because in traditional statistics, inference means to obtain parameters θ in a model y = f(x,θ), whereas in AI, inferencing refers to obtaining y.

[1] https://en.wikipedia.org/wiki/Data_lineage#Data_provenance
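
A minimal sketch of the two senses of "inference", assuming a simple linear model y = θ₀ + θ₁x. The data, function names, and the choice of ordinary least squares are all made up for illustration:

```python
# Hypothetical data, generated without noise from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0 + 2.0 * x for x in xs]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(theta, x):
    """Apply an already-fitted model to a new input."""
    intercept, slope = theta
    return intercept + slope * x


# Statistics sense of "inference": obtain the parameters theta from data.
theta = fit_line(xs, ys)

# Deep learning sense of "inference": obtain y for new inputs, theta held fixed.
y_new = [predict(theta, x) for x in [4.0, 5.0]]

print(theta)
print(y_new)
```

The first step is what a statistician would call inference; the second is what a GPU vendor would.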

I've also worked as a data scientist for a few years and have never heard or used the word "provenance" in a DS context. Some people used it in the oil & gas industry when talking about where reservoir sands came from, but that usually garnered an eye-roll and mental translation to more everyday language.
Regardless of the term chosen, the concept of 'provenance' described here is the essential purpose behind the scientific notebooks used daily by experimentalists in industry and academia. Without thoroughly recording the bases for your experiment it almost surely will not be reproducible.

Where I work (a large pharmaceutical company), these notebooks are taken very seriously by biologists, chemists, and chemical engineers, and increasingly are shaping the mindset of our data scientists (who have yet to adopt them).

Given the longstanding practice of documenting experiment design and method, I think it's long overdue that the exploratory analysis of experiment-based data also adopt more rigorous governance, so that necessity and sufficiency are established when drawing inferences from experiment, especially when the data used was not originally intended to answer the question currently posed.

It’s shocking if you’ve worked professionally in statistics and not heard about data provenance.

A few publications from ~2011-2015 period:

http://ceur-ws.org/Vol-1558/paper37.pdf

https://ieeexplore.ieee.org/document/5739644

https://link.springer.com/chapter/10.1007/978-3-642-53974-9_...

Add a variety of additional links dating back a bit further (note the emphasis in this case on research data and tracking state of an experiment).

https://nnlm.gov/data/thesaurus/data-provenance

Data provenance is not a database / data warehouse term. It is uniquely and specifically a basic “101” concept of statistical science and ML / data science, where the custody and tracking of data are specifically tied to iterations of experiments, prototypes and research, for the sake of reproducibility.

If I was interviewing an experienced statistical researcher and they didn’t at least have a working knowledge of the core concepts, that would be a huge red flag.

I'm not saying it doesn't exist, but I am saying it must be jargon used within a particular community or minority subset of general stats/ machine learning/AI. honestly, I still think it's a database/ enterprise term because I've worked for our national statistics office and never heard it in the statistics community either. I have frequently heard data lineage however, but again, that's a database/ enterprise type person lingo: when people use that word I know immediately the background they're coming from.

Another poster mentioned vendor brochures and trade shows, which is in line with my expectations about which community it stems from, and also explains why I've never heard of it because I try to keep away from such environments these days.

Everywhere I've been the things which I take to make up "provenance" have generally been referred to under the simple label of "data quality", with separate subset definitions and measures such as timeliness, source, authority, format, history, suitability, verification, etc.

Of course, that's assuming people even worry about such things. In practice, let's be frank, anyone who's worked with data science knows they actually get shorter shrift than they deserve in practice: I'm probably among a minority of people in the real world who actually take things seriously, and I find myself on a constant crusade to remind people that just because a data point exists in a data set doesn't mean it's useful/ appropriate/ truthful/ unbiased.

data quality is a bit problematic, because I can see it being used by people who think provenance doesn't have anything to do with quality, and from a variety of fields, but it is also infinitely more popular according to historical search trends, and in my last three jobs provenance would fall under the data quality framework.

It’s very, very widely used jargon. I’d put “data provenance” on par with “overfitting” or “GPU model training” in terms of the high, ubiquitous place it occupies in mainstream machine learning.
Sorry, I have to disagree here. It's a term of art in some of the literature, but it's definitely not that widespread, certainly not in consumer tech data science, where I work.
TIL.

I’ve worked as a data engineer for the last two years and never heard of this being used in this context before.

Typically the term “data lineage” is used to mean this in my experience.

I don’t think I’ve ever been in a meeting where someone mentioned provenance except referring to a show about paintings.

Lineage isn't the same thing, being a more specific technical term referring to keeping the history of datasets and where they came from (basically), but people actually say the words “data governance” and “lineage”.

Another important use of data provenance is in GDPR. You have to be able to know the source of each piece of data you use and be able to remove it from storage and backups on request.
Wikipedia says "Provenance is conceptually comparable to the legal term chain of custody." https://en.wikipedia.org/wiki/Provenance
If you (ever) need to update your data, you need to know where you got it from, what was wrong with it originally, and how to pull it again.
Provenance, as a concept and specification, is well established in the digital domain, as described by W3C's PROV specification: https://www.w3.org/TR/prov-overview/ Tracing, auditing, and reproducing artifacts or processes are some applications of provenance that align with the need for explainability in data analytics and data science/AI (XAI).
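
A rough sketch of the core PROV idea in plain Python. It borrows PROV's relation names (`used`, `wasGeneratedBy`, `wasAssociatedWith`), but the triple layout, the file and run identifiers, and the `lineage` helper are assumptions for illustration, not the W3C serialization or any library's API:

```python
# Relations as (subject, relation, object) triples, using PROV relation names.
# Entities are files, activities are runs, and "alice" is an agent (all hypothetical).
relations = [
    ("clean.csv", "wasGeneratedBy", "cleaning-run"),
    ("cleaning-run", "used", "raw.csv"),
    ("cleaning-run", "wasAssociatedWith", "alice"),
    ("model.bin", "wasGeneratedBy", "training-run"),
    ("training-run", "used", "clean.csv"),
]


def lineage(entity, relations):
    """Return every upstream entity that `entity` ultimately depends on."""
    generated_by = {s: o for s, r, o in relations if r == "wasGeneratedBy"}
    used = {}
    for s, r, o in relations:
        if r == "used":
            used.setdefault(s, []).append(o)

    upstream, stack = [], [entity]
    while stack:
        current = stack.pop()
        activity = generated_by.get(current)  # None for source data
        for source in used.get(activity, []):
            if source not in upstream:
                upstream.append(source)
                stack.append(source)
    return upstream


print(lineage("model.bin", relations))
```

Walking the graph backwards like this is the "trace" and "audit" part; re-running the recorded activities in order would be the "reproduce" part.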
We address the problem of adding provenance without rewriting your tensorflow/scikit-learn/pytorch/pyspark application by adding CDC support in the ML stack and collecting all events in a metadata layer, building an implicit provenance graph. It's now part of the open-source Hopsworks platform. See this USENIX OpML'20 talk on it: https://www.youtube.com/watch?v=PAzEyeWItH4
It's weird to me that people build libraries on top of the ML stack to track provenance, when it's really the ML library's job to do that for its inputs. However it is a right pain building it into the ML library as it affects all the interfaces. We build data, model & evaluation provenance objects into our ML library, Tribuo (https://tribuo.org), as a first class part of the library. You can take a provenance and emit a configuration to rerun an experiment just by querying the model object. It is built in Java though, which makes it a little easier to enforce the immutability and type safety you need in a provenance system.

edit: I should add that I'm definitely in favour of having provenance in ML systems, and libraries layered on top are the way that people currently do that. It's just odd that people aren't working on adding that support directly into scikit-learn/TF/pytorch etc.
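
For anyone who wants the general shape of "the model carries its own provenance", here is a rough Python sketch. It is not Tribuo's API (Tribuo is Java); every class, field, and function name below is hypothetical, and the "training" is a placeholder:

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the record cannot be mutated after training
class Provenance:
    data_sha256: str
    trainer: str
    hyperparameters: tuple  # tuple of (name, value) pairs, kept immutable


@dataclass(frozen=True)
class Model:
    weights: tuple
    provenance: Provenance


def train(data: bytes, learning_rate: float, epochs: int) -> Model:
    """Record where the model came from at the moment it is created."""
    prov = Provenance(
        data_sha256=hashlib.sha256(data).hexdigest(),
        trainer="toy-trainer",
        hyperparameters=(("learning_rate", learning_rate), ("epochs", epochs)),
    )
    weights = (0.0,)  # placeholder: a real trainer would fit these
    return Model(weights=weights, provenance=prov)


def to_config(model: Model) -> str:
    """Emit a configuration that could be used to rerun the experiment."""
    prov = model.provenance
    config = {"trainer": prov.trainer, "data_sha256": prov.data_sha256}
    config.update(dict(prov.hyperparameters))
    return json.dumps(config, sort_keys=True)


model = train(b"x,y\n1,2\n", learning_rate=0.1, epochs=5)
print(to_config(model))
```

The point of the frozen dataclasses is that nobody can quietly edit the record after the fact, which is the property that is easier to enforce in a statically typed language.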

MLFlow and TFX try to add some form of provenance by polluting your code with "logging" calls. A good thing MLFlow has added is auto-loggers - we also added them in our Maggy framework ( https://www.logicalclocks.com/blog/unifying-single-host-and-... ).

I totally agree that where you have framework hooks, you should have provenance, but given there's no standard for what provenance is and no de facto open-source platform, the sklearn and tf and pytorch folks rightly steer clear. We see that if you have a shared file system, you can use conventions for path names (features go in 'featurestore', training data in 'training', models in 'models', etc.) to capture a ton of provenance data.
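
A toy sketch of the path-convention idea, assuming paths are given relative to the shared root and that something else (file-system events, job logs) already reports which paths a job read and wrote. The directory names come from the comment above; the function names and example files are hypothetical, and this is not how Hopsworks implements it:

```python
from pathlib import PurePosixPath

# Convention: the first directory under the shared root names the
# kind of artifact stored beneath it.
ARTIFACT_KINDS = {
    "featurestore": "features",
    "training": "training data",
    "models": "model",
}


def classify(path):
    """Return the artifact kind implied by the path, or None if it is unconventional."""
    parts = PurePosixPath(path).parts
    if not parts:
        return None
    return ARTIFACT_KINDS.get(parts[0])


def provenance_edges(reads, writes):
    """Link each recognised input of one job run to each recognised output."""
    inputs = [p for p in reads if classify(p)]
    outputs = [p for p in writes if classify(p)]
    return [(src, dst) for src in inputs for dst in outputs]


edges = provenance_edges(
    reads=["featurestore/users_v1.parquet", "tmp/scratch.txt"],
    writes=["training/users_train.csv"],
)
print(edges)
```

Files outside the convention (the scratch file here) are simply ignored, so the graph only contains artifacts whose role is known from their location.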

I first encountered the term "provenance" in 2007 when I was working on an undergraduate research project at UC Santa Cruz in the area of metadata-based search for the Ceph distributed file system. I particularly remember reading this USENIX ATC 2007 paper: https://www.usenix.org/conference/2007-usenix-annual-technic.... This was my introduction to the concept of provenance.

Professor Margo Seltzer (https://www.seltzer.com/margo/) is a well-known researcher in the area of provenance. I highly recommend reading her papers if you're interested, starting with her USENIX ATC 2006 paper "Provenance-Aware Storage Systems".

"Intelligence Augmentation" makes good PR, like those old Apple commercials, "Be All You Can Be" (no, that's the US Army; Apple's was "The Power To Do Your Best.")

But the money is in replacing humans.

It may sound a bit cheeky, but those humans are us. I was happy when a client asked for something and I realised I didn't have to do it because our machine learning platform already had what was requested.

We had built it precisely to free us from certain repetitive things in machine learning projects [environment set up, near real-time collaboration on notebooks, scheduling long-running notebooks, experiment tracking, model deployment and monitoring]. We used to scramble and do all that, request help from our colleagues and pull them from what they were doing. This was really taxing and bad for morale, jumping around from one context to another.

I had a huge smile contemplating all the work I was about to not do.

There are many things where the humans themselves ought to be "augmented". Case in point, in some projects involving predictive maintenance, the stakes of an incident can be around $100MM and all these processes depend on a human being alert at all times during their very long shift, with a bunch of other things happening simultaneously. This is very stressful and these people actually want to be "augmented". They want something to help them and catch things they would have missed because they haven't had proper sleep or were too busy solving another urgent and important problem. It is the people themselves who come to us and ask us to help them solve these problems.

It may sound cheeky, and in many cases at many companies it is cheeky and it is PR like saying "partners" instead of "drivers", or "dashers" instead of "delivery person". In some cases it really is what happens. At least from my biased perspective with the actual humans who were asking for "augmentation" to do their job.

Which is true, but those pesky humans, it turns out that really replacing them is soooo tricky!! They end up hanging around in the business process spending money and whining about how evil you are, and all the time your competitors are bolstering their employees' capabilities, making them happier and more productive, and pushing wishy-washy messages of corporate social responsibility at your customers.
I think the goal of all technology is not only to reduce cost (replace humans) but to improve the job they do — make them faster, stronger, smarter. IMO the enhancement part of the tech evolution equation has been much underconsidered.

Any time a tool evolves, it changes not only how a task is done but also why. In the case of implementing a business process, the revision is best served by reconsidering why things are done the way they are now and taking the opportunity to evolve the old role into one that makes a richer contribution and opens a new and improved path through the problem space.

That's IA, and IMO, it's the Great White Hope that AI might yet lead to a future world that engages humans more rather than the default dystopia where we're all redundant and irrelevant.

I encountered the term provenance when I was learning Apache NiFi.