As for the issues, there are many. Just quickly a few:
* Data provider has an FTP server, most files are automatically generated, some are hand-named (with inconsistencies). How do you handle (without a lot of effort) a list of exceptions along with the regular files?
* Data provider has a good strict XML schema, but the relevant information for a single item is spread across three files, inside a tar archive. Since there are 500k files inside the archive, you had better not extract it, but process it on the fly.
* Data provider chooses a layout that saves every item in a single XML file, inside 2-3 levels of directories. There are 20M of them. Unzipping the archive alone takes more than a day with default system settings and the usual tools. How do you process these things fast?
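For the tar case, a minimal sketch of processing on the fly with Python's standard `tarfile` module in streaming mode. The naming scheme `<item_id>.<part>.xml` and the function name `iter_items` are assumptions for illustration; the real provider's layout will differ.

```python
import tarfile
import warnings
from collections import defaultdict
from pathlib import PurePosixPath


def iter_items(fileobj, parts_per_item=3):
    """Yield (item_id, {part: bytes}) as soon as all parts of an item were read.

    Assumes hypothetical member names of the form <item_id>.<part>.xml.
    """
    pending = defaultdict(dict)
    # "r|*" reads the archive as a forward-only stream with transparent
    # decompression; nothing is written to disk.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            pieces = PurePosixPath(member.name).name.split(".")
            if len(pieces) != 3:
                continue  # does not match the assumed naming scheme
            item_id, part, _ext = pieces
            # In stream mode the member must be read before advancing.
            pending[item_id][part] = tar.extractfile(member).read()
            if len(pending[item_id]) == parts_per_item:
                yield item_id, pending.pop(item_id)
    if pending:
        warnings.warn(f"{len(pending)} item(s) had missing parts")
```

Only items whose parts have not all arrived yet are held in memory, so this works best when the parts of one item sit close together in the archive.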
There are more subtle issues as well:
* U+FFFD (the Unicode replacement character) regularly occurs in natural language strings. Can you correct these strings?
* File has .csv ending, looks like CSV on first glance, but all the standard RFC compliant parsers choke on it.
* XML file with elements that have RTF markup embedded in them. You need to parse the RTF inside those elements, because it carries relevant information that you need to add to the transformed version.
* Date issues. Inconsistent formats and almost-valid dates.
* Combine data, coming from an API with data fetched from ten different servers to produce a transformed version with a legacy command line application (that might be slow, so you have to split your data first and parallelize the work, combine it and make sure it's complete).
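For the date issues, one possible starting point is to try a short list of known formats and treat everything else, including almost-valid dates such as February 30th, as unparseable so it can be reviewed separately. The format list and the name `parse_date` are assumptions for illustration, not taken from any particular data set.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical list of formats seen in the input; extend as needed.
KNOWN_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%Y/%m/%d")


def parse_date(raw: str) -> Optional[date]:
    """Return a date if `raw` matches a known format, else None."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            # strptime rejects impossible dates such as 2019-02-30.
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None
```

Returning `None` instead of guessing keeps the bad values visible, which matters when you later have to explain to the provider what was dropped.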
I am thinking about a longer article or even a short book about these kinds of data handling and quality questions and what ways there are to address them. Would you read a book like this, and what topic would be the most pressing or relevant?
That kind of book would be a great service to humanity. I don't know if you will sell many, but anyone inventing a new ETL tool would be served well by reading it. Perhaps a paper for a journal like ACM would be a better format. Or you could make it into a wiki. Or an "ETL Nightmares monthly" newsletter, with best user submissions.
Yes, that's ETL. Classic ETL dealt with databases, the modern variant has relaxed this constraint.
As for the zip: We simply "unzip -p" and stream process it carefully (with a custom program reading XML and transforming it). Cuts processing time from hours (extracting the zip and creating all directories, then visiting each file) to minutes (read from a single file).
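A related sketch in Python, for readers who do not want to pipe through `unzip -p`: the standard `zipfile` module can read members directly from the archive without creating any files or directories. This is a swapped-in technique, not the custom program described above, and the name `iter_xml_roots` is made up for illustration.

```python
import xml.etree.ElementTree as ET
import zipfile


def iter_xml_roots(zip_source):
    """Yield (member_name, root_element) for every .xml member.

    `zip_source` may be a path or a seekable file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.endswith(".xml"):
                continue
            # Decompress the member in memory; nothing touches the file system.
            with archive.open(info) as member:
                yield info.filename, ET.parse(member).getroot()
```

One caveat: `zipfile` loads the archive's central directory up front, so for an archive with tens of millions of entries the member list alone may need a lot of memory. I have not measured this at that scale.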
Here's one example where I had to use a kind of ugly hack to make it work with Snakemake, a Python Makefile-style "DAG-of-rules" workflow tool: https://github.com/DarwinAwardWinner/CD4-csaw
Basically, I need to first fetch the metadata on all the samples, and then later group them by treatment based on that metadata. In other words, the structure of later parts of the DAG depends on the results of executing earlier parts of the DAG, so the full structure of the DAG is not known initially. The solution I used was to split the workflow in two: a "pre-workflow workflow" that fetches the sample metadata and then the main workflow which reads the metadata and builds the DAG based on it. See here: https://github.com/DarwinAwardWinner/CD4-csaw/blob/master/Sn...
This is a common pattern that I see when putting together bioinformatics workflows: the full DAG of actions to execute cannot be known until part of the way through executing that DAG. Most workflow tools can't handle this gracefully. Another Python DAG-executor, called doit, can handle this case, by specifying that some rules should not be evaluated until after others have finished running. But it doesn't have some features that I wanted from Snakemake (e.g. compute cluster execution), so I ended up with the above solution instead.
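To make the two-phase idea concrete, here is a tool-independent sketch in plain Python. It is not the code from the linked repository and uses no Snakemake API; the sample records, file names, and function names are all made up for illustration.

```python
import json
from collections import defaultdict
from pathlib import Path


def fetch_metadata(out_path):
    """Phase 1 stand-in: a real workflow would query a remote source here."""
    samples = [  # made-up example records
        {"id": "s1", "treatment": "ctrl"},
        {"id": "s2", "treatment": "drug"},
        {"id": "s3", "treatment": "ctrl"},
    ]
    Path(out_path).write_text(json.dumps(samples))


def build_dag(metadata_path):
    """Phase 2: derive targets and their dependencies from the metadata."""
    samples = json.loads(Path(metadata_path).read_text())
    by_treatment = defaultdict(list)
    for sample in samples:
        by_treatment[sample["treatment"]].append(f"aligned/{sample['id']}.bam")
    # One merged target per treatment; its inputs are only known now.
    return {f"merged/{t}.bam": sorted(deps) for t, deps in by_treatment.items()}
```

The point is only the ordering: `build_dag` cannot run until `fetch_metadata` has finished, so the graph that the second phase hands to the workflow tool is fully known by the time that tool starts.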
I also use a subworkflow in this workflow, but for a different purpose (the subworkflow is also on Github: https://github.com/DarwinAwardWinner/hg38-ref). But subworkflow rules are still resolved as part of the same DAG, so they have the same issue. Hence the need for a separate pre-workflow outside the normal framework of Snakemake.
By the way, I guess I didn't add a comment explaining this, but the reason for using the processify decorator is that the snakemake API is not re-entrant, so calling `snakemake` from within a Snakefile normally breaks things. The solution is to call it in a separate process.
Me too, for better or for worse.