	by rtpg 3447 days ago
	I've been thinking about this space a lot too, would you mind listing out some of the messier use cases that you have?

2 comments

mtrn 3447 days ago

> I've been thinking about this space a lot

Me too, for better or for worse.

As for the issues, there are many. Just quickly a few:

* Data provider has an FTP server, most files are automatically generated, some are hand-named (with inconsistencies). How do you handle (without a lot of effort) a list of exceptions along with the regular files?

* Data provider has a good strict XML schema, but the relevant information for a single item is spread across three files, inside a tar archive. Since there are 500k files inside the archive, you'd rather not extract it, but process it on the fly.

* Data provider chooses layout that saves every item in a single XML file, inside 2-3 levels of directories. There are 20M of them. Unzipping the archive alone takes more than a day with default system settings and the usual tools. How do you process these things fast?
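One way to process a tar archive on the fly, as a minimal sketch: read it in streaming mode and group the parts of each item in memory as they go by. The naming scheme (`<item_id>.<part>.xml`), the function name `group_items_from_tar`, and the demo archive are assumptions made up for illustration; a real provider's layout will differ.

```python
import io
import tarfile
from collections import defaultdict


def group_items_from_tar(fileobj, parts_per_item=3):
    """Yield (item_id, {member_name: bytes}) once all parts of an item were seen."""
    pending = defaultdict(dict)
    # "r|*" reads the archive as a forward-only stream; nothing is written to disk.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue
            data = archive.extractfile(member).read()
            # Assumed naming scheme: "<dir>/<item_id>.<part>.xml"
            item_id = member.name.rsplit("/", 1)[-1].split(".", 1)[0]
            pending[item_id][member.name] = data
            if len(pending[item_id]) == parts_per_item:
                yield item_id, pending.pop(item_id)
    if pending:
        # Some items never received all of their parts.
        raise ValueError(f"incomplete items: {sorted(pending)}")


def build_demo_archive():
    """Build a small gzip-compressed tar in memory with interleaved parts."""
    names = [
        "d/item1.meta.xml", "d/item2.meta.xml", "d/item1.body.xml",
        "d/item1.refs.xml", "d/item2.body.xml", "d/item2.refs.xml",
    ]
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in names:
            payload = f"<part name='{name}'/>".encode()
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


grouped = list(group_items_from_tar(build_demo_archive()))
```

Memory use depends on how far apart the parts of one item sit in the archive; if they are adjacent, only a handful of items are pending at any time.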

There are more subtle issues as well:

* U+FFFD (the Unicode replacement character) regularly occurs in natural language strings. Can you correct these strings?

* File has a .csv ending and looks like CSV at first glance, but all the standard RFC-compliant parsers choke on it.

* XML file with elements that have RTF tags embedded in them. You need to parse the RTF in the elements, because there is relevant information there that you need to add to the transformed version.

* Date issues. Inconsistent formats and almost-valid dates.

* Combine data, coming from an API with data fetched from ten different servers to produce a transformed version with a legacy command line application (that might be slow, so you have to split your data first and parallelize the work, combine it and make sure it's complete).
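For the date issue in the list above, one common approach is to try a fixed list of candidate formats and treat anything that fails all of them as invalid. This is only a sketch: the formats in `CANDIDATE_FORMATS` and the sample values are assumptions, and ambiguous inputs (day/month order, two-digit years) need a per-source decision that no generic helper can make.

```python
from datetime import date, datetime

# Assumed set of formats seen in the data; extend per data source.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%Y/%m/%d", "%Y%m%d")


def parse_date(raw):
    """Return a date for the first matching format, or None if none matches."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            # Wrong format or an impossible date such as February 30.
            continue
    return None


samples = ["2016-02-29", "31.12.2015", "20160101", "2016-02-30", "n/a"]
parsed = {raw: parse_date(raw) for raw in samples}
```

Returning `None` instead of raising keeps the almost-valid dates visible, so they can be counted and reported rather than silently dropped.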

I am thinking about a longer article or even a short book about these kinds of data handling and quality questions and what ways there are to address them. Would you read a book like this and what topic would be the most pressing or relevant?

DenisM 3447 days ago

That kind of book would be a great service to humanity. I don't know if you will sell many, but anyone inventing a new ETL tool would be served well by reading it. Perhaps a paper for a journal like ACM would be a better format. Or you could make it into a wiki. Or an "ETL Nightmares monthly" newsletter, with best user submissions.

voltagex_ 3447 days ago

This is what an "ETL" (Extract-Transform-Load) tool is for. Something like FME Server [1] would handle the first two points and the last point well.

For unzipping something that crazy, I'm interested in your solution - I think I'd have to write a custom zip library and use a RAMdisk or similar.

1: https://www.safe.com/fme/fme-server/

mtrn 3447 days ago

Yes, that's ETL. Classic ETL dealt with databases; the modern variant has relaxed this constraint.

As for the zip: We simply "unzip -p" and stream process it carefully (with a custom program reading XML and transforming it). Cuts processing time from hours (extracting the zip and creating all directories, then visiting each file) to minutes (read from a single file).
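The same idea (read members straight out of the archive, never create the directory tree) can be sketched with Python's standard `zipfile` module. This is not the `unzip -p` pipeline or the custom program described above, just an illustration of the technique; the function name `iter_xml_records`, the `<item><title>` layout, and the demo archive are assumptions.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def iter_xml_records(zip_source):
    """Yield (member_name, root_element) for each XML file, without extracting."""
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.endswith(".xml"):
                continue
            # archive.open() decompresses the member as a stream in memory.
            with archive.open(info) as member:
                root = ET.parse(member).getroot()
            yield info.filename, root


# Demo: a tiny in-memory archive with nested directories (made-up content).
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("a/b/1.xml", "<item><title>One</title></item>")
    archive.writestr("a/c/2.xml", "<item><title>Two</title></item>")
    archive.writestr("a/readme.txt", "not xml")
demo.seek(0)

titles = {name: root.findtext("title") for name, root in iter_xml_records(demo)}
```

Most of the saving comes from skipping the creation of millions of small files and directories; the decompression work itself stays the same.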

rcthompson 3447 days ago

Here's one example where I had to use a kind of ugly hack to make it work with Snakemake, a Python Makefile-style "DAG-of-rules" workflow tool: https://github.com/DarwinAwardWinner/CD4-csaw

Basically, I need to first fetch the metadata on all the samples, and then later group them by treatment based on that metadata. In other words, the structure of later parts of the DAG depends on the results of executing earlier parts of the DAG, so the full structure of the DAG is not known initially. The solution I used was to split the workflow in two: a "pre-workflow workflow" that fetches the sample metadata and then the main workflow which reads the metadata and builds the DAG based on it. See here: https://github.com/DarwinAwardWinner/CD4-csaw/blob/master/Sn...

This is a common pattern that I see when putting together bioinformatics workflows: the full DAG of actions to execute cannot be known until part of the way through executing that DAG. Most workflow tools can't handle this gracefully. Another Python DAG-executor, called doit, can handle this case, by specifying that some rules should not be evaluated until after others have finished running. But it doesn't have some features that I wanted from Snakemake (e.g. compute cluster execution), so I ended up with the above solution instead.
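The two-phase pattern can be illustrated without any workflow tool. The sketch below uses only the standard library (`graphlib`, Python 3.9+) and does not use Snakemake or doit; the sample names, treatments, and step names (`align`, `merge`, `report`) are invented for illustration. Phase one produces the metadata, and only then can phase two construct the DAG.

```python
from collections import defaultdict
from graphlib import TopologicalSorter


def fetch_metadata():
    """Phase 1 (the "pre-workflow"): hypothetical stand-in for a metadata fetch."""
    return {"s1": "treated", "s2": "control", "s3": "treated"}


def build_dag(metadata):
    """Phase 2: the DAG's shape depends on the metadata from phase 1."""
    groups = defaultdict(list)
    for sample, treatment in sorted(metadata.items()):
        groups[treatment].append(sample)

    dag = {}  # maps each step to the set of steps it depends on
    for sample in metadata:
        dag[f"align:{sample}"] = set()
    for treatment, samples in groups.items():
        dag[f"merge:{treatment}"] = {f"align:{s}" for s in samples}
    dag["report"] = {f"merge:{t}" for t in groups}
    return dag


metadata = fetch_metadata()
dag = build_dag(metadata)
# static_order() lists every step after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
```

The number of `merge:*` steps and their inputs are unknown until `fetch_metadata()` has run, which is exactly what a tool that resolves the whole DAG up front cannot express.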

elsherbini 3447 days ago

I use snakemake quite a bit, and it was cool to scan through your Snakefile and learn some things. The processify decorator looks really useful[0].

It's possible that you could use snakemake subworkflows [1] for this issue of "pre-workflow" workflows.

[0] https://github.com/DarwinAwardWinner/CD4-csaw/blob/master/pr...

[1] https://bitbucket.org/snakemake/snakemake/wiki/Documentation...

rcthompson 3446 days ago

I also use a subworkflow in this workflow, but for a different purpose (the subworkflow is also on Github: https://github.com/DarwinAwardWinner/hg38-ref). But subworkflow rules are still resolved as part of the same DAG, so they have the same issue. Hence the need for a separate pre-workflow outside the normal framework of Snakemake.

By the way, I guess I didn't add a comment explaining this, but the reason for using the processify decorator is that the snakemake API is not re-entrant, so calling `snakemake` from within a Snakefile normally breaks things. The solution is to call it in a separate process.
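The general idea of isolating a non-re-entrant call in its own process can be sketched as follows. This is not the processify decorator from the repository and it does not call Snakemake; it launches a fresh interpreter with `subprocess` and exchanges JSON over stdin/stdout. The helper name `run_isolated` and the child script are assumptions for illustration.

```python
import json
import os
import subprocess
import sys


def run_isolated(code, payload):
    """Run `code` in a fresh Python interpreter and return its JSON output."""
    completed = subprocess.run(
        [sys.executable, "-c", code],
        input=json.dumps(payload),  # sent to the child's stdin
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child fails
    )
    return json.loads(completed.stdout)


# Hypothetical child script: stands in for the call that must not share
# global state with the parent process.
CHILD_CODE = """
import json, os, sys
payload = json.load(sys.stdin)
json.dump({"pid": os.getpid(), "total": sum(payload["values"])}, sys.stdout)
"""

parent_pid = os.getpid()
result = run_isolated(CHILD_CODE, {"values": [1, 2, 3]})
```

Because the child starts from a clean interpreter, any module-level state it mutates disappears when it exits, at the cost of serializing inputs and outputs.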
