Hacker News
by roel_v 2827 days ago
None of these frameworks (there are many) seem to have support for repeating a certain target multiple times, with different arguments. For example, say you have a data set with per-country data; how do you repeat the same analysis on each country? This simple example is easy with a loop, but when you have multiple dimensions like this, you want to call each target with every combination of dimension values, restricted to the dimensions that are actually relevant for that target. Does any ETL framework support that?

(I was actually writing a spec this afternoon for a new tool that does just this, because I can't find anything suitable.)

2 comments

snakemake does this trivially:

    rule analyze_country:
        input: 'whatever.{country}.txt'
        output: 'analysis.{country}.txt'
        shell:
            'run-analysis-on-country {input} {output} --country=country'

    rule analyze_target_countries:
        input: ['analysis.usa.txt', 'analysis.canada.txt', 'analysis.mexico.txt']
Small change: you have to use wildcards.country inside the shell call:

    rule analyze_country:
        input: 'whatever.{country}.txt'
        output: 'analysis.{country}.txt'
        shell:
            'run-analysis-on-country {input} {output} --country={wildcards.country}'
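
For the multi-dimensional case in the original question, the list of targets is essentially a Cartesian product over whichever dimensions a given target depends on. Here is a minimal sketch of that idea in plain Python. The second dimension `year`, the helper name `expand_targets`, and the value lists are all hypothetical, made up for illustration and not taken from the thread. Snakemake's own `expand()` helper builds input lists in much the same way.

```python
from itertools import product


def expand_targets(pattern, **dimensions):
    """Fill `pattern` with every combination of the given dimension values.

    Hypothetical helper for illustration; not part of any framework.
    """
    names = list(dimensions)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(dimensions[name] for name in names))
    ]


# Hypothetical dimension values, for illustration only.
COUNTRIES = ["usa", "canada", "mexico"]
YEARS = [2016, 2017]

# A target that depends on both dimensions expands to 3 * 2 = 6 files.
per_country_year = expand_targets(
    "analysis.{country}.{year}.txt", country=COUNTRIES, year=YEARS
)

# A target that only depends on country expands to 3 files.
per_country = expand_targets("analysis.{country}.txt", country=COUNTRIES)
```

Each target names only the dimensions it cares about, so the same helper covers both the per-country and the per-country-per-year case.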
Sounds like a group_by and then applying a function per group? Pypeline doesn't have grouping, but it sounds like Spark or Dask should do the job.
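
To make the group-then-apply idea concrete, here is a rough sketch in plain Python rather than in Spark or Dask. The rows, the field names, and the `analyze` and `group_apply` functions are all made-up placeholders, so treat it as an illustration of the pattern and not as either library's API.

```python
from collections import defaultdict


def analyze(rows):
    """Placeholder analysis: the mean of the 'value' field."""
    return sum(row["value"] for row in rows) / len(rows)


def group_apply(rows, keys, func):
    """Group rows by the given key fields, then run func once per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return {key: func(members) for key, members in groups.items()}


# Toy rows invented for illustration; not real data.
ROWS = [
    {"country": "usa", "year": 2016, "value": 1.0},
    {"country": "usa", "year": 2017, "value": 3.0},
    {"country": "canada", "year": 2016, "value": 5.0},
]

# Same analysis, grouped on one dimension or on two.
by_country = group_apply(ROWS, ["country"], analyze)
by_country_year = group_apply(ROWS, ["country", "year"], analyze)
```

Choosing which keys to group on is what selects the dimensions relevant to a given analysis.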