I'm currently working on how to speed up our analytics report development workflow. Imagine you have a table called A with this structure:

+----------+--------------+---------+----------+
| location | order_count  | gmv     | net_gmv  |
+----------+--------------+---------+----------+
| TX       | 1000         | 9000.0  | 8000.0   |
| FL       | 1000         | 9000.0  | 8000.0   |
+----------+--------------+---------+----------+

Then you want another table called B with this structure:

+-------+--------------+---------+----------+
| age   | order_count  | gmv     | net_gmv  |
+-------+--------------+---------+----------+
| 20-30 | 1000         | 9000.0  | 8000.0   |
| 30-40 | 1000         | 9000.0  | 8000.0   |
| 40-50 | 1000         | 9000.0  | 8000.0   |
+-------+--------------+---------+----------+

The location and age columns are the dimensions needed for the report; eventually we'll need other dimensions as well. Right now we develop a separate Spark-SQL job for each table. We don't think this will scale, because every time we want to add a new dimension we have to develop another Spark-SQL job (same logic, but a different group-by dimension).

So I'm wondering whether there's a better way to do this. Does anyone have experience with this kind of problem? Any pointers on how to do this efficiently? (I'm thinking someone could just specify the dimension they need, and a script would automatically generate the new table based on it.)

Thanks
An example of output that fits the paradigm you describe, but to a much greater degree, would be the dozens of tables shown in the securities offering described at https://www.sec.gov/Archives/edgar/data/0001561167/000114420... (search for "Stated Principal Balances of the Mortgage Loans as of the Cut-off Date" on page A-3).
How do you generate reports like this in a manner that is flexible for end users without requiring IT in the middle? You start with the bare inputs, which are the data plus the report logic:
1. Specify the common columns that you want in your output tables (i.e. the columns other than the first). In your example, that would be order_count, gmv, and net_gmv.
2. Separately, specify the tables that you want to generate, where each table spec consists of:
3. Third, run your data, plus the above spec, through some software that will generate your report for you.

As for part 3, my company has recently launched a free platform for doing all of the above in a collaborative and secure manner. Please reach out if you'd like more info on this. Of course, you can do it yourself or have your IT do it, but be aware that it is not as easy as it sounds once you have to deal with real-world practicalities like schema variability and scalability. And anyway, why bother if you can now do it all for free?
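If you do want to try the do-it-yourself route, steps 1 to 3 above might be sketched roughly like this in Python. It is only a minimal illustration under several assumptions that are not in the thread: the source table name `orders`, the source column `order_id`, the aggregate expressions, and the `TableSpec` fields are all hypothetical, and the `age` column is assumed to be bucketed already. The idea is that the common metrics are declared once, each output table is just a name plus a dimension, and one function generates the Spark SQL text for each.

```python
import re
from dataclasses import dataclass

# Plain SQL identifiers only, so a spec cannot smuggle arbitrary SQL into the query.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

# Step 1: the common columns, declared once.
# Hypothetical mapping: output column -> aggregate expression over assumed source columns.
COMMON_METRICS = {
    "order_count": "COUNT(DISTINCT order_id)",
    "gmv": "SUM(gmv)",
    "net_gmv": "SUM(net_gmv)",
}


# Step 2: one spec per output table (the fields here are an assumption).
@dataclass(frozen=True)
class TableSpec:
    name: str       # output table name, e.g. "A"
    dimension: str  # column to group by, e.g. "location"


def build_query(spec, source_table="orders", metrics=COMMON_METRICS):
    """Return the SELECT statement that aggregates the metrics by spec.dimension."""
    for ident in (spec.name, spec.dimension, source_table, *metrics):
        if not _IDENTIFIER.match(ident):
            raise ValueError(f"not a plain identifier: {ident!r}")
    metric_lines = [f"  {expr} AS {alias}" for alias, expr in metrics.items()]
    select_list = ",\n".join([f"  {spec.dimension}", *metric_lines])
    return f"SELECT\n{select_list}\nFROM {source_table}\nGROUP BY {spec.dimension}"


# Step 3: generate one query per spec. Adding a dimension means adding one line here.
TABLE_SPECS = [TableSpec("A", "location"), TableSpec("B", "age")]
QUERIES = {spec.name: build_query(spec) for spec in TABLE_SPECS}

if __name__ == "__main__":
    for name, query in QUERIES.items():
        print(f"-- table {name}\n{query}\n")
```

In a real job, each generated string could be passed to `spark.sql(...)` and the result written out with something like `.write.mode("overwrite").saveAsTable(name)`. This sketch only covers the simplest case, where each dimension is an existing column; derived dimensions such as age buckets computed on the fly, or schema differences between sources, would need more than this.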