Hacker News new | ask | show | jobs
by mattewong 1476 days ago
There is a better way to do this.

An example of output that fits this paradigm you describe-- but to a much further degree-- would be the dozens of tables shown in the securities offering described at https://www.sec.gov/Archives/edgar/data/0001561167/000114420... search for "Stated Principal Balances of the Mortgage Loans as of the Cut-off Date" (on page A-3).

How do you generate reports like this in a manner that is flexible for end users without requiring IT in the middle? You start with the bare input, which is: data + report logic:

1. Specify the common columns that you want in your output tables (i.e. columns other than the first). In your example, that would be order_count, gmv, net_gmv

2. Separately, specify the tables that you want to generate, where each table spec consists of:

  - the bucket logic (in your example, that would be a formula representing "age broken out in buckets of 10")

  - optional other characteristics such as whether the data should be filtered before going into the table, which column the table should be sorted on, whether to limit the table output to the "top [10]" rows, etc etc
3. Third, run your data, plus the above spec, through some software that will generate your report for you

As for part 3, my company has recently launched a free platform for doing all of the above in a collaborative and secure manner. Please reach out if you'd like more info on this. Of course, you can do it yourself or have your IT do it-- but be aware it is not as easy as it sounds when you start having to deal with real-world practicalities like schema variability and scalability. And anyway, why bother if you can now do it all for free?

1 comments

Thanks for the SEC reference!

Regarding the 1st point, let say I already have the process implemented in place. If I want to add new common column, is there a better way to easily add new common metrics/dimension without doing backfill?

Sorry for the delay, didn't see your reply so thought the thread was dead. Whether you want to add a new column to the input data, or add a new column to the report, I'm not seeing any need for backfilling either way.
If I only have order_count, net_gmv and gmv and then 1 month later I want to add avg_user_session, then what I understand is we need to backfill the `session_time` to the 'main' table and do the "avg(session_time) group by user_id" to the final table.

Or are you saying that we can just create another similar table with only avg_user_session and then join by the dimension key?