by ealready_value 7 days ago
The source form is the production database, which is what the current reports pull from. The canonical form is the one that, in theory, all of the verticals get rolled into, but many of the nuances our customers are used to end up replaced with something similar but not quite the same. Right now my biggest concern is that customers won't get the data they need because of this canonical form.

We're talking about a few hundred megabytes of data across all of the customers these reports pull for, and that covers the past 15 years. We have about 25k customers, which shrinks how much any one customer can pull in even further. One last point: we already denormalize the report data into its own table specifically for these reports, so that's not something the data warehouse is doing for us.
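
For a rough sense of scale, here is a back-of-envelope sketch of the per-customer volume. The 300 MB total is only an assumed stand-in for "a few hundred megabytes", and it assumes data is spread evenly across customers, which it almost certainly isn't; the customer count and time span come from the numbers above.

```python
# Back-of-envelope estimate of report data per customer.
# ASSUMPTION: 300 MB stands in for "a few hundred megabytes".
# ASSUMPTION: data is spread evenly across customers and years.
TOTAL_BYTES = 300 * 1024 * 1024  # assumed total size of the report data
CUSTOMERS = 25_000               # roughly 25k customers
YEARS = 15                       # data covers the past 15 years


def per_customer_bytes(total_bytes: int, customers: int) -> float:
    """Average bytes of report data per customer over the whole period."""
    return total_bytes / customers


def per_customer_per_year_bytes(total_bytes: int, customers: int, years: int) -> float:
    """Average bytes of report data per customer per year."""
    return total_bytes / customers / years


if __name__ == "__main__":
    total = per_customer_bytes(TOTAL_BYTES, CUSTOMERS)
    yearly = per_customer_per_year_bytes(TOTAL_BYTES, CUSTOMERS, YEARS)
    print(f"per customer, all 15 years: {total / 1024:.1f} KiB")
    print(f"per customer, per year:     {yearly:.0f} bytes")
```

Under those assumptions the average customer has on the order of 12 KiB of report data in total, i.e. under a kilobyte per year.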

I agree with your take on QuickSight; it matches my experience exactly. My preference is to continue using the reports we generate in the app, but I'm trying to wrap my head around cases where this ends up being the better direction.

1 comments

What was the point of creating the "canonical form" if you already had reports being generated in-app? Was it just someone's pet project, or were there supposed to be other benefits?
I've not gotten a straight answer. I assume it's a pet-project kind of situation, or an attempt to justify the data warehouse project as a whole, but I really don't know what the real driver is.
These sorts of odd projects are relatively common. A few years ago I was brought on near the end of a data engineering project where somebody had decided they needed multiple databases, a crap load of JSON exports, and dozens of Python, R, and shell scripts running inside some job orchestrator to support what amounted to a few megabytes of data being processed each day. Maybe 5 megabytes max.

There wasn't even a lot of transformation going on. It was just... strange. I witnessed some true eldritch horrors, like Python calling R calling a shell script that called the mysql client, which wrote data to a temporary file that was eventually read by the great-grandparent Python script.