by vinay_ys 1745 days ago

Very nice write-up. For the story from 2010 to now, I would add the emergence of MPP columnar processing systems like Vertica and in-memory distributed systems like MemSQL to the narrative. Of course, Kylin, ClickHouse, etc. are great open-source contenders (although when I looked into them ~5-6 years ago, they were not mature enough).

In my experience, people often underestimate the continuous effort needed to maintain Kimball's "Enterprise Data Warehouse Bus Architecture" diagram, even with more powerful machines and modern distributed tooling.

In today's fast-evolving world of Internet apps, data use cases and scenarios change very quickly, which brings its own set of challenges.

Good, usable tools for managing the lifecycle of entity and event definitions, their variants (emitted/logged vs. cleaned/processed/synthesized), and their data quality checks, and for making all of this easily discoverable and understandable by everyone in the org, are crucial and significantly under-appreciated.
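To make the idea concrete, here is a minimal sketch of what a central registry of event definitions could look like. Everything in it (`EventDefinition`, `Registry`, the `checkout_completed` event, and its fields) is a hypothetical illustration, not taken from any real tool; it only shows definitions, variants, quality checks, and discoverability living in one place.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# All names below are hypothetical and exist only for illustration.
Record = Dict[str, Any]


@dataclass
class EventDefinition:
    name: str
    variant: str                # e.g. "logged" (raw) or "cleaned" (processed)
    owner: str                  # team to ask about this event
    description: str            # what makes the event understandable to others
    fields: Dict[str, type]     # expected field name -> expected Python type
    checks: Tuple[Callable[[Record], bool], ...] = ()

    def validate(self, record: Record) -> List[str]:
        """Return a list of problems; an empty list means the record passes."""
        problems = []
        for field_name, field_type in self.fields.items():
            if field_name not in record:
                problems.append(f"missing field: {field_name}")
            elif not isinstance(record[field_name], field_type):
                problems.append(f"wrong type for field: {field_name}")
        # Run the quality checks only on structurally valid records.
        if not problems:
            for check in self.checks:
                if not check(record):
                    problems.append(f"failed check: {check.__name__}")
        return problems


class Registry:
    """In-memory catalog keyed by (event name, variant)."""

    def __init__(self) -> None:
        self._definitions: Dict[Tuple[str, str], EventDefinition] = {}

    def register(self, definition: EventDefinition) -> None:
        self._definitions[(definition.name, definition.variant)] = definition

    def get(self, name: str, variant: str) -> EventDefinition:
        return self._definitions[(name, variant)]

    def search(self, term: str) -> List[EventDefinition]:
        """Case-insensitive match on name or description, for discoverability."""
        term = term.lower()
        return [
            d for d in self._definitions.values()
            if term in d.name.lower() or term in d.description.lower()
        ]


def amount_is_positive(record: Record) -> bool:
    return record["amount"] > 0


registry = Registry()
registry.register(EventDefinition(
    name="checkout_completed",
    variant="cleaned",
    owner="payments-team",
    description="One row per successful checkout, after deduplication.",
    fields={"order_id": str, "amount": float},
    checks=(amount_is_positive,),
))

definition = registry.get("checkout_completed", "cleaned")
print(definition.validate({"order_id": "A1", "amount": 12.5}))
print(definition.validate({"order_id": "A1", "amount": -1.0}))
print([d.name for d in registry.search("checkout")])
```

A real system would persist the definitions and version them; the point of the sketch is only that producers and consumers look things up in one shared place instead of each keeping their own notes.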

Usually, the strong systems engineers in charge of the data platform focus on building the data infra (job scheduling, data pipeline execution, storage, etc.), but the crucial work of defining the data dictionaries, event or entity models, etc. is left out. The data producers and data consumers, who are spread throughout the organization, have to muddle through this on their own without any centralized tooling to support them. This makes data use very difficult and siloed.

Usually, there is a team of BI analysts tasked with getting answers out of the data for questions asked by various data users. Funnily, these analysts also work in silos, each assigned to a different set of data users. Inevitably, they become a super-inefficient intermediary between the data users and the data insights.

The pre-cooked data insights are presented in spreadsheets and slides in review meetings, where the analysts have already prepared a narrative.

This robs data users of the opportunity to explore the data and ask questions on their own in a fast iteration cycle, which is how they would improve their intuition and understanding of their product/business environment.

IMO, these challenges remain largely unsolved to this day across organizations of all sizes and scales.