| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by riordan 1078 days ago

> In this context-- the section in article where it says present data is of virtually zero importance to analytics is no longer true. We need a real solution even if we apply those (presumably complex and costly) solutions to only the most deserving use cases (and not abuse them).

Totally agreed, though where real-time data is being put through an analytics lens is where CDW's start to creak and get costly. In my experience, these real-time uses shift the burden from being about human-decision-makers to automated decision-making and it becomes more a part of the product. And that's cool, but it gets costly, fast.

It also makes perfect sense to fake-it-til-you-make-it for real-time use cases on an existing Cloud Data Warehouse/dbt style _modern data stack_ if your data team's already using it for the rest of their data platform; after all they already know it and it's allowed that team to scale.

But a huge part of the challenge is that once you've made it, the alternative for a data-intensive use case is a bespoke microservice or a streaming pipeline, often in a language or on a platform that's foreign to the existing data team who's built the thing. If most of your code is dbt sql and airflow jobs, working with Kafka and streaming spark is pretty foreign (not to mention entirely outside of the observability infrastructure your team already has in place). Now we've got rewrites across languages/platforms, and leave teams with the cognitive overhead of multiple architectures & toolchains (and split focus). The alternative would be having a separate team to hand off real-time systems to and only that's if the company can afford to have that many engineers. Might as well just allocate that spend to your cloud budget and let the existing data team run up a crazy bill on Snowflake or BigQuery as long as it's less than the cost of a new engineering team.

------

There's something incredible about the ruthless efficiency of sql data platforms that allows data teams to scale the number of components/engineer. Once you have a Modern-Data-Stack system in place, the marginal cost of new pipelines or transformations is negligible (and they build atop one another). That platform-enabled compounding effect doesn't really occur with data-intensive microservices/streaming pipelines and means only the biggest business-critical applications (or skunk works shadow projects) will get the data-intensive applications[1] treatment, and business stakeholders will be hesitant to greenlight it.

I think Materialize is trying to build that Modern-Data-Stack type platform for real-time use cases: one that doesn't come with the cognitive cost of a completely separate architecture or the divide of completely separate teams and tools. If I already had a go-to system in place for streaming data that could be prototyped with the data warehouse, then shifted over to a streaming platform, the same teams could manage it and we'd actually get that cumulative compounding effect. Not to mention it becomes a lot easier to then justify using a real-time application the next time.

[1]: https://martin.kleppmann.com/2014/10/16/real-time-data-produ...