by slap_shot 2306 days ago
> We believe that streaming architectures are the only ones that can produce this ideal data infrastructure.

I just want to say this is a very dangerous assumption to make.

I run a company that helps our customers consolidate and transform data from virtually anywhere in their data warehouses. When we first started, the engineer in me made the same declaration, and I worked to get data into warehouses seconds after an event or record was generated in an origin system (website, app, database, Salesforce, etc).

What I quickly learned was that analysts and data scientists simply didn't want or need this. Refreshing the data every five minutes in batches was more than sufficient.

Secondly, almost all data is useless in its raw form. The analysts had to perform ELT jobs on their data in the warehouse to clean, dedupe, aggregate, and project their business rules on that data. These functions often require the database to scan over historical data to produce the new materializations of that data. So even if we could get the data in the warehouse in sub-minute latency, the jobs to transform that data ran every 5 minutes.

To be clear, I don't discount the need for telemetry and _some_ data to be actionable in a smaller time frame, I'm just wary of a data warehouse fulfilling that obligation.

In any event, I do think this direction is the future (an overwhelming number of data sources allow change data capture almost immediately after an event occurs), I just don't think it's the only architecture that can satisfy most analysts'/data scientists' needs today.

I would love to hear the use cases that your customers have that made Materialize a good fit!


I architected and implemented a true-realtime telemetry pipeline. The requirement was subsecond per-user aggregation and round-trip notification of thresholds exceeded. Took us a couple years, but when Halo 5 launched, we handled 2.5B events/hour without breaking a sweat (AMQP over Websockets). It's since been rolled out to multiple Microsoft 1st-party games.

The round-trip requirement was dropped before we launched, reducing the usage of the technology stack to pure telemetry gathering.

The analysts are all perfectly happy with 5-10 minute delays.

Link to my GDC talk, in case people are interested: https://www.youtube.com/watch?v=o098roxWAkA

> I just want to say this is a very dangerous assumption to make.

I think we're actually arguing the same points here. It's not that every use case needs single-digit millisecond latencies! There are plenty of use cases that are satisfied by batch jobs running every hour or every night.

But when you do need real-time processing, the current infrastructure is insufficient. When you do need single-digit latency, running your batch jobs every second, or every millisecond, is computationally infeasible. What you need is a reactive, streaming infrastructure that's as powerful as your existing batch infrastructure. Existing streaming infrastructure requires you to make tradeoffs on consistency, computational expressiveness, or both; we're rapidly evolving Materialize so that you don't need to compromise on either point.

And once you have a streaming data warehouse in place for the use cases that really demand the single-digit latencies, you might as well plug your analysts and data scientists into that same warehouse, so you're not maintaining two separate data warehouses. That's what we mean by ideal: not only does it work for the systems with real-time requirements, but it works just as well for the humans with looser requirements.

To give you an example, let me respond to this point directly:

> Secondly, almost all data is useless in its raw form. The analysts had to perform ELT jobs on their data in the warehouse to clean, dedupe, aggregate, and project their business rules on that data. These functions often require the database to scan over historical data to produce the new materializations of that data. So even if we could get the data in the warehouse in sub-minute latency, the jobs to transform that data ran every 5 minutes.

The idea is that you would have your analysts write these ETL pipelines directly in Materialize. If you can express the cleaning/de-duplication/aggregation/projection in SQL, Materialize can incrementally maintain it for you. I'm familiar with a fair few ETL pipelines that are just SQL, though there are some transformations that are awkward to express in SQL. Down the road we might expose something closer to the raw differential dataflow API [0] for power users.

[0]: https://github.com/TimelyDataflow/differential-dataflow
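To make "incrementally maintain" concrete, here is a minimal Python sketch of the idea, not Materialize's or differential dataflow's actual implementation. It keeps a deduplicated per-category count up to date from a stream of inserts and retractions, touching only the rows that changed instead of rescanning history. The class name `IncrementalDedupCount` and the `(event_id, category, diff)` update shape are assumptions made for illustration.

```python
from collections import defaultdict


class IncrementalDedupCount:
    """Hypothetical sketch: maintains the equivalent of
    SELECT category, COUNT(DISTINCT event_id) ... GROUP BY category
    from a stream of changes, without rescanning historical rows."""

    def __init__(self):
        # How many copies of each raw (event_id, category) row exist so far.
        self.multiplicity = defaultdict(int)
        # The maintained output: distinct events per category.
        self.counts = defaultdict(int)

    def update(self, event_id, category, diff):
        """Apply one change: diff=+1 inserts a raw row, diff=-1 retracts one."""
        key = (event_id, category)
        before = self.multiplicity[key]
        after = before + diff
        if after < 0:
            raise ValueError("retraction of a row that was never inserted")
        self.multiplicity[key] = after
        # The deduplicated view changes only when a row appears or disappears.
        if before == 0 and after > 0:
            self.counts[category] += 1
        elif before > 0 and after == 0:
            self.counts[category] -= 1
            if self.counts[category] == 0:
                del self.counts[category]
        if after == 0:
            del self.multiplicity[key]


view = IncrementalDedupCount()
changes = [
    ("e1", "click", +1),
    ("e1", "click", +1),  # duplicate delivery of the same event
    ("e2", "click", +1),
    ("e3", "view", +1),
    ("e1", "click", -1),  # one copy retracted, one copy still present
]
for change in changes:
    view.update(*change)
result = dict(view.counts)
```

Each update costs O(1) regardless of how much history has accumulated, which is the property that makes sub-second refresh plausible where a full batch recomputation would not be.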

I think what's really unique here, and what people aren't imagining, is the range of new applications made possible by <100ms updates on complex materialized views.

With sufficiently expressive SQL and UDF support there are whole classes of stateful services that are performing lookups, aggregations, etc, that could be written as just views on streams of data. Experts who model systems in SQL, but aren't experts in writing distributed stateful streaming services would basically be able to start deploying services.

Are there any plans to support partitioned window functions, particularly lag(), lead(), first(), last() OVER ()? That would be remarkably powerful.

I agree wholeheartedly with your take!

Window functions are a particular favorite of mine, but we haven’t seen much customer demand for them yet, so they haven’t been officially scheduled on the roadmap. They require some finesse to support in a streaming system, as you have to reconstruct the potentially large window whenever you receive new data. Probably some interesting research to be done here, or at least some interesting blog posts from Frank.

Please feel free to file issues about any of these functions that you’d like to see support for! We especially love seeing sample queries from real pipelines.
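A rough Python sketch of why window functions need care in a streaming setting, under the assumption of a naive design that keeps every partition sorted in memory (this is not a description of how Materialize does or would implement them). The class `StreamingLag` is hypothetical. For `lag()`, a late-arriving row changes its own output and that of the row that now follows it; for functions such as a cumulative sum, every later row in the partition could change, which is where the cost comes from.

```python
import bisect
from collections import defaultdict


class StreamingLag:
    """Hypothetical sketch: maintains lag(value) OVER (PARTITION BY key ORDER BY ts)
    as rows are inserted in arbitrary timestamp order."""

    def __init__(self):
        # key -> list of (ts, value) kept sorted by timestamp.
        self.rows = defaultdict(list)

    def insert(self, key, ts, value):
        """Insert a row and return the output rows that changed,
        as (ts, value, lag_value) tuples."""
        part = self.rows[key]
        i = bisect.bisect_left(part, (ts, value))
        part.insert(i, (ts, value))
        previous = part[i - 1][1] if i > 0 else None
        changed = [(ts, value, previous)]
        # A late row also changes the lag of the row that now follows it.
        if i + 1 < len(part):
            next_ts, next_value = part[i + 1]
            changed.append((next_ts, next_value, value))
        return changed


window = StreamingLag()
first = window.insert("u1", 10, 1.0)
second = window.insert("u1", 30, 3.0)
late = window.insert("u1", 20, 2.0)  # arrives out of order
```

Here `late` contains two changed rows: the new row at `ts=20` and a correction for the row at `ts=30`, whose lag value moves from 1.0 to 2.0. The whole partition has to stay available in state to make that correction possible.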

I have a strong suspicion that bitemporalism makes a lot of these problems less problematic. The actual volumes of data are the same, but the all-or-nothingness of windowing over very large data sets in order to avoid missing anything that arrived late goes away.

I wrote shambolic stream-of-consciousness notes on it several years ago: https://docs.google.com/document/d/1ZlPp099_fV1lyYWACSyuWY_j...

The gist being that the mechanisms of windowing, triggering and retraction a la Beam are actually workarounds for a lack of bitemporalism.

We’re thinking along very similar lines! We’ve got some of our thoughts around evolving how timestamps work in Materialize written down here: https://github.com/MaterializeInc/materialize/issues/1309

Sort of -- the problem I see in the event time / processing time distinction is that it's about instants rather than intervals. There are a number of models and queries that are not reliably expressible with instants alone, unless you reinvent intervals with them.

For example, if I rely on "updated-at" and infer that whatever record has the latest updated-at is the "current" record, then I may create the illusion that there are no gaps in my facts. That may not be so.

A reference system to look at is Crux: https://opencrux.com/

This feels like the philosophical conclusion that Kafka Streams has made, i.e. you don't have a strict watermark, and if you really want you can theoretically keep updating and retracting data forever, and build a pipeline that magically stays in sync.

Partially, in my understanding, but not fully. An advantage of bitemporalism that is hard to recreate is queries about past and future states of belief. "What do I believe is true today?" works well with accumulation and reaction and with standard normalised schemata.

"What do I believe I believed yesterday?" is slightly harder and needs additional information to be stored. You can rewind a stream and replay it up to the point of interest, but that can be quite slow.

"What did I believe today would be, last week?", "What is the history of my belief about X?", "have I ever believed Y about X?" etc are much harder to answer quickly without full bitemporalism. So too the problem of having implicit intervals that are untrue, which is where "updated at" can be so misleading.
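The belief questions above can be sketched with a tiny append-only store that tracks two time axes: the interval during which a fact holds in the world, and the instant it was recorded. This is an illustrative Python sketch only; the names `Fact`, `BitemporalStore`, and `believed` are assumptions, and it does not reflect the API of Crux or any other system mentioned in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    valid_from: int   # start of the interval in which the fact holds
    valid_to: int     # exclusive end of that interval
    recorded_at: int  # when we learned it, i.e. started believing it


class BitemporalStore:
    def __init__(self):
        self.facts = []  # append-only: corrections never overwrite history

    def record(self, fact):
        self.facts.append(fact)

    def believed(self, key, valid_at, as_of) -> Optional[str]:
        """What did we believe at time `as_of` about `key` at time `valid_at`?"""
        candidates = [
            f for f in self.facts
            if f.key == key
            and f.recorded_at <= as_of
            and f.valid_from <= valid_at < f.valid_to
        ]
        if not candidates:
            return None  # an explicit gap, not an inferred "current" record
        # The most recently recorded belief wins.
        return max(candidates, key=lambda f: f.recorded_at).value


store = BitemporalStore()
store.record(Fact("price", "10", valid_from=0, valid_to=5, recorded_at=1))
store.record(Fact("price", "12", valid_from=7, valid_to=100, recorded_at=2))
# A later correction to what the price was during [0, 5).
store.record(Fact("price", "11", valid_from=0, valid_to=5, recorded_at=8))

now_about_then = store.believed("price", valid_at=3, as_of=9)
earlier_belief = store.believed("price", valid_at=3, as_of=4)
gap = store.believed("price", valid_at=6, as_of=9)
```

With explicit intervals, the query for `valid_at=6` returns nothing because no fact covers that time, whereas a "latest updated-at wins" scheme would silently report a value. The linear scan is only for brevity; a real store would index both axes.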

I think the average data team at a startup using something like Redshift/BigQuery/Snowflake is using window functions quite extensively when writing analytical queries/building data pipelines, so I'm surprised to hear you haven't seen much customer demand for them.

If you're trying to wholesale replace a "traditional" batch-oriented data warehouse like the ones I mentioned above, I think building support for window functions would be essential.

To be clear, window functions are definitely on our radar! But you’d be surprised how many folks are delighted just to have more basic SQL features, like fully functional joins. Streaming SQL is surprisingly far behind batch SQL.

I could see a use case for rapid automated A/B testing, using ML to react to performance metrics. Why have human editors on cnn.com when you can do story selection in real time based on view traffic?

That said, I hope we as technologists can find some better use cases than just stealing more of people’s attention.

I work in the sports stats industry and we've been talking to Materialize for a while. We have a lot of demand for real time data, especially from the media and gambling sectors. The SLAs are mostly in seconds not milliseconds, so we currently don't have to wring every last ounce of latency out of our pipeline, but I'm very excited about the product because it offers a good balance of performance and developer productivity. I can even imagine a universe in which we directly offer the capability to customers to be able to build their own real-time KPIs in a language that is fairly accessible and easy to hire for.

> Secondly, almost all data is useless in its raw form. The analysts had to perform ELT jobs on their data in the warehouse to clean, dedupe, aggregate, and project their business rules on that data. These functions often require the database to scan over historical data to produce the new materializations of that data.

The point of Materialize, from my understanding, is that you don't put things into the data warehouse and then, as a separate step, run these enrichment/reporting jobs on it.

Instead, you register persistent, stateful enrichment "streaming jobs" (i.e. incrementally-materialized views) into the data warehouse; and then, when data comes into a table upstream of these views, it gets streamed into and through the related job to incrementally populate the matview.

I believe you end up with a consistent MVCC snapshot between the source data and its dependent inc-matviews, where you can never see new source data and old derived data in the same query; sort of as if the inc-matview were being updated by a synchronous ON INSERT trigger on the original raw-data ingest transaction. (Or—closer analogy—sort of like the raw table has a computed-expression index visible as a standalone relation.)

Also, the job itself is written as a regular OLAP query over a fixed-snapshot dataset, as a data-scientist would expect; but this gets compiled by the query planner into a continuous CTE/window query sorta thing, that spits out running totals/averages/etc. as it gets fed incremental row-batches—but only at the granularity that consumers are demanding from it, either by directly querying the inc-matview, or by sourcing it in their own streaming jobs. Which, of course, you do by just using a regular OLAP query on inc-matview X in your definition of inc-matview Y.
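The chaining described above (view Y defined over view X, both fed by changes from upstream) can be sketched in a few lines of Python. This is a single-threaded illustration with hypothetical names (`TotalsView`, `BigSpendersView`, `Pipeline`), not Materialize's planner or runtime, and it assumes a positive threshold so that a user with no rows counts as below it.

```python
from collections import defaultdict


class TotalsView:
    """View X, roughly: SELECT user, SUM(amount) FROM source GROUP BY user."""

    def __init__(self):
        self.totals = defaultdict(int)

    def apply(self, batch):
        """Consume (user, amount) rows; emit (user, old_total, new_total) changes."""
        changes = []
        for user, amount in batch:
            old = self.totals[user]
            self.totals[user] = old + amount
            changes.append((user, old, old + amount))
        return changes


class BigSpendersView:
    """View Y, roughly: SELECT COUNT(*) FROM X WHERE total >= threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.count = 0

    def apply(self, changes):
        for _user, old, new in changes:
            # +1 when a user crosses above the threshold, -1 when they drop below.
            self.count += (new >= self.threshold) - (old >= self.threshold)


class Pipeline:
    def __init__(self, threshold):
        self.x = TotalsView()
        self.y = BigSpendersView(threshold)
        self.version = 0

    def ingest(self, batch):
        # Push the batch through X and then Y before bumping the version,
        # so a reader never pairs new source data with stale derived data.
        self.y.apply(self.x.apply(batch))
        self.version += 1


pipeline = Pipeline(threshold=100)
pipeline.ingest([("ann", 60), ("bob", 120)])
pipeline.ingest([("ann", 50), ("bob", -30)])
big_spenders = pipeline.y.count
```

Y never rescans X: it only sees the changes X emits, which is the sense in which one view is "sourced" by another. After the second batch, ann has crossed the threshold and bob has dropped below it, so the count is still one.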

> I would love to hear the use cases that your customers have that made Materialize a good fit!

We haven't used Materialize yet, but it's a very promising fit for our use-case: high-level financial-instrument analytics on blockchain data. We need to take realtime-ish batches of raw transaction data and extract, in est, arbitrarily-formatted logs from arbitrary program-runs embedded in the transaction data (using heavy OLAP queries); recognize patterns in those that correspond to abstract financial events (more heavy OLAP queries); enrich those pattern-events with things like fiscal unit conversions; and then emit the enriched events as, among other things, realtime notifications, while also ending up with a data warehouse populated with both the high- and low-level events, amenable to arbitrary further OLAP querying by data scientists.

> Instead, you register persistent, stateful enrichment "streaming jobs" (i.e. incrementally-materialized views) into the data warehouse; and then, when data comes into a table upstream of these views, it gets streamed into and through the related job to incrementally populate the matview.

This is correct. There is an example in the docs of dumping json records into a source and then using materialized views to normalize and query them - https://materialize.io/docs/demos/microservice/

I think it would be helpful if you could dive deeper why you think " Refreshing the data every five minutes in batches" is "sufficient".

From my perspective, batching is more complicated than streaming. (Batching requires you to define parameters like batch size and interval, for example, while streaming does not.) Maybe batching tools are simpler than streaming tools, but I am not so sure.

Batching in general has also high(er) latency. That's why I usually don't prefer it unless:

That said, batching has an advantage over streaming: it can amortise a cost that you only pay once per batch. With streaming you would pay that cost for each item as it arrives.
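The amortisation argument can be written down as a small cost model. This is a back-of-the-envelope Python sketch; the function names and all the numbers are hypothetical and chosen only to show the shape of the tradeoff between per-item cost and latency.

```python
def cost_per_item(batch_size, fixed_cost, per_item_cost):
    """Average cost per item when a fixed cost is paid once per batch."""
    return fixed_cost / batch_size + per_item_cost


def worst_case_wait(batch_size, arrival_rate):
    """Seconds the first item in a batch waits for the batch to fill,
    assuming items arrive at a steady `arrival_rate` per second."""
    return (batch_size - 1) / arrival_rate


# Hypothetical numbers: 50 ms fixed overhead per flush (connection setup,
# commit, and so on), 1 ms of work per item, 100 items arriving per second.
streaming_cost = cost_per_item(1, fixed_cost=50.0, per_item_cost=1.0)
batched_cost = cost_per_item(100, fixed_cost=50.0, per_item_cost=1.0)
streaming_wait = worst_case_wait(1, arrival_rate=100.0)
batched_wait = worst_case_wait(100, arrival_rate=100.0)
```

Under these assumed numbers the per-item cost falls from 51 ms to 1.5 ms as the batch grows from 1 to 100, while the first item in the batch waits almost a full second. Whether that trade is worth making depends on which of the two the consumer actually cares about.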

Further, the mindset required of engineers who work with batching is different from the one required for streaming.

Each of these items can be a valid concern when choosing between batching and streaming. However, I find it difficult to value statements like "batching is the default because the industry has been doing it for years."

I think the industry as a whole benefits when engineers in these kinds of discussions spell out why certain conditions lead to a choice like batching.

> I think it would be helpful if you could dive deeper why you think " Refreshing the data every five minutes in batches" is "sufficient".

Not OP, but I'm guessing because most of that data is not actionable in real-time. There's zero point in getting real-time data to analysts or decision makers if they're not going to use it to make real-time decisions; arguably, it can even be counterproductive, leading to an organizational ADHD, where people fret over minute-to-minute changes when they should be focusing on daily or monthly running averages.

While they focus on the very-fast-updates thing, I think their technology will apply to batch cases too. In both streaming and batching I want to do the least possible work, and their claim is that they can skip a lot of unnecessary computation automatically.

That said, I find that batch systems have enormous inertia due to simple don't-touch-it syndrome. A report got developed in 1992 for a manager who retired in 1998 and died in 2009. Each night it churns through 4 billion records in a twelve-way join that costs tens of thousands of dollars of computing time per year.

Who reads this report? Nobody. In fact, the person who asked for it read it two or three times and then stopped. But it's landed reliably in an FTP folder for 28 years and by god nobody is game to find out whether the CEO reads it religiously.