| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by cirego 1884 days ago

Disclaimer: I work at Materialize.

This is the exact problem that we are solving here at Materialize! I wrote an example blogpost that details to how to subscribe to a SQL query in practice: https://materialize.com/a-simple-and-efficient-real-time-app...

Regarding your comment about "focus on streams", it's true that we first focused on extracting event data from other sources (Kafka, Debezium, direct change data capture from external databases such as Postgres, S3, etc). Over time, however, we plan to add additional features that will allow users to also treat Materialize as a general purpose database.

Hope this helps and happy to answer any questions!

edit: I was imprecise in my usage of the term event streams. Materialize supports inserts, updates and deletes at its core (the topK query shown in the blog post above shows this). Materialize is a more general solution than something focused on append-only event streams.

4 comments

albertwang 1884 days ago

As cirego mentioned, we (Materialize) have the TAIL operator, which was built to allow users to subscribe to changes:

https://materialize.com/docs/sql/tail/

link

jfim 1884 days ago

Out of curiosity, how does TAIL behave with updates?

For example, say that I have a table foo that has a single row, if that row's contents is updated, would I get two updates (a deletion and an insert)?

link

benesch 1884 days ago

Yes, exactly. Roughly speaking you'd see something like this:

    (id, text1, text2, timestamp, diff)
    ----
    (42, before, data, Friday April 23 12:27AM, -1)
    (42, after, data, Friday April 23 12:27AM, +1)

link

cirego 1884 days ago

You are correct! Updates are a expressed as a retraction and an insert that happen within the same timestamp.

An example may not be necessary but it might also help clarify. Assuming you're using the psql client to run "TAIL WITH (PROGRESS)", the logical grouping for a single update will be a set of rows like the following:

  ...
  1608081358001 f -1 ['Lockal', '4590']
  1608081358001 f 1 ['Epidosis', '4595']
  1608081358001 f -1 ['Matlin', '5220']
  1608081358001 f 1 ['Matlin', '5221']
  1608081359001 t \N ['\\N', '\\N']
  ...

All of these occur at the same timestamp, meaning that they should be applied atomically to maintain consistency of your dataset. In this case, my query is a top-10 query and Epidosis has now entered the top10 while Lockal has dropped out of the top10. Matlin remains in the top10 but their total has gone from 5220 to 5221. The final example record is produced when you run with PROGRESS enabled and serves as an indicator that 1608081359001 is now closed and no further updates will ever happen at timestamp 1608081359001.

I find that this stream of rows is very easy to convert to a data structure "{timestamp, inserts[], deletes[]}" and this, in turn, maps naturally onto reactive APIs, such as React or D3. My blog post, linked above, delves into this in more detail. Hope this explanation helps!

link

jfim 1883 days ago

That's a great example, thanks a lot!

link

gst 1884 days ago

Materialize definitely looks interesting. I'd have a few questions regarding the current product:

Does Materialize provide a commercial non-cloud version where workers can run on multiple nodes in a cluster? On the website I only see references to the BSL-licensed single-node version and to the cloud version that is hosted at Materialize. Would a company be able to run Materialize on their own Kubernetes cluster?

Also is it possible to add custom code for data sources (for example Kafka, but with a different format than what Materialize currently expects) or is Materialize limited to the pre-defined sources?

link

albertwang 1884 days ago

Sorry if it's a bit of an enterprisy-response, but please reach out! We are supporting some non-cloud commercial customers on a case-by-case basis. The reason for this is that we find the support and maintenance burden to be much higher with a non-cloud delivery model, which isn't always a great experience for either party. We also have a managed product (where the data plane resides in your environment) that may work, depending on your infra and security requirements.

Re: custom code -- our codebase is fully source-available and open to contributions, but the source+sink code going through some refactoring to make it more beginner-friendly. Depending on your consistency requirements, we also support Debezium and our own CDC format (https://materialize.com/docs/connect/materialize-cdc/) for folks who want to bring in their own data sources. (For quick prototypes, we also support csv/json/plaintext source types, as well as SQL INSERTs!)

link

atwebb 1884 days ago

Following your project and Airbyte very closely, both seem to have caught and, if coupled, would provide some pretty compelling use cases, especially for retaining control of systems.

Thought for a while most of the analytics engines are really just getting around materialized view limitations.

link

albertwang 1884 days ago

Exactly! Since most ELT tools (including Airbyte) support json + csv output formats, those work perfectly well with Materialize out-of-the-box. I'm playing around with Slack+Stripe Airbyte sources to try and come up with some fun dashboards to show off in Materialize as we speak.

link

aravindet 1884 days ago

I am considering Materialize for a project, and I'm concerned about the limitation that materialized views should fit into the memory of a single host.

My use case is a materialized OLAP "fact" tables, created by joining a 5-10 separate tables. While the biggest source tables have under a million rows, the joined fact table would have perhaps 100 million rows which I fear might be too big.

One workaround I considered is creating a non-materialized view, TAIL-ing it, and persisting the results to another database.

Could you comment on this problem and workaround?

link

cirego 1883 days ago

At the moment, we are focused on scaling up with a single instance. While there are a couple of strategies for scaling out, none of our current methods support scaling a single view beyond a single instance. This isn't really a technical limitation but rather more of a testing and supportability limitation. Timely and Differential (the incremental computation frameworks used by Materialize) suport scale-out and have been extensively tested in scale-out scenarios.

Do you have a limit in mind on how large you're willing to go? A brief bit of napkin math shows that, at 1KB per record, your largest view should fit within 100GB of memory. This is easily supported by Materialize and I can certainly understand if this exceeds your appetite!

If you're interested in reducing the size of the in-memory dataset, does your fact table have a temporal dimension to it? By default, Materialize stores data for all time (like a database) and you can write views in such a way that it will only materialize recent data (like a stream processor). Our co-founder Frank wrote a blog post[1] detailing how to do this.

I'm honestly not sure how TAIL-ing a non-materialized view would perform. Sounds like something that would be fun to test!

Happy to chat about this further in our community Slack channel if you have more questions.

[1] https://materialize.com/temporal-filters/

link