| HN Mirror



	by chadthenderson 4042 days ago
	This looks very cool. Although, I'm not sure I totally understand how it can be used to replace batch ETL processes. So, PipelineDB eliminates ETL batch processing by incrementally inserting data into continuous views, but the documentation says that it's not meant for ad-hoc data warehouses as the raw data is discarded. So, does that leave me still using batch processes to load my data warehouse? Is PipelineDB going to be my data warehouse as long as I only want the resulting streamed data? Just trying to figure out what this would look like and where its place is in a data warehouse environment.


grammr 4042 days ago

Hey Chad, PipelineDB co-founder here. PipelineDB certainly isn't intended to be the only tool in your data infrastructure. But whenever the same queries are being run repeatedly on granular data, it often makes a lot of sense to just compute the condensed result incrementally with a continuous view, because that's the only lens the data is ever viewed through anyway (dashboards are a great example of this). Continuous views can be further aggregated and queried like regular tables too.
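
To make the incremental idea concrete, here is a rough Python sketch of the concept. It is not PipelineDB's API: `ContinuousAverage`, `RunningStats`, `insert`, and `query` are hypothetical names made up for illustration, and the page-load numbers are invented sample data. The point is only that each event is folded into a small per-key aggregate and the raw event is never stored.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Condensed per-key state; this is all that is kept."""
    count: int = 0
    total: float = 0.0

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


class ContinuousAverage:
    """Hypothetical stand-in for a view like AVG(value) GROUP BY key."""

    def __init__(self) -> None:
        self._state = defaultdict(RunningStats)

    def insert(self, key: str, value: float) -> None:
        # Fold the event into the aggregate; the raw event is not retained.
        stats = self._state[key]
        stats.count += 1
        stats.total += value

    def query(self) -> dict:
        # Reading the result needs no scan over raw events.
        return {key: stats.mean for key, stats in self._state.items()}


view = ContinuousAverage()
for page, load_ms in [("/home", 120.0), ("/home", 80.0), ("/docs", 200.0)]:
    view.insert(page, load_ms)

print(view.query())  # {'/home': 100.0, '/docs': 200.0}
```

Memory use here grows with the number of distinct keys rather than the number of events, which is why this approach suits dashboards that always ask the same question.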

In terms of not requiring that raw data be stored, a typical setup is to keep raw data somewhere cheap (like S3) so that it's there when you need it. But granular data is often overwhelmingly cold and never looked at again so it may not always be necessary to store it all in an interactively queryable datastore.

As I mentioned, PipelineDB certainly doesn't aim to be a monolithic replacement for all adjacent data processing technologies, but there are areas where it can definitely introduce significant efficiency.


chadthenderson 4042 days ago

Great. Thank you for the clarification. What you just described definitely sounds like something PipelineDB would be great for. I can see it being especially useful for quickly standing up dashboards and maybe even datamarts when considering new data sources. I just wanted to make sure that I wasn't missing something.


reubano 4042 days ago

So what's the best practice when you want a real-time dashboard but also want the ability to compare data over time, e.g., average bounce rate this month vs. last? Is PipelineDB still ideal in this case?


Fergi 4041 days ago

Jeff here (PipelineDB co-founder). Yes, PipelineDB is great for this use case. One powerful aspect of PipelineDB is that, in addition to being a streaming-SQL engine, it is a fully functional relational database (a superset of PostgreSQL 9.4). We have integrated the notion of 'state' into stream processing for use cases exactly like this.

You can do anything with PipelineDB that you can do with PostgreSQL 9.4, but with the addition of continuous SQL queries, sliding windows, probabilistic data structures, distinct counting, and stream-table JOINs (which is what you're looking for here, I believe).
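
As a rough illustration of the month-over-month comparison asked about above, here is a minimal Python sketch. It assumes the stream is condensed into one small bucket per calendar month; it does not use PipelineDB's API, and `MonthlyBounceRate`, `previous_month`, and the sample sessions are all hypothetical.

```python
from collections import defaultdict
from datetime import date
from typing import Optional, Tuple


def previous_month(year: int, month: int) -> Tuple[int, int]:
    """Return the (year, month) immediately before the given one."""
    return (year - 1, 12) if month == 1 else (year, month - 1)


class MonthlyBounceRate:
    """Hypothetical per-month aggregate kept up to date as sessions arrive."""

    def __init__(self) -> None:
        # (year, month) -> [sessions, bounces]
        self._buckets = defaultdict(lambda: [0, 0])

    def insert(self, day: date, bounced: bool) -> None:
        bucket = self._buckets[(day.year, day.month)]
        bucket[0] += 1
        bucket[1] += int(bounced)

    def rate(self, year: int, month: int) -> Optional[float]:
        # .get avoids creating an empty bucket for months with no data.
        sessions, bounces = self._buckets.get((year, month), (0, 0))
        return bounces / sessions if sessions else None


stats = MonthlyBounceRate()
sessions = [
    (date(2015, 3, 2), True), (date(2015, 3, 9), False),
    (date(2015, 3, 15), False), (date(2015, 3, 28), False),
    (date(2015, 4, 1), True), (date(2015, 4, 3), True),
    (date(2015, 4, 10), False), (date(2015, 4, 12), False),
]
for day, bounced in sessions:
    stats.insert(day, bounced)

this_month = stats.rate(2015, 4)
last_month = stats.rate(*previous_month(2015, 4))
print(this_month, last_month)  # 0.5 0.25
```

Because each month's bucket persists after the raw sessions are gone, the dashboard can show the live value for the current month next to the finished value for the previous one.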
