Apache Hudi 1.0 released with secondary indexes for data lakehouses

	Apache Hudi 1.0 released with secondary indexes for data lakehouses (hudi.apache.org)
	11 points by v5c6 592 days ago

4 comments

redhouse 592 days ago

Really curious about the performance gain we can get from the secondary index, and what it costs.

v5c6 592 days ago

It’s comparable to any database index: the gain depends on the selectivity of the query. On a 10TB TPC-DS web_sales table with 1:150 selectivity, we see an impressive 95% gain.

If the query fetches most of the records, for example, then the gains will be lower.
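
To make the selectivity point concrete, here is a rough back-of-the-envelope sketch. It is a toy model, not how Hudi measures or plans anything: it assumes matching records are scattered uniformly at random across files, and that an index lets the reader skip every file with no match. The function names and the `records_per_file` values are made up for illustration.

```python
def fraction_of_files_read(selectivity: float, records_per_file: int) -> float:
    """Chance that a file holds at least one matching record.

    Toy assumption: each record matches independently with probability
    `selectivity`, so a file is skippable only if none of its records match.
    """
    return 1.0 - (1.0 - selectivity) ** records_per_file


def scan_savings(selectivity: float, records_per_file: int) -> float:
    """Fraction of files an index-assisted read could skip in this toy model."""
    return 1.0 - fraction_of_files_read(selectivity, records_per_file)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the trend.
    for selectivity in (1 / 100_000, 1 / 1_000, 1 / 150, 0.5):
        for records_per_file in (10, 1_000):
            saved = scan_savings(selectivity, records_per_file)
            print(f"selectivity={selectivity:.6f} "
                  f"records_per_file={records_per_file} "
                  f"files_skipped={saved:.1%}")
```

The trend is the point, not the absolute numbers: as the query matches a larger share of records, fewer files can be skipped and the benefit shrinks toward zero.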

dunwaldo 592 days ago

Is this too little, too late while Snowflake and Databricks are marketing Iceberg full steam? Maybe Hudi will hang on a little longer than Delta if it builds new things like this?

v5c6 592 days ago

An open source community cannot out-market big vendors. But it can certainly out-execute them, and judicious engineers will continue making choices based on technical evaluations, which keeps it going.

I’d be very surprised if Delta goes away, since Iceberg is still not feature-complete enough to replace it. Databricks has a somewhat confusing position now, which is hurting them. It’ll be interesting to watch.

sudha_sakthee 592 days ago

First lakehouse system to introduce secondary indexes!

cloud8bits 592 days ago

Are there good use cases for indexing on data lakes?

sudha_sakthee 592 days ago

On the write side, upserts get faster because the index says where each record already lives; on the read side, queries benefit from fast lookups.
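
A minimal sketch of both sides, using plain Python dictionaries. This is only an illustration of the idea, not Hudi's implementation or API: `ToyIndexedTable`, its methods, and the `city` column are all hypothetical names.

```python
from collections import defaultdict


class ToyIndexedTable:
    """In-memory stand-in for a table with a record-level and a secondary index."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.files = defaultdict(dict)          # file_id -> {record_key: row}
        self.key_to_file = {}                   # record index: key -> file_id
        self.value_to_keys = defaultdict(set)   # secondary index: value -> keys

    def upsert(self, key, row, new_file_id):
        # Write side: the record index tells us which file already holds the
        # key, so an update goes straight there instead of scanning all files.
        file_id = self.key_to_file.get(key, new_file_id)
        old_row = self.files[file_id].get(key)
        if old_row is not None:
            # Drop the stale secondary-index entry before re-indexing.
            self.value_to_keys[old_row[self.indexed_column]].discard(key)
        self.files[file_id][key] = row
        self.key_to_file[key] = file_id
        self.value_to_keys[row[self.indexed_column]].add(key)
        return file_id

    def lookup(self, value):
        # Read side: go from column value to keys to files, touching only
        # the files that actually contain matching rows.
        keys = self.value_to_keys.get(value, set())
        return [self.files[self.key_to_file[k]][k] for k in sorted(keys)]


table = ToyIndexedTable("city")
table.upsert("r1", {"city": "sf", "fare": 10}, "f1")
table.upsert("r2", {"city": "nyc", "fare": 7}, "f2")
# An update to r1 is routed to f1, where r1 already lives, not to f3.
table.upsert("r1", {"city": "nyc", "fare": 12}, "f3")
```

A real table keeps these indexes on storage and has to keep them consistent with concurrent writers, which is where the maintenance cost comes from; the sketch ignores all of that.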

dunwaldo 592 days ago

I thought it was more for writes, but from reading this it looks like the index will also help reads.

v5c6 592 days ago

Yes. Indexes are integrated into reads, with (near) standard SQL for managing them.
