Apache Hudi 1.0 released with secondary indexes for data lakehouses (hudi.apache.org)
11 points by v5c6 546 days ago
4 comments

Really curious about the performance gain we can get from the secondary index, and what it costs.
It’s comparable to any database index, and it depends on the selectivity of the query. On a 10TB TPC-DS web_sales table with 1:150 selectivity, we see an impressive 95% gain.

If the query fetches most of the records, for example, the gains will be lower.
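
To make the selectivity point concrete, here is a toy sketch in Python. It is not Hudi code: the file layout, `build_secondary_index`, and `files_to_scan` are all made up for illustration, and it only counts how many files a lookup has to touch.

```python
from collections import defaultdict

def build_secondary_index(files):
    """Map each column value to the set of file ids that contain it."""
    index = defaultdict(set)
    for file_id, values in files.items():
        for v in values:
            index[v].add(file_id)
    return index

def files_to_scan(index, wanted):
    """Return the file ids that hold at least one of the wanted values."""
    hit = set()
    for v in wanted:
        hit |= index.get(v, set())
    return hit

# Hypothetical layout: 100 files with 10 records each, all values unique.
files = {f: list(range(f * 10, f * 10 + 10)) for f in range(100)}
index = build_secondary_index(files)

selective = files_to_scan(index, [5, 123, 777])    # 3 of 1000 records
broad = files_to_scan(index, range(0, 1000, 2))    # every other record

print(len(selective), len(broad))  # 3 files vs. all 100
```

The selective query skips 97 of the 100 files, while the broad one still touches every file, so the index buys nothing there.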

Is this too little, too late while Snowflake and Databricks are marketing Iceberg full steam? Maybe Hudi will hang on a little longer than Delta if it builds new things like this?
An open source community cannot out-market big vendors. But it can certainly out-execute them, and judicious engineers will continue making choices based on technical evaluations, which keeps it going.

I’d be very surprised if Delta goes away, since Iceberg still is not feature-complete enough to replace it. Databricks has a somewhat confusing position now, which is hurting them. It’ll be interesting to watch.

First lakehouse system to introduce secondary index!
Are there good use cases for indexing on data lakes?
Faster upserts directly benefit from indices on the write side and the read side benefits from fast lookups.
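
A rough sketch of the write-side benefit, in Python. This is a toy model and not Hudi's implementation: `upsert`, `key_to_file`, and the dict-based "files" are hypothetical. The idea is that a key-to-file index lets an upsert go straight to the file holding the existing record instead of scanning every file to find it.

```python
def upsert(files, key_to_file, batch, new_file_id):
    """Apply (key, value) pairs: update where the key is indexed, else insert."""
    updated, inserted = 0, 0
    for key, value in batch:
        file_id = key_to_file.get(key)   # index lookup, no file scan
        if file_id is None:
            file_id = new_file_id        # unseen key goes to a new file
            key_to_file[key] = file_id
            inserted += 1
        else:
            updated += 1
        files.setdefault(file_id, {})[key] = value
    return updated, inserted

# Hypothetical table state: two files, three records.
files = {"f1": {"a": 1, "b": 2}, "f2": {"c": 3}}
key_to_file = {k: f for f, recs in files.items() for k in recs}

result = upsert(files, key_to_file, [("b", 20), ("d", 4)], "f3")
print(result)  # (1, 1): one update routed to f1, one insert into f3
```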
I thought it was more for writes, but from reading here it looks like the index will also help reads.
Yes. Indexes are integrated into reads, with (near) standard SQL for managing them.