by mattashii 897 days ago
How does performance scale (vs pgvector) when you have an index and start loading data in parallel? Or how does this scale vs the to-be-released pgvector 0.5.2?

I'm also concerned about these (tested!) errors:

> https://github.com/lanterndata/lantern/blob/040f24253e5a2651...

> Operator <-> can only be used inside of an index

Isn't the use of the distance operator in scan+sort critical for generating the expected/correct result that's needed for validating the recall of an ANN-only index?
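For readers unfamiliar with the point being raised: recall for an ANN index is usually measured against exact results from a full scan plus sort on distance. A minimal sketch of that check follows, assuming L2 distance and toy data; `exact_top_k`, `recall_at_k`, and the `ann_result` ids are hypothetical illustrations, not part of Lantern or pgvector.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_top_k(query, vectors, k):
    """Scan + sort: rank every vector by distance and keep the k nearest ids."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))
    return ranked[:k]


def recall_at_k(exact_ids, ann_ids):
    """Fraction of the exact nearest neighbours that the ANN result also returned."""
    exact = set(exact_ids)
    if not exact:
        return 0.0
    return len(exact & set(ann_ids)) / len(exact)


# Toy data; a real check would use the benchmark dataset.
vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (5.0, 1.0)]
query = (0.1, 0.1)

ground_truth = exact_top_k(query, vectors, k=3)
ann_result = [0, 1, 3]  # hypothetical ids returned by an approximate index

recall = recall_at_k(ground_truth, ann_result)
print(f"recall@3 = {recall:.2f}")
```

Without a way to run the exact scan + sort, there is no ground truth to compare the index output against.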

Ah, thank you for noticing! We actually have a typo in the error message. It should be the operator <?> instead of <->.

There's some context on the operator <?> here: https://github.com/lanterndata/lantern?tab=readme-ov-file#a-...

We haven't benchmarked against 0.5.2 yet so I can't share exact numbers. We will benchmark it once it is released.

We think our approach will still significantly outperform pgvector because it does less on your production database.

We generate the index remotely, on a compute-optimized machine, and only use your production database for index copy.

Parallel pgvector would have to use your production database resources to run the compute-intensive HNSW index creation workload.

which version of pgvector are you using for these benchmarks?
We used 0.5.0 for these
It’s not really a fair comparison in that case:

https://x.com/pgvector/status/1711910075416432785?s=46

Do you have the code you used so that we can reproduce these results?

I added an edited note to the bottom of the blog post.

The original post and the experiments were created before pgvector 0.5.1 was out, and we had not realized there was significant work to optimize index creation time in the latest pgvector release.

We reran the pgvector benchmarks with pgvector 0.5.1. pgvector index creation is now on par with or 10% faster than Lantern on a single core. Lantern still allows 30x faster index creation by leveraging additional cores.

Wiki:
Pgvector - 36m
Lantern - 43m
Lantern external indexing (32 CPU): 2m 15s

Sift:
Pgvector - 12m30s
Lantern - 7m
Lantern external indexing (32 CPU): 25s
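The speedups implied by these timings can be worked out directly. A small sketch, using only the numbers quoted above; the dictionary keys and helper names are made up for illustration.

```python
def to_seconds(minutes, seconds=0):
    """Convert a minutes/seconds timing to seconds."""
    return minutes * 60 + seconds


# Timings as reported in this comment.
timings = {
    "Wiki": {
        "pgvector": to_seconds(36),
        "lantern": to_seconds(43),
        "lantern_external": to_seconds(2, 15),
    },
    "Sift": {
        "pgvector": to_seconds(12, 30),
        "lantern": to_seconds(7),
        "lantern_external": to_seconds(0, 25),
    },
}


def speedups(results):
    """Ratio of each single-machine timing to the external-indexing timing."""
    external = results["lantern_external"]
    return {
        name: secs / external
        for name, secs in results.items()
        if name != "lantern_external"
    }


for dataset, results in timings.items():
    for name, ratio in speedups(results).items():
        print(f"{dataset}: external indexing is {ratio:.1f}x faster than {name}")
```

On these figures the 30x ratio corresponds to pgvector vs. external indexing on Sift; the other ratios come out between roughly 16x and 19x.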

The DB parameters for the above results (both Lantern and pgvector): shared_buffers=12GB, maintenance_work_mem=5GB, work_mem=2GB

The DB parameters for the previous results were the defaults for both Lantern and pgvector.
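For anyone reproducing this, one way to apply the non-default settings is PostgreSQL's `ALTER SYSTEM SET`. The sketch below only renders the statements; how the settings were actually applied for the benchmark is not stated here, so treat this as an assumption.

```python
# Values as quoted above; the mechanism for applying them is an assumption.
PARAMS = {
    "shared_buffers": "12GB",
    "maintenance_work_mem": "5GB",
    "work_mem": "2GB",
}


def alter_system_statements(params):
    """Render one ALTER SYSTEM SET statement per parameter."""
    return [f"ALTER SYSTEM SET {name} = '{value}';" for name, value in params.items()]


for statement in alter_system_statements(PARAMS):
    print(statement)
```

Note that shared_buffers only takes effect after a server restart, while maintenance_work_mem and work_mem can also be changed per session with `SET`.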

Benchmarking was done using psql timing and used a 32CPU/64GB RAM machine (Linode Dedicated 64).

Feel free to reach out if you need anything for benchmarks.

> Feel free to reach out if you need anything for benchmarks.

likewise, feel free to reach out before publishing pgvector benchmarks. i'm sure we will have some tips to make them more impartial