by lqhl 813 days ago
MyScaleDB uses approximate nearest neighbor (ANN) algorithms such as ScaNN, HNSW, and IVF, so it may not achieve 100% recall. Depending on the search parameters, however, it can reach recall rates of 95% or even 99%.

Considering that embedding vectors represent a lossy compression of the original text or images, is achieving a 100% recall necessary? I am interested in understanding its practical implications.

Disclaimer: I am an employee at MyScale.

1 comment

> Considering that embedding vectors represent a lossy compression of the original text or images, is achieving a 100% recall necessary?

For the app, maybe not. But as a database absolutist, I think you must be able to dump all rows of a table with

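    -- `table` and {similarity} are placeholders for the real table name and distance expression
    WITH
      -- the 10 nearest rows, as returned by the (possibly approximate) top-k search
      limit_result AS (SELECT *, {similarity} AS metric FROM table ORDER BY {similarity} ASC LIMIT 10),
      dist AS (SELECT MAX(metric) AS max_m FROM limit_result)
    -- every remaining row; table.* keeps the column list identical on both sides of the UNION
    SELECT table.*, {similarity} AS metric FROM table, dist WHERE {similarity} > dist.max_m
    UNION ALL
    SELECT * FROM limit_result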
... assuming that the ordered values are unique across the table and fully sortable

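With recall below 100%, the top-10 search may skip a row that belongs in limit_result. If another row takes its place, max_m ends up above the skipped row's metric, so the skipped row also fails the `> dist.max_m` filter in the main table scan and appears nowhere in the output. That can silently corrupt a data dump process that relies on sorted output.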
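
To make the failure mode concrete, here is a small Python simulation of that query. It is only a sketch, not MyScaleDB code: `approx_top_k` and its `missed_ids` parameter are hypothetical stand-ins for an ANN index with recall below 100%, and the rows are made-up `(metric, id)` pairs with unique metrics.

```python
def exact_top_k(rows, k):
    """Exact k smallest (metric, id) pairs, i.e. 100% recall."""
    return sorted(rows)[:k]


def approx_top_k(rows, k, missed_ids):
    """Hypothetical ANN stand-in: rows whose id is in missed_ids are never returned."""
    visible = [row for row in rows if row[1] not in missed_ids]
    return sorted(visible)[:k]


def dump(rows, top_k):
    """Mirror the query: the top-k rows plus every row whose metric exceeds their maximum."""
    max_m = max(metric for metric, _ in top_k)
    rest = [row for row in rows if row[0] > max_m]
    return top_k + rest


# 100 made-up rows with unique, totally ordered metrics 0.0 .. 99.0
rows = [(float(i), i) for i in range(100)]

full = dump(rows, exact_top_k(rows, 10))
lossy = dump(rows, approx_top_k(rows, 10, missed_ids={3}))

# Rows that the lossy dump never emitted
missing = sorted(set(rows) - set(lossy))
print(len(full), len(lossy), missing)
```

With exact top-k the dump contains all 100 rows. Once the simulated index skips row 3, row 10 takes its slot, max_m rises from 9.0 to 10.0, and row 3 is neither in the top-k result nor above the cutoff, so it never appears.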