by mattashii 812 days ago
> Considering that embedding vectors represent a lossy compression of the original text or images, is achieving a 100% recall necessary?

For the app, maybe not. But as a database absolutist, I think you must be able to dump all rows of a table with

    WITH
      limit_result AS (SELECT *, {similarity} AS metric FROM table ORDER BY {similarity} ASC LIMIT 10),
      dist AS (SELECT MAX(metric) AS max_m FROM limit_result)
    SELECT table.*, {similarity} AS metric FROM table, dist WHERE {similarity} > dist.max_m
    UNION ALL
    SELECT * FROM limit_result
... assuming that the ordered values are unique across the table and fully sortable

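With recall below 100%, the index scan behind limit_result may skip rows that belong in the top 10. A skipped row typically has a metric below max_m (the index returns a farther row in its place), so the main table scan's "> dist.max_m" filter excludes it too, and it never shows up in the output. That can silently corrupt a data dump process that relies on sorted output.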
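
To make the failure mode concrete, here is a minimal sketch using Python's sqlite3 module. Everything in it is an assumption for illustration: the items table, a precomputed metric column standing in for {similarity}, and a skipped table that simulates the rows an approximate index fails to return. It is not how any particular vector index behaves internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: metric stands in for {similarity}; values are unique.
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, metric REAL NOT NULL UNIQUE)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [(i, i * 0.5) for i in range(1, 21)])
# Hypothetical helper table: ids the simulated approximate index "misses".
conn.execute("CREATE TABLE skipped (id INTEGER PRIMARY KEY)")

QUERY = """
WITH
  limit_result AS (
    SELECT id, metric FROM items
    WHERE id NOT IN (SELECT id FROM skipped)  -- simulated recall loss
    ORDER BY metric ASC LIMIT ?
  ),
  dist AS (SELECT MAX(metric) AS max_m FROM limit_result)
SELECT items.id, items.metric FROM items, dist WHERE items.metric > dist.max_m
UNION ALL
SELECT id, metric FROM limit_result
"""

def dump_ids(conn, k, skipped=()):
    """Run the two-part dump and return the sorted ids that came back."""
    conn.execute("DELETE FROM skipped")
    conn.executemany("INSERT INTO skipped VALUES (?)", [(i,) for i in skipped])
    rows = conn.execute(QUERY, (k,)).fetchall()
    return sorted(row[0] for row in rows)

exact = dump_ids(conn, 10)               # perfect recall in the top-k scan
lossy = dump_ids(conn, 10, skipped=[3])  # the top-k scan misses id 3
missing = sorted(set(exact) - set(lossy))
print(len(exact), len(lossy), missing)
```

With exact recall the dump returns all 20 rows. When the top-10 scan misses id 3, row 11 takes its place, max_m rises to that row's metric, and id 3 falls below the cutoff of the second scan, so it is absent from the dump entirely.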