by munro 905 days ago

> 500ms is an acceptable latency for a search that runs offline without network calls.

Definitely disagree; fast search makes a huge difference. Apple Photos' search is fast. And since CLIP embeddings aren't perfect, you may need to try a few different keywords, so the latency is paid on every attempt.

> For 100,000 embeddings, writing took 6.2 seconds, reading 12.6 seconds and the disk space consumed dropped to 440 MB.

Something sounds very wrong here. I read and write 100-500 GiB of data from Python, and it's much faster than this.

In fact, I'm seeing ~1.3 s to write and ~700 ms to read from trivial Python:

    import pickle
    import sqlite3

    import numpy as np

    IMAGES = 100_000
    EMBEDDING_DIMENSIONS = 1024

    ### write
    db = sqlite3.connect("images.db")
    db.execute("CREATE TABLE IF NOT EXISTS image_embeddings (id INTEGER PRIMARY KEY, embedding BLOB)")

    # generate synthetic embeddings
    embeddings = np.random.normal(size=(IMAGES, EMBEDDING_DIMENSIONS))  # 1.61 s

    # serialize embeddings for storage
    serialized_embeddings = [(pickle.dumps(embedding),) for embedding in embeddings]  # 626 ms

    # insert into db
    db.executemany("INSERT INTO image_embeddings (embedding) VALUES (?)", serialized_embeddings)  # 679 ms
    db.commit()  # 95.3 ms
    db.close()

    ### read
    db = sqlite3.connect("images.db")
    serialized_embeddings = db.execute("SELECT embedding FROM image_embeddings").fetchall()  # 370 ms
    db.close()

    # deserialize back into NumPy arrays
    embeddings = [pickle.loads(x[0]) for x in serialized_embeddings]  # 322 ms

1 comment

cogman10 905 days ago

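For reads in the original code, part of the problem is that they are paging with limit/offset rather than doing a single fetch-all. Even if they didn't want to read everything at once, it would be a lot faster to drop the offset and query based on the id instead. Limit/offset often results in really expensive query plans, since the database typically has to walk past all the skipped rows before it reaches the requested offset.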
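
A minimal sketch of the id-based approach, assuming the `image_embeddings` table from the snippet above. The `iter_embeddings` helper and its `page_size` parameter are hypothetical names used for illustration, not anything from the original code, and the sketch assumes ids are positive integers:

```python
import sqlite3


def iter_embeddings(db, page_size=1000):
    """Yield (id, embedding) rows page by page, seeking on id instead of using OFFSET."""
    last_id = 0  # assumption: ids are positive, as with default INTEGER PRIMARY KEY rowids
    while True:
        rows = db.execute(
            "SELECT id, embedding FROM image_embeddings "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            break
        yield from rows
        # remember where this page ended so the next query can seek straight to it
        last_id = rows[-1][0]


# Usage with an in-memory database and dummy blobs standing in for embeddings
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE image_embeddings (id INTEGER PRIMARY KEY, embedding BLOB)")
db.executemany(
    "INSERT INTO image_embeddings (embedding) VALUES (?)",
    [(bytes([i % 256]) * 8,) for i in range(2500)],
)
db.commit()

rows = list(iter_embeddings(db, page_size=1000))
db.close()
```

Because `id` is the primary key, each page can start with an index seek to `last_id` rather than a scan over every row skipped so far, so the cost per page should stay roughly constant as you go deeper into the table.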
