| HN Mirror

by jiggawatts 134 days ago

Unless I'm missing something, this uses a simple synchronous for loop:

    for text in texts:
        key = (text, model)
        if key not in pickle_cache:
            pickle_cache[key] = openai_client.create_embedding(text, model=model)
        embeddings.append(pickle_cache[key])
    operations.save_pickle_cache(pickle_cache, pickle_path)
    return embeddings

At the throughput I was seeing, roughly one embedding per second, a million comments would take more than eleven days to process!
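The arithmetic, as a quick sanity check. This assumes the one-embedding-per-second rate holds steady, with no retries or rate-limit pauses; `days_to_embed` is just a helper name made up for this sketch.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_embed(n_items: int, items_per_second: float) -> float:
    """Wall-clock days needed to embed n_items at a constant rate."""
    return n_items / items_per_second / SECONDS_PER_DAY


# One synchronous request per comment, about one embedding per second.
sequential_days = days_to_embed(1_000_000, 1.0)
print(f"{sequential_days:.1f} days")
```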

I had to call the Gemini model with ten comments at a time from eight threads to reach even the paltry 3K rpm rate limit they offer to "Tier 1" customers.
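A minimal sketch of that setup using only the standard library. `embed_batch` is a hypothetical stand-in for whatever call sends a list of texts to the embedding API and returns one vector per text; it is not a real Gemini client method, and retries and rate-limit handling are left out.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator

Vector = list[float]


def chunked(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def embed_all(
    texts: Iterable[str],
    embed_batch: Callable[[list[str]], list[Vector]],
    batch_size: int = 10,
    workers: int = 8,
) -> list[Vector]:
    """Embed texts in batches of `batch_size` across `workers` threads."""
    batches = list(chunked(texts, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so vectors line up with texts.
        results = pool.map(embed_batch, batches)
        return [vec for batch_vecs in results for vec in batch_vecs]


def fake_embed_batch(batch: list[str]) -> list[Vector]:
    """Hypothetical stand-in for the real API call: one dummy vector per text."""
    return [[float(len(text))] for text in batch]


vectors = embed_all([f"comment {i}" for i in range(25)], fake_embed_batch)
```

At ten comments per request, 3,000 requests per minute works out to at most 30,000 comments per minute, so a million comments would take a bit over half an hour if the limit could actually be saturated.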

Based on this experience, for real "enterprise" customers I might implement a generic wrapper for Google's Batch API that could handle continuous streaming from a database, chunking it, uploading, and then in parallel checking the status of the pending jobs and streaming the results back into a database.
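Roughly the shape I have in mind, as a sketch only. The `submit`, `is_done` and `fetch` callables are hypothetical placeholders rather than Google's actual Batch API, and `FakeBatchBackend` exists only so the example runs. For simplicity it submits every chunk up front and polls in a single loop instead of streaming continuously from a database.

```python
import time
from typing import Callable, Iterable, Iterator


def run_batches(
    chunks: Iterable[list[str]],
    submit: Callable[[list[str]], str],
    is_done: Callable[[str], bool],
    fetch: Callable[[str], list],
    poll_interval: float = 0.0,
) -> Iterator:
    """Submit every chunk as a batch job, then yield results as jobs finish."""
    pending = [submit(chunk) for chunk in chunks]
    while pending:
        still_running = []
        for job_id in pending:
            if is_done(job_id):
                yield from fetch(job_id)  # stream finished results to the caller
            else:
                still_running.append(job_id)
        pending = still_running
        if pending:
            time.sleep(poll_interval)


class FakeBatchBackend:
    """Hypothetical in-memory backend; each job completes on its second poll."""

    def __init__(self) -> None:
        self.jobs: dict[str, list[str]] = {}
        self.polls: dict[str, int] = {}

    def submit(self, chunk: list[str]) -> str:
        job_id = f"job-{len(self.jobs)}"
        self.jobs[job_id] = list(chunk)
        self.polls[job_id] = 0
        return job_id

    def is_done(self, job_id: str) -> bool:
        self.polls[job_id] += 1
        return self.polls[job_id] >= 2

    def fetch(self, job_id: str) -> list:
        return [(text, len(text)) for text in self.jobs[job_id]]


backend = FakeBatchBackend()
results = list(
    run_batches([["a", "bb"], ["ccc"]], backend.submit, backend.is_done, backend.fetch)
)
```

In a real version the consumer of `run_batches` would write each result row back to the database as it arrives.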

2 comments

vienneraphael 134 days ago

Hey, idk if this helps, but I developed something similar to the wrapper you're describing as an open-source Python library.

Just plug any async function into the provided async context manager and you get Batch APIs in two lines of code with any existing framework you currently have: https://github.com/vienneraphael/batchling

Let me know if you have any questions, looking forward to your feedback!

jiggawatts 134 days ago

Looks very nice! This is exactly what I was thinking of doing, except that I work mostly with C# in enterprise settings.

Looking at your approach, the equivalent in .NET land would be if the Microsoft.Extensions.AI package added some sort of batch abstraction side-by-side with (or on top of) its existing IChatClient or IEmbeddingGenerator interfaces.

pjot 134 days ago

Re-reading your comment :) Yes, my demo has just a simple loop when loading the embeddings.

I was replying more to the latency you mentioned. Because DuckDB runs on device, you save the extra network round trip when comparing similarities.
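To illustrate what "comparing similarities" on device amounts to, here is a minimal sketch in plain Python rather than DuckDB SQL. The `stored` vectors and the `top_k` helper are made up for the example; the point is only that once the embeddings are local, ranking them needs no network call.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], stored: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the keys of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in stored.items()]
    return [key for _, key in sorted(scored, reverse=True)[:k]]


# Toy embeddings, already loaded locally (illustrative values only).
stored = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
nearest = top_k([1.0, 0.0], stored)
```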

jiggawatts 134 days ago

I was running SQL Server 2025 on my laptop. The source of latency is calling the Google Gemini API to compute the embedding of the query text.

I was hoping to make a demo that searches as you type, but the two second delay makes it more annoying than useful.

Looking at your sample, you may only be grouping or categorising comments based on their similarity to each other.

I was experimenting with a question -> answer tool for RAG applications.