Hacker News new | ask | show | jobs
by bigpeach 1227 days ago
> Training models on top works great, even averaging 2-3 phrases works great for making a more precise query.

What does this mean?

1 comments

If you average a query embedding with that of similar terms, you get a more focused ranking. Averaging amplifies the common component of the vectors and diminishes the spurious components.
What do you mean by "average a query embedding"?
Presumably this operation:

  def average_embeddings(embeddings):
    return [
      sum([embedding[x] for embedding in embeddings]) / len(embeddings)
      for x in range(len(embeddings))
    ]
Just finding the mid-point in N-dimensional space between all points. Although I believe OpenAI says these are directions rather than locations. Averaging makes sure you don't accidentally pick a target vector that happens to be on the fringes of your desired embedding space.

So if you're querying a document for predatory birds and you are using the string "Eagles" as your target you will be accidentally digging into embedding space for sports teams (go birds). But if your target is `average_embeddings([eagles_embedding, hawks_embedding, falcons_embedding])` you'll end up with less noise.

When you input a query, you get a numerical representation output. If you have 3-5 similar queries (each phrased a bit differently), they all have slightly different outputs. If you average them all together, it combines them in a way that improves results.