by kaycebasques 691 days ago
I spent a lot of time thinking about how to manage embeddings for docs sites. This is basically the same solution that I landed on but never got around to shipping as a general-purpose product.

A key question that the docs should answer (and perhaps the "How it works" page too): chunking. Do you generate one embedding for the entire page, or one per section? And what's the size limit per page? Some of our docs pages run to thousands of words. I doubt you can ingest all of that, and I'm not sure a single embedding of that much text would be useful in practice.


I chunk pages and generate embeddings for each chunk. So there's no real size limit per page.
The more detail, the better. If `<section>` elements are found you chunk those? Do you do it recursively or do you stop after a certain level? And when section elements don't exist, you use `<h1>`, `<h2>`, etc. to infer logical chunks?
Having looked at a lot of HTML, I noticed that sections are not really the default. I rely on headings (h1, h2, ...) to chunk each page. Each chunk has its heading hierarchy attached to it. There are a lot of optimizations that could be done at that level.
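
For anyone following along, here is a minimal sketch of what heading-based chunking with an attached heading hierarchy could look like. It is not the actual implementation discussed above: the names `HeadingChunker` and `chunk_html` are hypothetical, it only uses Python's standard `html.parser`, and it assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Heading tags mapped to their nesting level.
HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class HeadingChunker(HTMLParser):
    """Hypothetical helper: splits a page into chunks at each heading."""

    def __init__(self):
        super().__init__()
        self.chunks = []          # finished chunks, in document order
        self._path = []           # current hierarchy as (level, title) pairs
        self._text = []           # body text collected for the current chunk
        self._open_level = None   # set while we are inside a heading tag
        self._title = []          # text collected for the heading being read

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_LEVELS:
            # A new heading closes whatever chunk was being collected.
            self._flush()
            self._open_level = HEADING_LEVELS[tag]
            self._title = []

    def handle_endtag(self, tag):
        if tag in HEADING_LEVELS and self._open_level is not None:
            title = " ".join("".join(self._title).split())
            # Drop headings at the same or a deeper level, then push this one.
            while self._path and self._path[-1][0] >= self._open_level:
                self._path.pop()
            self._path.append((self._open_level, title))
            self._open_level = None

    def handle_data(self, data):
        if self._open_level is not None:
            self._title.append(data)
        else:
            self._text.append(data)

    def _flush(self):
        # Join text nodes with spaces and normalize whitespace. This is a
        # simplification: script/style contents are not filtered out.
        text = " ".join(" ".join(self._text).split())
        if text:
            self.chunks.append(
                {"headings": [title for _, title in self._path], "text": text}
            )
        self._text = []

    def close(self):
        super().close()
        self._flush()  # emit the final chunk


def chunk_html(html):
    """Return a list of {"headings": [...], "text": "..."} dicts."""
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Each chunk could then be embedded on its own, with the heading path prepended to the text so the embedding keeps some page-level context. Whether that helps retrieval is an assumption worth testing, not something established in this thread.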
I'm just guessing, but I would think that whatever semantics lead to the highest search rank in Google's algorithm are what you're most likely to find out in the wild.