by chiccomagnus
742 days ago
Well-written article, but it misses some key considerations:

- Titles matter, a lot: if you add the title of the section at the start of each chunk, you will get 10x better embeddings and so more accurate results.
- Size alone doesn't matter: the right chunk size depends on the combination of the layout and the semantics of the content.
- Avoid garbage in, garbage out: larger context windows don't mean you can put trash inside them. The better you are at including only relevant information, the more precise the answers you get. This is especially important for enterprise-grade solutions.

There are good emerging API solutions that implement semantic + layout-based chunking, which in my opinion is the best chunking strategy for PDF / Office files (the most common use case for enterprises).
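To make the first point concrete, here is a minimal sketch of title-prefixed chunking. It assumes a Markdown-like document where headings start with `#` and are separated from the body by blank lines. The names `Chunk`, `chunk_with_titles`, and `max_chars` are made up for illustration and don't come from the article or from any particular library.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    title: str  # section heading the chunk came from
    text: str   # title + body: the string you would pass to an embedding model


def chunk_with_titles(document: str, max_chars: int = 500) -> list[Chunk]:
    """Split a document into chunks and prefix each one with its section title.

    Assumption: headings are blocks starting with '#', and blocks are
    separated by blank lines. A single paragraph longer than max_chars
    is kept whole rather than split.
    """
    chunks: list[Chunk] = []
    title = "Untitled"
    paragraphs: list[str] = []

    def flush() -> None:
        # Pack the current section's paragraphs into chunks of roughly
        # max_chars, repeating the section title at the start of each.
        buf = ""
        for para in paragraphs:
            if buf and len(buf) + len(para) + 2 > max_chars:
                chunks.append(Chunk(title, f"{title}\n\n{buf}"))
                buf = ""
            buf = f"{buf}\n\n{para}" if buf else para
        if buf:
            chunks.append(Chunk(title, f"{title}\n\n{buf}"))
        paragraphs.clear()

    for block in document.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            flush()  # close the previous section before starting a new one
            title = block.lstrip("#").strip()
        else:
            paragraphs.append(block)
    flush()
    return chunks


if __name__ == "__main__":
    doc = "# Intro\n\nHello world.\n\n# Setup\n\nStep one.\n\nStep two."
    for chunk in chunk_with_titles(doc, max_chars=12):
        print(repr(chunk.text))
```

Each chunk carries its heading even when a section is split into several pieces, so the embedding sees the topic as well as the fragment. Real PDF / Office files would need a layout-aware parser to find the headings first; this sketch only covers the prefixing step.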