by frankfrank13 485 days ago
I worked on a project converting Word docs to Markdown so they could be more easily ingested into an LLM. One issue was that context windows used to be very short, so we would basically split on `\n#` to get sections. But this turns into a whole thing where you have to guess which header level is appropriate to split at, and then you turn each section into a separate chunk in FAISS. Anyway, we ended up using HTML instead of MD because there's so much tooling for traversing HTML and not MD. This would have been helpful for that.
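
For anyone curious, the header-splitting approach looks roughly like the sketch below. This is not the code from that project: the function name `split_markdown_sections` and the `level` parameter are made up for illustration, and it assumes ATX-style headings (`#`, `##`, ...) at the start of a line. It is deliberately naive and will also split on `#` lines inside fenced code blocks, which is part of why this gets messy.

```python
import re


def split_markdown_sections(text: str, level: int = 1) -> list[str]:
    """Split Markdown text into sections at headings of exactly `level` '#'s.

    Hypothetical helper for illustration only. Text before the first
    matching heading is returned as its own section.
    """
    # Zero-width match at the start of any line that begins with exactly
    # `level` '#' characters followed by a space (requires Python 3.7+).
    pattern = re.compile(rf"(?m)^(?=#{{{level}}} )")
    parts = pattern.split(text)
    # Drop empty pieces and trim surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]


doc = "intro\n# A\ntext a\n## A1\nsub\n# B\ntext b\n"
top_level = split_markdown_sections(doc, level=1)
second_level = split_markdown_sections(doc, level=2)
```

Each returned string would then be embedded and stored as its own chunk. The guesswork is in picking `level`: too shallow and the sections blow past the context window, too deep and the chunks lose their surrounding context.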