by kjqgqkejbfefn 842 days ago
>tree-based approach to organize and summarize text data, capturing both high-level and low-level details.

https://twitter.com/parthsarthi03/status/1753199233241674040

Processes documents to organize content and improve readability: it handles sections, paragraphs, links, tables, lists, and page continuations, removes redundancies and watermarks, and applies OCR, with additional support for HTML and other formats through Apache Tika:

https://github.com/nlmatics/nlm-ingestor


I don't understand. Why build up text chunks from different, non-contiguous sections?
At the level of the page, not everything is laid out linearly. The main text is often laid out in columns, the flow can be offset by pictures with captions, additional text can be placed in inserts, etc.

You need a human eye to figure that out, and this is the task nlm-ingestor tackles.

As for the content, semantic contiguity is not always guaranteed. A typical example is conversation, where people engage in narrative/argumentative competitions. Topics get nested as the conversation advances, along the lines of "Hey, this reminds me of ...". This builds up a stack that can be popped once subtopics have been exhausted: "To get back to the topic of ...".

This is explored at length by Kerbrat-Orecchioni in:

https://www.cambridge.org/core/journals/language-in-society/...

And an explanation is offered by Dessalles in:

https://telecom-paris.hal.science/hal-03814068/document
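To make the stack idea concrete, here is a minimal sketch of how nested topics end up grouping non-contiguous utterances. The `open` / `say` / `close` event labels and the `group_by_topic` helper are hypothetical, made up for illustration; in a real transcript, detecting where a digression opens and closes is the hard part, and this sketch assumes it has already been done.

```python
def group_by_topic(events):
    """Group utterances by the topic on top of the stack when they were said.

    events: list of (kind, text) pairs, kind being "open", "say" or "close".
    These labels are an assumption of this sketch, not an existing format.
    """
    stack = []   # currently open topics, innermost last
    groups = {}  # topic -> utterances, in order of appearance
    for kind, text in events:
        if kind == "open":        # "Hey, this reminds me of ..."
            stack.append(text)
            groups.setdefault(text, [])
        elif kind == "close":     # "To get back to the topic of ..."
            if stack:
                stack.pop()
        else:                     # a plain utterance
            topic = stack[-1] if stack else None
            groups.setdefault(topic, []).append(text)
    return groups


events = [
    ("open", "chunking"),
    ("say", "A"),
    ("open", "OCR"),   # digression
    ("say", "B"),
    ("close", ""),     # subtopic exhausted, pop back to "chunking"
    ("say", "C"),
]
groups = group_by_topic(events)
print(groups)
```

Utterances A and C land in the same group even though B sits between them, which is the kind of non-contiguous chunk being asked about.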

If those non-contiguous sections share similar semantic/other meaning, it can make sense from a search perspective to group them?
It starts to look like a graph problem.
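A rough sketch of that graph framing, under loud assumptions: chunks are nodes, an edge links two chunks whose similarity passes a threshold, and each connected component becomes one group. Word-overlap (Jaccard) similarity stands in here for whatever embedding similarity a real system would use, and `group_chunks` and the 0.3 threshold are made up for illustration, not taken from either project linked above.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercased set of alphabetic words; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection over size of the union of two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_chunks(chunks, threshold=0.3):
    """Return groups of chunk indices: connected components of the similarity graph."""
    bags = [tokens(c) for c in chunks]
    adjacency = {i: set() for i in range(len(chunks))}
    for i, j in combinations(range(len(chunks)), 2):
        if jaccard(bags[i], bags[j]) >= threshold:
            adjacency[i].add(j)
            adjacency[j].add(i)

    seen = set()
    groups = []
    for start in range(len(chunks)):
        if start in seen:
            continue
        # Depth-first walk to collect everything reachable from `start`.
        component, todo = [], [start]
        seen.add(start)
        while todo:
            node = todo.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    todo.append(neighbour)
        groups.append(sorted(component))
    return groups


chunks = [
    "table extraction from pdf pages",
    "conversation topics drift over time",
    "pdf pages table extraction with ocr",
]
print(group_chunks(chunks))
```

Chunks 0 and 2 are not adjacent in the document but end up in the same group. Connected components are the simplest choice; a real pipeline might prefer a proper clustering or community-detection step to keep loosely chained chunks from merging.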