I've been thinking for a while that we need to move away from layout-based written communication. That is, the need to make things look professionally laid out is an anachronism, and is (very) rarely related to comprehension of the actual content. For example, submissions to regulatory agencies are huge documents; we spend lots of time in (typically) Microsoft Word creating documents that follow a layout tradition. Aside from the time spent (wasted), the downside is that to guarantee that layout for the recipient, the file must be submitted as DOCX or PDF. These formats are unfriendly if you want to do anything programmatically with them, such as extracting raw data. And of course, while LLMs can read such files, there's likely a significant computational overhead compared with a file in a simple machine-readable format (e.g. text, Markdown, XML, JSON).

---

An alternative approach would be to adopt a very simple 'machine first', or 'content first', format (based on JSON, XML, or even HTML, for example) with the minimum metadata needed to support structure, intra-document links, and embedding of images. For human consumption, a simple viewer app would reconstitute the file into something more readable; for machine consumption, the content is already directly available. I'm well aware that such formats already exist (HTML with browsers, or EPUB with readers, for example); the issue is taking the rational step of adopting such a format in place of the legacy alternatives. I'm hoping that the LLM revolution will drive us in just this direction, and that in time, expensive parsing of PDFs will be a thing of the past.
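
To make the 'content first' idea a little more concrete, here is a rough sketch in Python of what such a file and its viewer could look like. Everything in it is an assumption made up for illustration: the JSON schema (the `blocks`, `type`, `id`, `target`, `src` and `alt` keys) is not an existing standard, and `render_html` and `plain_text` are hypothetical helper names. Images are referenced by path here rather than truly embedded, to keep the example short.

```python
import html
import json

# Hypothetical content-first document: structure and content only, no layout.
DOC_JSON = """
{
  "title": "Example Report",
  "blocks": [
    {"type": "heading", "id": "sec-results", "text": "Results"},
    {"type": "paragraph", "text": "Plain text with <angle brackets> stays plain text."},
    {"type": "link", "target": "sec-results", "text": "Back to Results"},
    {"type": "image", "src": "figures/example.png", "alt": "Example figure"}
  ]
}
"""


def render_block(block: dict) -> str:
    """Turn one content block into an HTML fragment (the 'viewer' side)."""
    kind = block["type"]
    text = html.escape(block.get("text", ""))
    if kind == "heading":
        anchor = html.escape(block["id"])
        return f'<h2 id="{anchor}">{text}</h2>'
    if kind == "paragraph":
        return f"<p>{text}</p>"
    if kind == "link":
        # Intra-document link: points at the id of another block.
        target = html.escape(block["target"])
        return f'<p><a href="#{target}">{text}</a></p>'
    if kind == "image":
        src = html.escape(block["src"])
        alt = html.escape(block.get("alt", ""))
        return f'<img src="{src}" alt="{alt}">'
    raise ValueError(f"unknown block type: {kind!r}")


def render_html(doc: dict) -> str:
    """Reconstitute the whole document as a simple HTML page for human readers."""
    title = html.escape(doc["title"])
    body = "\n".join(render_block(b) for b in doc["blocks"])
    return f"<html><head><title>{title}</title></head>\n<body>\n{body}\n</body></html>"


def plain_text(doc: dict) -> list[str]:
    """Machine side: the text is directly available, no layout parsing needed."""
    return [b["text"] for b in doc["blocks"] if b["type"] in ("heading", "paragraph")]


doc = json.loads(DOC_JSON)
page = render_html(doc)
print(page)
print(plain_text(doc))
```

The point of the sketch is the asymmetry: the human-readable view is a cheap, disposable rendering, while a program (or an LLM pipeline) reads the same file with a single `json.loads` call and no layout parsing at all.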