| Coming from the document-format world (publishing), I see a lot of cases like this. In theory it sounds straightforward, but it hinges on how well the document is structured beneath the surface. Because these tools were designed primarily for non-technical users, the priority is the visual and printed outcome, not the underlying structure. One document can look much the same as another (black borders outlining fields, similar or identical field names, etc.) yet be structured entirely differently, and that can be a madhouse of frustrating problems. Writing a solution for even one specific document source can be complex enough. A universal tool that could take in any form like that would probably be a pretty decent moneymaker. My first intuition, though, is that it may be more successful (though no simpler) to develop a model that reads from the visual appearance of the document rather than trying to parse its underlying structure. Open to learning something here, though! |