Hacker News new | ask | show | jobs
by jimjimjim 499 days ago
The LLMs might help with sequencing the characters you extract from the page but actually getting the contents is still difficult. A number of times I've come across a page where the letters of the text are glyphs in a custom font with no mapping to ascii or anything similar or even more common, especially with output from CAD, are letters that are made by drawing lines in the shape of letters so there is nothing identifiable to extract and you are left with OCRing the page to double check the results