|
|
|
|
|
by parhamn
650 days ago
|
|
Is there a "html reducer" out there? I've been considering writing one. If you take a page's source it's going to be 90% garbage tokens -- random JS, ads, unnecessary properties, aggressive nesting for layout rendering, etc. I feel like if you used a dom parser to walk and only keep nodes with text, the html structure and the necessary tag properties (class/id only maybe?) you'd have significant savings. Perhaps the xpath thing might work better too. You can even even drop necessary symbols and represent it as a indented text file. We use readability for things like this but you lose the dom structure and their quality reduces with JS heavy websites and pages with actions like "continue reading" which expand the text. Whats the gold standard for something like this? |
|