by sr3d
3633 days ago
This is definitely an interesting library to watch. I have built many scrapers and extractors for in-house use and spent many hours tweaking Readability JS, so I know how complicated and hard to test that code is. Seeing how Fathom does its job is cool: it takes care of a lot of the low-level bookkeeping so that all you need to do is focus on tweaking the ranking formula. I wouldn't be surprised if in the future we had a shared repo of "recipes" for parsing pages. Slap a nice UI with DOM traversal on top and we'd have a Kimono-like app for parsing content.
https://github.com/mozilla/page-metadata-parser
This repo is exactly what you describe: a collection of 'recipes' or 'rules' for extracting various forms of metadata from pages. It's still very early, but we are close to deploying a first version to users via Test Pilot:
https://testpilot.firefox.com/
I would love feedback or contributions!
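
To make the "recipes" idea concrete, here is a minimal Python sketch of rule-based metadata extraction. It is not the page-metadata-parser or Fathom API; the names `MetaCollector`, `RULES`, and `extract_metadata`, and the scores attached to each rule, are all made up for illustration. Each field gets a list of scored rules, and the highest-scoring rule that finds a value wins.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta> property/name values and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                # Keep the first occurrence of each key.
                self.meta.setdefault(key, attrs["content"])
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical recipe table: each field maps to (score, extractor) rules.
RULES = {
    "title": [
        (3, lambda page: page.meta.get("og:title")),
        (2, lambda page: page.meta.get("twitter:title")),
        (1, lambda page: page.title.strip() or None),
    ],
    "description": [
        (2, lambda page: page.meta.get("og:description")),
        (1, lambda page: page.meta.get("description")),
    ],
}


def extract_metadata(html, rules=RULES):
    """Return the best value per field, trying rules from high to low score."""
    page = MetaCollector()
    page.feed(html)
    result = {}
    for field, field_rules in rules.items():
        ranked = sorted(field_rules, key=lambda rule: rule[0], reverse=True)
        for _score, extract in ranked:
            value = extract(page)
            if value:
                result[field] = value
                break
    return result
```

Adding a new recipe in this sketch means appending one `(score, extractor)` pair to a field's list, which is roughly the "just tweak the ranking" workflow described above.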