by dunham
1610 days ago
There was a project out of MIT CSAIL back in 2006 that did automated extraction of tabular data from web pages, e.g. product lists on a store site. It recognized pagination and looked for a sequence of repeated DOM structures (and what varied among them) to identify the items. You might find it interesting: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.90....
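
For a rough feel of the "repeated DOM structures" idea, here is a minimal sketch in Python. To be clear, this is not the algorithm from that project, just my own toy illustration: it assumes items are sibling elements with identical tag structure, ignores attributes and pagination entirely, and all the names (`Node`, `TreeBuilder`, `find_repeated_items`, `SAMPLE_HTML`) are made up for the example.

```python
from collections import Counter
from html.parser import HTMLParser


class Node:
    """A bare-bones DOM node: tag, parent, child elements, and direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []

    def signature(self):
        # Tag structure only; text is ignored, so items whose content
        # differs still compare equal.
        return (self.tag, tuple(c.signature() for c in self.children))

    def texts(self):
        # The text is the part that varies between repeated items.
        out = list(self.text)
        for child in self.children:
            out.extend(child.texts())
        return out


class TreeBuilder(HTMLParser):
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:  # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None and self.current.tag == tag:
            self.current = self.current.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def find_repeated_items(html):
    """Return the text fields of the largest group of same-shaped siblings."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best = []
    for parent in walk(builder.root):
        # Only count children that have nested elements of their own.
        counts = Counter(c.signature() for c in parent.children if c.children)
        if not counts:
            continue
        sig, n = counts.most_common(1)[0]
        if n >= 2 and n > len(best):
            best = [c for c in parent.children if c.signature() == sig]
    return [item.texts() for item in best]


SAMPLE_HTML = """
<html><body><h1>Shop</h1><ul>
<li><b>Tea</b><i>$3</i></li>
<li><b>Coffee</b><i>$4</i></li>
<li><b>Cocoa</b><i>$5</i></li>
</ul><p><a href="?page=2">Next</a></p></body></html>
"""

rows = find_repeated_items(SAMPLE_HTML)
print(rows)
```

On the sample page this picks out the three `<li>` elements and returns one row of text fields per item. Real pages are much messier (optional fields, ads between items, wrapper divs), so an exact structural match like this is probably too strict in practice.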
It also reminds me of this amazing project, which likewise deals in structured data and tables: https://www.geoffreylitt.com/wildcard/