| HN Mirror



	by dunham 1610 days ago
	There was a project out of MIT CSAIL back in 2006 that did automated extraction of tabular data from web pages, e.g. product lists on a store site. It recognized pagination and looked for a sequence of repeated DOM structures (and what varied between them) to identify the items. You might find it interesting: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.90....
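	To make the "repeated DOM structures" idea concrete, here is a rough sketch in Python. It is not the algorithm from the paper, just one simple way to approach it under some assumptions: the HTML is reasonably well nested, items are sibling elements with an identical tag structure, and the text inside each item is the part that varies. The names `Node`, `TreeBuilder`, `find_repeated_items`, and `min_repeats` are made up for this illustration, and pagination is not handled at all.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []

    def signature(self):
        # Structural fingerprint: tag names only, text and attributes ignored.
        inner = ",".join(child.signature() for child in self.children)
        return f"{self.tag}({inner})"

    def leaf_texts(self):
        # Own text first, then the text of descendants, left to right.
        texts = list(self.text)
        for child in self.children:
            texts.extend(child.leaf_texts())
        return texts


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def find_repeated_items(html, min_repeats=3):
    """Return the text fields of the largest group of same-shaped siblings."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()

    best = []
    stack = [builder.root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        counts = Counter(child.signature() for child in node.children)
        if not counts:
            continue
        sig, n = counts.most_common(1)[0]
        if n >= min_repeats and n > len(best):
            best = [c for c in node.children if c.signature() == sig]
    return [item.leaf_texts() for item in best]


page = """
<div><h1>Shop</h1>
<ul>
<li><b>Tea</b><i>$3</i></li>
<li><b>Coffee</b><i>$4</i></li>
<li><b>Cocoa</b><i>$5</i></li>
</ul><p>Next page</p></div>
"""
for row in find_repeated_items(page):
    print(row)
```

	On the toy page above, the three `li` elements share one signature, so each comes back as a row of its varying text fields. Real pages would need fuzzier matching (optional fields, ads mixed into the list), which an exact signature comparison like this one does not attempt.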

1 comment

topcat31 1610 days ago

"We propose that web sites can be similarly augmented with other sophisticated data-centric functionality, giving users new benefits over the existing Web." - gonna check this paper out!

It also reminds me of this amazing project, which deals with structured data and tables too: https://www.geoffreylitt.com/wildcard/
