by Cactus2018 2095 days ago
> I'm not sure if it does now, but I do know that reading data from tables in a manner that can be easily integrated and scaled within a broader semantic processing system is quite difficult. I'm not as focused on the space as I once was, so I'm not sure if the problem has been well solved yet. If not, I'd say it's a worthy area to invest in a solution.

In R you can read data from tables like this:

    df <- htmltab::htmltab("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan", 3)

In Google Sheets:

    =IMPORTHTML("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan", "table", 3)

In Python + pandas:

    import pandas
    df = pandas.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', header=0)[0]

In the problem space of “reading data from tables in a manner that can be easily integrated and scaled within a broader semantic processing system”... I would assume that “reading data from tables” isn’t the hard part.

You would assume correctly. The core issue is that one can't interpret meaning from a table and its values from semantics alone. A table's layout conveys a great deal of meaning.
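
To make the layout point concrete, here is a minimal sketch using only the Python standard library. The table, the `GridParser` class, and the `header_paths` helper are all hypothetical, made up for illustration. The "Torque" header spans two columns, so the cells "Min" and "Max" only make sense once the span is expanded and the stacked headers are joined.

```python
from html.parser import HTMLParser

# Hypothetical table: "Torque" spans two columns, so the meaning of
# "Min" and "Max" depends on layout, not on the cell text alone.
HTML = """
<table>
  <tr><th>Part</th><th colspan="2">Torque</th></tr>
  <tr><th></th><th>Min</th><th>Max</th></tr>
  <tr><td>Bolt A</td><td>10</td><td>12</td></tr>
</table>
"""


class GridParser(HTMLParser):
    """Collect table rows, repeating each cell across its colspan."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._span = 1
        self._text = None  # None while outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._span = int(dict(attrs).get("colspan", 1))
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            cell = "".join(self._text).strip()
            self.rows[-1].extend([cell] * self._span)
            self._text = None


def header_paths(rows, n_header_rows):
    """Join the stacked header cells above each column into one label."""
    columns = zip(*rows[:n_header_rows])
    return [" / ".join(part for part in col if part) for col in columns]


parser = GridParser()
parser.feed(HTML)
# Assumption: we already know the first two rows are headers.
headers = header_paths(parser.rows, n_header_rows=2)
records = [dict(zip(headers, row)) for row in parser.rows[2:]]
print(records)
```

Note that `n_header_rows=2` is supplied by hand here. Working out which rows and columns are headers for an arbitrary table is arguably the hard part, and this sketch ignores `rowspan` and nested tables entirely.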

I remember looking at a couple of systems that would try to do a visual-based zonal tagging of a table, but I think the challenge there was how to logically integrate the zonal tagging into the broader semantic processing of the surrounding text.
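
As a rough illustration of what zonal tagging might feed into a text pipeline, the sketch below labels cells by position and then flattens each body cell into a (row label, column label, value) triple. This is not how the systems mentioned above worked. The positional rule is a simple stand-in for visual tagging, and `tag_zones`, `to_triples`, and the sample grid are hypothetical.

```python
def tag_zones(grid, n_header_rows=1, n_stub_cols=1):
    """Label every cell as corner, column_header, row_header, or body."""
    zones = []
    for r, row in enumerate(grid):
        zone_row = []
        for c, _ in enumerate(row):
            if r < n_header_rows and c < n_stub_cols:
                zone_row.append("corner")
            elif r < n_header_rows:
                zone_row.append("column_header")
            elif c < n_stub_cols:
                zone_row.append("row_header")
            else:
                zone_row.append("body")
        zones.append(zone_row)
    return zones


def to_triples(grid, zones):
    """Turn each body cell into (row label, column label, value)."""
    triples = []
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if zones[r][c] != "body":
                continue
            row_label = " ".join(
                grid[r][k] for k in range(len(row)) if zones[r][k] == "row_header"
            )
            col_label = " ".join(
                grid[k][c] for k in range(len(grid)) if zones[k][c] == "column_header"
            )
            triples.append((row_label, col_label, value))
    return triples


# Hypothetical grid, already expanded into a rectangle of strings.
GRID = [
    ["Part", "Min torque", "Max torque"],
    ["Bolt A", "10", "12"],
    ["Bolt B", "8", "9"],
]

zones = tag_zones(GRID)
for triple in to_triples(GRID, zones):
    print(triple)
```

Triples like these are easy to hand to a downstream system, but they still carry no link to the sentences around the table, which is the integration problem described above.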

Not being able to construe information from tables is a huge stumbling block for semantic and NLP systems in a large number of use cases that involve technical content. Automating patent research is one I looked at 6 or 7 years ago, and tables tanked the concept. Semantic search over digitized maintenance manuals is another use case I've wrestled with, and it's a tough nut to crack if the underlying manuals aren't available in a structured schema.