by theYipster 2095 days ago
Having worked with the "original" Watson, I saw firsthand how the system stumbled on a particularly stupid but hard problem as it tried to scale.

In 2014, I saw a demo of the original Discovery Advisor, which was at the time the closest commercial equivalent to the "Jeopardy system." This demo took in Wikipedia as a corpus, and a question was asked: "what country produced the greatest amount of wheat in 2012?" The system returned a list of countries as answers, so it wasn't quite nonsensical, but it was clear the answers were incorrect. The answers were countries like "England," "Norway," or "Zimbabwe." This system also returned passages from Wikipedia as supporting evidence, but the passages weren't about wheat production. Instead, they were about quotes that contained the word wheat... such as "let's cut the wheat from the chaff."

So of course, some smart aleck in the room Googled the same question. This was before Google had the ability to return factual answers to factual questions, so instead we got a list of web results. The top result, interestingly, was a Wikipedia article titled "Wheat Production by Country." Opening that article presented a table that clearly showed that China produced the greatest amount of wheat in 2012.

Unfortunately, that Watson system at the time didn't read information from tables. I'm not sure if it does now, but I do know that reading data from tables in a manner that can be easily integrated and scaled within a broader semantic processing system is quite difficult. I'm not as focused on the space as I once was, so I'm not sure if the problem has been well solved yet. If not, I'd say it's a worthy area to invest in a solution.
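To illustrate why the table mattered: once the rows are available as structured records, the wheat question reduces to a filter and a max. This is only a minimal sketch. The `rows` list, its field names, and the `top_producer` helper are hypothetical, and the country names and tonnages are placeholders, not real statistics.

```python
# Hypothetical rows standing in for a "wheat production by country" table.
# Country names and tonnages are placeholders, not real statistics.
rows = [
    {"country": "Country A", "year": 2012, "tonnes": 120},
    {"country": "Country B", "year": 2012, "tonnes": 95},
    {"country": "Country A", "year": 2011, "tonnes": 130},
    {"country": "Country C", "year": 2012, "tonnes": 60},
]


def top_producer(rows, year):
    """Return the country with the largest 'tonnes' value for the given year."""
    in_year = [r for r in rows if r["year"] == year]
    if not in_year:
        return None  # no data for that year
    return max(in_year, key=lambda r: r["tonnes"])["country"]


answer = top_producer(rows, 2012)
print(answer)
```

The hard part described above is everything before this step: getting from free-text question and raw table to rows and a query like this one.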

7 comments

> I do know that reading data from tables in a manner that can be easily integrated and scaled within a broader semantic processing system is quite difficult. I'm not as focused on the space as I once was, so I'm not sure if the problem has been well solved yet.

I saw a presentation on this paper at SIGKDD this year. https://dl.acm.org/doi/10.1145/3394486.3406468 "Multi-modal Information Extraction from Text, Semi-structured, and Tabular Data on the Web"

This isn't a solved problem but is something that is very actively researched.

Google's TAPAS system deals with natural language queries on tabular data:

https://ai.googleblog.com/2020/04/using-neural-networks-to-f...

There are other strands of research too - just finding which tables are relevant to a query is a real problem.
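As a rough idea of what "finding relevant tables" means at its simplest, here is a naive baseline that scores each table by token overlap between the query and the table's caption plus column headers. It is a sketch of a keyword baseline only, not how TAPAS or the SIGKDD paper approach it. The `tables` list and both function names are made up for illustration.

```python
import re


def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_tables(query, tables):
    """Rank tables by Jaccard overlap between query and caption+header tokens."""
    q = tokens(query)
    scored = []
    for t in tables:
        d = tokens(t["caption"] + " " + " ".join(t["headers"]))
        union = q | d
        score = len(q & d) / len(union) if union else 0.0
        scored.append((score, t["caption"]))
    # Highest score first.
    return sorted(scored, reverse=True)


# Hypothetical table metadata; only captions and headers are used.
tables = [
    {"caption": "Postal codes of Canada: M",
     "headers": ["Postal code", "Borough", "Neighbourhood"]},
    {"caption": "Wheat production by country",
     "headers": ["Country", "2011", "2012"]},
]

ranked = rank_tables("what country produced the greatest amount of wheat in 2012", tables)
for score, caption in ranked:
    print(f"{score:.2f}  {caption}")
```

A baseline like this breaks quickly when captions are missing or headers are abbreviations, which is part of why table retrieval is still a research problem.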

On the topic of Watson... I just really want Chef Watson back.
WolframAlpha actually has this data but doesn't understand the question:

what country produced the greatest amount of wheat in 2012

If you ask the suggested query instead:

country produced most wheat

you do get the table you talk about.

Wikipedia just released an API for tables. That should help?
Do you have a link to that? Google is failing me.
I agree.
> I'm not sure if it does now, but I do know that reading data from tables in a manner that can be easily integrated and scaled within a broader semantic processing system is quite difficult. I'm not as focused on the space as I once was, so I'm not sure if the problem has been well solved yet. If not, I'd say it's a worthy area to invest in a solution.

In R you can read data from tables like this:

    df <- htmltab::htmltab("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan", 3)

In Google Sheets:

    =ImportHtml("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan", "table", 3)

In Python with pandas:

    import pandas
    df = pandas.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', header=0)[0]

In the problem space of “reading data from tables in a manner that can be easily integrated and scaled within a broader semantic processing system”... I would assume that “reading data from tables” isn’t the hard part.
You would assume correctly. The core issue is that one can't interpret meaning from a table and its values from semantics alone. A table's layout conveys a great deal of meaning.
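A small sketch of the point about layout: a bare cell value such as "120" means nothing until it is paired with its row label and column header. The snippet below attaches that context to every cell, assuming the simplest possible layout (one header row, one label column). The function name and the sample table are hypothetical, and the values are placeholders.

```python
def cells_with_context(header, rows, label_col=0):
    """Return (row_label, column_header, value) for every non-label cell."""
    facts = []
    for row in rows:
        label = row[label_col]
        for i, value in enumerate(row):
            if i == label_col:
                continue  # the label column names the row; it is not a fact itself
            facts.append((label, header[i], value))
    return facts


# Placeholder table: one header row, first column holds the row labels.
header = ["Country", "2011", "2012"]
rows = [
    ["Country A", "130", "120"],
    ["Country B", "90", "95"],
]

facts = cells_with_context(header, rows)
for label, column, value in facts:
    print(f"{label} | {column} | {value}")
```

Real tables with merged cells, nested headers, footnotes, or units in a caption break the one-header-row assumption, which is where the layout carries meaning that this kind of flattening loses.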

I remember looking at a couple of systems that would try to do a visual-based zonal tagging of a table, but I think the challenge there was how to logically integrate the zonal tagging into the broader semantic processing of the surrounding text.

Not being able to construe information from tables is a huge stumbling block for semantic and NLP systems in a large number of use cases that incorporate technical content. Automating patent research is one I looked at 6 or 7 years ago, and tables tanked the concept. Semantic search over digitized maintenance manuals is another use case I've wrestled with, and it's a tough nut to crack if the underlying manuals aren't available in a structured schema.