Hacker News
by augustiner 6307 days ago
The most fundamental problem with natural language search engines is that the "natural language" part is more a limitation than a feature to me. Natural language is meant for people to communicate with other people, not with computers. I believe that a well-designed keyword/tag-based search combined with factual auto-suggestions extracted from formal/semantic sources (similar to Wikipedia) could be far more efficient for people to use and for computers to run.
2 comments

I am in law school and as a law student I spend lots of time searching through past cases. My searching is done almost exclusively through Westlaw, the online database of Thomson West products.

They have many different types of searches, but the two applicable to this discussion are (as they are written on the site) "terms & connectors" and "natural language." T&C works well with the standard AND/OR/etc. connectors. However, natural language works so much better even though you type in the exact same words.

The natural language search returns cases more on point and has one awesome feature: the most relevant text is in red type, set apart from the rest of the case. From the natural language search West is better able to determine what the legal researcher wants and shows it to him.

I spent almost 2 years of law school searching using terms and connectors because I thought the same thing as you do. But I recently converted when I realized West returns better results from their natural language search.
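
For anyone who wants to see the difference between the two search modes in miniature, here is a rough sketch. It is not how Westlaw works internally (that isn't described anywhere in this thread); it only contrasts strict boolean matching with a simple TF-IDF ranking. The corpus, the document ids, and the function names are all made up for illustration.

```python
import math
from collections import Counter

# Toy corpus; ids and texts are invented for illustration only.
DOCS = {
    "case_a": "mother petitions for custody modification after relocation",
    "case_b": "custody award affirmed on appeal",
    "case_c": "contract dispute over software license",
}

def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()

def boolean_and(query, docs):
    """Terms-and-connectors style: ids of docs containing every query term."""
    terms = set(tokenize(query))
    return sorted(d for d, text in docs.items() if terms <= set(tokenize(text)))

def ranked(query, docs):
    """Ranked style: order docs by the summed TF-IDF of the query terms."""
    n = len(docs)
    tokens = {d: Counter(tokenize(t)) for d, t in docs.items()}
    scores = {}
    for d, counts in tokens.items():
        score = 0.0
        for term in tokenize(query):
            # Document frequency: how many docs contain this term.
            df = sum(1 for c in tokens.values() if term in c)
            if df:
                score += counts[term] * math.log((n + 1) / df)
        if score > 0:
            scores[d] = score
    return sorted(scores, key=scores.get, reverse=True)

print(boolean_and("custody modification", DOCS))
print(ranked("custody modification", DOCS))
```

The boolean search only returns documents that contain every term, while the ranked search also surfaces partial matches and orders them by how distinctive the matching terms are. The same words go in, but the result lists differ.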

@micks56 - re your legal search using natural language -

Are you able to compare, say, West's results to those of a pure Google text search on the keyword terms?

[ To do that you'd need some example of large legal texts fully online and thus indexed by Google - I don't know if that exists ]

It's sometimes hard to discern the value of the tech versus the quality of the implementation + usability factors - but your observations are interesting. I wonder how search on medical information compares...

gord.

I am trying to think of a test that can be run on my West search engine and Google. West's legal resources dwarf Google's. Google might as well be considered non-existent in the area compared to West or LexisNexis. That is just what those two companies do. They have people that enter cases into databases as they become available. Google just doesn't do that.

I haven't thought of a fair test to run yet. The two engines do different things. My West can search case decisions, statutes, administrative codes, briefs filed to the court, secondary sources (sort of the research paper of the legal field) and the news.

So I tried doing a search on the news only. I searched "ycombinator" and the results returned are news articles only, whereas on Google someone probably wants and gets the YC home page, this site, or the actual function. None of those show up on the West site.

Then I ran a search of these terms on each (I didn't enter quotes on the actual searches): "massachusetts custody modification"

On Westlaw, I get cases and statutes on point. With extra terms I will easily get to cases that deal with my specific issue. On Google, the first link is a divorce resource site and the rest are for lawyers.

Searching statutes might work. But the main reason statutes search well on Google is the Cornell Law site. The quality of results for statutes is probably a bigger testament to them and their cataloging efforts.

I would say both search engines hit their target markets well. Most people searching "massachusetts custody modification" don't want 20 decisions of the Mass SJC. And people searching the same on Westlaw don't want attorneys. Google is much much faster though. It returns in a fraction of a second. Westlaw took about 12 seconds to return 10,000 hits. First three hits were decided yesterday, which is pretty cool.

There is a group of people creating an open legal database. I can't remember its name. I think they are based in the San Fran area. I think it was started by some hacker that worked on opening up some other government data and is now on the court system. I have the bookmark buried somewhere and of course can't find it. Does anyone know which one I am talking about? We could maybe test that database versus the commercial West one.

thanks for the write up.. interesting to see how things develop in the real world outside your own domain.

I'm surprised the big G hasn't just paid some money to get that data, given their plan to scan all the world's books.

I wonder what percent of all text is legal or medical.

I doubt West, LexisNexis, or any other legal aggregator will sell the information to Google. Those companies make a lot of money selling it to lawyers on a monthly subscription basis. They also do some value-add to the materials. What I see on West or LexisNexis is more than just the publicly available decision. West and Lexis employ lawyers to create summaries and other helpful things for the legal researcher.

There certainly is a lot of legal text. Lawyers certainly are good at creating volumes of paper. For example, the Supreme Court just decided a case, Wyeth v. Levine. It will be recorded in volume 555. So to date the Supreme Court decisions have filled 554 volumes of 1000 pages each. And that is just one court. Every state court, state appeals court, and state supreme court, federal court, land court, etc has similar volumes and page counts.

And all of this is just the primary sources. Once you add secondary sources, aka books and papers written by learned scholars on individual topics or cases, the number of books and pages increases by orders of magnitude. And we still haven't archived any statutes (those go on forever, for each state) or any administrative law. And each one of those has comment sections that go on for pages whereas the actual rule is only a paragraph.

I wonder what percentage this is, too. I bet it is still extremely small compared to what the rest of the world has produced. There are so few law writers when compared to all other writers.

That's a lot of text. The few patents I've read strike me as quite verbose. I was quite amazed at what was patentable, and how loose (ephemeral!) the descriptions were. I'm not suggesting all legal text is as sparse in information.

We could certainly do with a better text search for patents.. but I wonder if that's possible unless a form of restricted prose is used that makes the text less obtuse/verbose.

Maybe an algorithm can reduce the common legal motifs and replace them with shorter versions thus refactoring legal-speak into human-readable prose on which text search can be effective.
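
A very rough sketch of what that kind of rewriting could look like, assuming a hand-built table of legal phrases and their plain equivalents. The table entries and the `simplify` name are hypothetical, and a real system would need far more than string substitution.

```python
import re

# Hypothetical mapping of legal boilerplate to plainer wording.
PLAIN = {
    "in the event that": "if",
    "prior to": "before",
    "a plurality of": "several",
    "pursuant to": "under",
}

# Longest phrases first so a longer match wins over a shorter overlapping one.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in sorted(PLAIN, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

def simplify(text):
    """Replace each known legal phrase with its plain equivalent."""
    return _PATTERN.sub(lambda m: PLAIN[m.group(1).lower()], text)

print(simplify("In the event that the device fails prior to shipment"))
```

Running the normalized text through an ordinary keyword index would then let a query for "before shipment" match documents that originally said "prior to shipment".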

[ For some reason this reminds me of the law student drama series 'the paper chase'. ]

How well is the information hyper-linked? Presumably one paper references many previous rulings, and you'd jump around a lot in researching issues.

Hi micks56, I would like to ask someone who has a grounding in both law and technology some questions not directly related to this discussion but to software for lawyers in general. Your profile doesn't have an email ID. Care to email me at heuristix at gmail or reply back with your email id?
I just sent you an email. I am happy to answer any questions that you or anyone else has.
Thank you! Got the mail. I'll write up my question(s) this evening and email you.
I think I just said what you just said.. but then I read your post.

Is this something you'd enjoy hacking on?