Hacker News
Ask HN: Built a news stories classification engine ... now what?
6 points by baham 5609 days ago
For the past 2 months, I have been working on a news story classification engine. I believe it has reached a stable stage, and the application can be viewed at http://babeligg.com/. So far I have approached it as a technical challenge. I now have two questions for the HN community:

* Is it possible to run some test suite to independently confirm that the performance is superior?
* How can such a tool/API be monetized?

Thanks.

4 comments

To answer your first question, you should simply benchmark against the average Mechanical Turk worker's ability to classify links. Build a test set from the workers' labels, and every time you update your algorithm, rerun it against that dataset to see whether anything has improved.
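A rough sketch of what that loop could look like, in Python. Everything here is made up for illustration: the URLs, the category names, the worker votes, and the two `classify_*` stand-ins (the real engine's API isn't known to me). The idea is just to take the majority vote per story as the reference label, measure how often an average worker agrees with it, and score each version of the classifier against the same set.

```python
from collections import Counter

def majority_label(votes):
    """Return the label chosen most often by the workers for one story."""
    return Counter(votes).most_common(1)[0][0]

def build_gold_set(worker_votes):
    """Map each URL to the majority label among its worker votes."""
    return {url: majority_label(votes) for url, votes in worker_votes.items()}

def worker_baseline(worker_votes, gold):
    """Fraction of individual worker votes that agree with the majority label."""
    total = sum(len(votes) for votes in worker_votes.values())
    agree = sum(votes.count(gold[url]) for url, votes in worker_votes.items())
    return agree / total

def accuracy(classify, gold):
    """Fraction of gold-set URLs where classify(url) returns the gold label."""
    hits = sum(1 for url, label in gold.items() if classify(url) == label)
    return hits / len(gold)

# Hypothetical data: three worker votes per story (odd count avoids ties).
worker_votes = {
    "http://example.com/a": ["Business", "Business", "Technology"],
    "http://example.com/b": ["Sports", "Sports", "Sports"],
    "http://example.com/c": ["Technology", "Technology", "Business"],
    "http://example.com/d": ["Business", "Business", "Business"],
}

# Hypothetical stand-ins for two versions of the classifier.
def classify_v1(url):
    return "Sports"  # labels everything as sports

V2_PREDICTIONS = {
    "http://example.com/a": "Business",
    "http://example.com/b": "Sports",
    "http://example.com/c": "Business",  # wrong on the technical article
    "http://example.com/d": "Business",
}

def classify_v2(url):
    return V2_PREDICTIONS[url]

gold = build_gold_set(worker_votes)
baseline = worker_baseline(worker_votes, gold)
acc_v1 = accuracy(classify_v1, gold)
acc_v2 = accuracy(classify_v2, gold)

print(f"average worker agreement: {baseline:.2f}")
print(f"v1 accuracy: {acc_v1:.2f}")
print(f"v2 accuracy: {acc_v2:.2f}")
```

Keeping the labeled set fixed between runs is what makes the numbers comparable from one version of the algorithm to the next.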

For the second question, your product would be most valuable as an aid to contextual advertising (what ads should I display?), and it's possible that you could charge per 1K requests. I have a need for this myself, so I would be happy to be a beta-tester.

I haven't looked at it in detail yet. First thought, though: lose the name. It's highly forgettable and ugly on the tongue.

EDIT: For a second I was confused: was I supposed to enter some words or a URL? So I tried random words, hoping it would give me stories classified accordingly. Tried "django" and some others. It classifies everything into sports.

Seems to do a decent job with some random URLs I threw at it. Seems to get startup news into the Business category, which is fine.

Gets technical articles wrong, but those are tough, e.g. http://blog.doughellmann.com/2007/07/pymotw-subprocess.html

Is it possible to run some test suite to independently confirm that the performance is superior?

Just think harder about that question. It raises all sorts of philosophical questions about machine, intelligence, language and meaning.

Of course, there ARE benchmarks, at least one bundled with each classifier tool ;-) but there couldn't be one ideal benchmark, no.

This might give you a start though:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.4...

It would be nice to have some examples ready where it shines.