Hacker News
by tiburon 2346 days ago
@PaulHoule Thanks for sharing your research paper and classifier; it is interesting. I have a question, though, which came up while going through the top 200 articles picked by your algorithm: wouldn't the classifier be more effective if it had the content behind all those shared links instead of just the titles or some metadata? For example, if the web pages were crawled and scraped to feed the classifier in real time, wouldn't that bring more accurate results?

I totally agree about ads in ads. What solution do you think can work at scale for those?

1 comment

@PaulHoule I've seen in the conclusion of your research that you point to classifying on the content of the web pages behind the links, so I guess you are working on it. I think the classifier will improve a lot if it has more content to analyse.
Check my profile and send me an email and I'd be glad to talk more.

Here is the progress I've made since then.

After I did that project I spent a year working on text analysis tools for somebody else. Then I was looking for a new job, and I made a new version of that software to scrape thousands of job listings and do a similar classification based on the whole text of each listing, which is usually a few paragraphs.
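
For illustration only, classifying on the whole text of a listing can be sketched roughly as below. This is a minimal sketch, not the software described in this thread: it assumes a bag-of-words naive Bayes model with add-one smoothing, and the `NaiveBayes` class, the `interested`/`skip` labels, and the sample listings are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # label -> number of training docs
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocab = set()

    def train(self, labeled_docs):
        for text, label in labeled_docs:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        # Words never seen in training carry no information, so skip them.
        words = [w for w in tokenize(text) if w in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for w in words:
                count = self.word_counts[label][w] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical hand-labeled listings; real training data would be the
# full scraped text of each listing, usually a few paragraphs.
TRAINING = [
    ("Senior Python developer building machine learning pipelines "
     "for text analysis", "interested"),
    ("Data engineer to maintain search and natural language "
     "processing services", "interested"),
    ("Retail store manager responsible for staff scheduling and "
     "inventory", "skip"),
    ("Delivery driver needed, clean license and weekend availability "
     "required", "skip"),
]

model = NaiveBayes()
model.train(TRAINING)
prediction = model.predict(
    "Looking for a developer with machine learning and text analysis experience"
)
```

With whole listings instead of titles, each document contributes many more word counts per label, which is the main reason the full text can help a simple model like this one.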

That software has a much better user interface than the old software for adding labels and it's designed to handle "workflow" tasks that have some human and some automated elements.

If I do more work in this area I will probably build on that code. Personally I think the framework for getting training data and putting the model to work is more important than the model itself. (That said, with a good document embedding I think you could get good results with less training data.)