by tiburon 2346 days ago
@PaulHoule I saw in the conclusion of your research that you point toward classifying the content of the web pages behind the links, so I guess you are working on it. I think the classifier would improve a lot if it had more content to analyze.

Check my profile and send me an email and I'd be glad to talk more.

Here is the progress I've made since then.

After I did that project I spent a year working on text analysis tools for somebody else. Then, while looking for a new job, I wrote a new version of that software to scrape thousands of job listings and do a similar classification based on the whole text of each listing, which is usually a few paragraphs.

That software has a much better user interface for adding labels than the old software, and it's designed to handle "workflow" tasks that have some human and some automated elements.
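
To make the "some human and some automated elements" idea concrete, here is a minimal sketch of one common way to split the work. It is not the code from that project: the `Item` class, the `triage` function, the 0.9 threshold, and the `toy_predict` stand-in are all hypothetical names chosen for illustration. Confident predictions are labeled automatically and everything else goes to a human review queue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    text: str
    label: Optional[str] = None
    source: Optional[str] = None  # "model" or "human"


def triage(items, predict, threshold=0.9):
    """Auto-label confident predictions; queue the rest for a human."""
    auto, review = [], []
    for item in items:
        label, confidence = predict(item.text)
        if confidence >= threshold:
            item.label, item.source = label, "model"
            auto.append(item)
        else:
            review.append(item)  # left unlabeled for manual review
    return auto, review


def toy_predict(text):
    # Hypothetical stand-in for a trained classifier.
    if "python" in text.lower():
        return "relevant", 0.95
    return "irrelevant", 0.6


auto, review = triage(
    [Item("Senior Python developer"), Item("Warehouse associate")],
    toy_predict,
)
```

Labels a human adds from the review queue can be fed back in as training data, which is where a good labeling interface pays off.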

If I do more work in this area I will probably build on that code. Personally I think the framework for getting training data and putting the model to work is more important than the model itself. (That said, with a good document embedding I think you could get good results with less training data.)
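
A rough sketch of why an embedding can reduce the amount of training data needed: once each document is a vector, even a nearest-centroid classifier works from a handful of labeled examples per class. Everything here is an assumption for illustration. In particular, `embed` is only a stand-in that uses normalized word counts; a real document embedding would replace it, and the example listings are made up.

```python
import math
from collections import Counter, defaultdict


def embed(text):
    # Stand-in for a real document embedding: L2-normalized word counts.
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def train_centroids(examples):
    """Average the vectors of the labeled examples for each class."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter()
    for text, label in examples:
        counts[label] += 1
        for word, value in embed(text).items():
            sums[label][word] += value
    return {
        label: {w: v / counts[label] for w, v in vec.items()}
        for label, vec in sums.items()
    }


def classify(text, centroids):
    """Pick the class whose centroid has the largest dot product with the text."""
    vec = embed(text)

    def score(label):
        centroid = centroids[label]
        return sum(v * centroid.get(w, 0.0) for w, v in vec.items())

    return max(centroids, key=score)


# Made-up training examples, two per class.
examples = [
    ("python backend developer", "software"),
    ("java software engineer", "software"),
    ("registered nurse hospital", "health"),
    ("clinic nurse practitioner", "health"),
]
centroids = train_centroids(examples)
prediction = classify("senior python engineer", centroids)
```

With word counts, the classifier only matches on shared words. A learned embedding would also place listings with different wording but similar meaning near each other, which is what would let a small labeled set go further.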