by manlio 4088 days ago
> I've pretty much determined that some degree of manual review will be needed

You're spot on with everything. I did a lot of manual review, and the site already filters out "NO REMOTE", "REMOTE no", "Remote not" and "No Remote" entries. I did spot the "Remote work isn’t an option" post, but I decided I'm not going to write that kind of completely ad hoc filtering rule; it's just ugly.


You could break the text up into sentences [1] and do sentiment analysis [2] on the sentences that contain 'remote'. Then flag based on that.

[1] https://opennlp.apache.org/documentation/1.5.3/manual/opennl...

[2] http://nlp.stanford.edu/sentiment/
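
A minimal sketch of the first half of that idea, assuming a naive regex sentence splitter stands in for a real sentence detector such as OpenNLP. The function names and the sample post are made up for illustration, and the sentiment step is left out: each returned sentence would be handed to whatever classifier you choose.

```python
import re

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    # A real sentence detector handles abbreviations, URLs, etc. far better.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def remote_sentences(text):
    # Keep only the sentences that mention "remote" as a whole word.
    return [s for s in split_sentences(text)
            if re.search(r"\bremote\b", s, re.IGNORECASE)]

# Hypothetical job post, not taken from a real thread.
post = "We are hiring in Berlin. Remote work isn't an option. Visa sponsorship available!"
candidates = remote_sentences(post)
```

Here `candidates` holds the single sentence that mentions remote work, which is the only text the later classification step needs to look at.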

Wikify it.

Let users log in and change the remote/non-remote status (and other attributes).

Have some kind of trust system (could be linked to HN points or whatever).

(Even better if the YC guys made a custom job board where you fill in a form with all the details so there is no inconsistency.)

Or you could hire people to do it via oDesk or Mechanical Turk. Not so interesting technically, but it's a job people are good at.
Hire people for cheap to help people be hired for $$$, with no reward for the upsell. Brilliant! :)
Sentiment analysis probably isn't the right option here, though it may work.

I think a combination of dependency parsing[1] and regex is the way to go.

regex examples: "Remote: No", "No remote please"

Dependency parsing examples: "Remote work isn’t an option", "Remote work will not be considered"

[1] look for negation in the parse tree using something like http://demo.ark.cs.cmu.edu/parse?sentence=Remote%20work%20is...
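
A rough sketch of the combination, with one big caveat: it does not do dependency parsing. As a stand-in, it looks for a negation word within a few tokens after "remote", which is a much cruder heuristic than walking a parse tree. The pattern list, the negation set, and the window size are all assumptions chosen to cover the four examples above.

```python
import re

# Explicit forms that a plain regex handles well.
EXPLICIT_NO = [
    re.compile(r"\bremote\s*:\s*no\b", re.IGNORECASE),  # "Remote: No"
    re.compile(r"\bno\s+remote\b", re.IGNORECASE),      # "No remote please"
]

# Assumed negation cues; a parser would find these via the tree instead.
NEGATIONS = {"not", "no", "never", "isn't", "won't", "can't", "cannot"}

def negated_nearby(sentence, window=4):
    # Normalize curly apostrophes, then tokenize into lowercase words.
    tokens = re.findall(r"[a-z']+", sentence.lower().replace("’", "'"))
    for i, tok in enumerate(tokens):
        if tok == "remote" and NEGATIONS & set(tokens[i + 1:i + 1 + window]):
            return True
    return False

def says_no_remote(sentence):
    if any(p.search(sentence) for p in EXPLICIT_NO):
        return True
    return negated_nearby(sentence)
```

The token window catches "Remote work isn’t an option" and "Remote work will not be considered", but it also misfires on something like "REMOTE no problem!", which is the kind of case where real syntactic structure would be needed.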

Sentence segmentation and sentiment analysis may be overkill.

N-grams + Naive Bayes is potentially Good Enough.
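
For a sense of how little code that takes, here is a toy multinomial Naive Bayes over unigrams and bigrams with add-one smoothing, in plain Python. The training phrases and the two labels are invented for illustration; a real version would train on hand-labeled posts.

```python
import math
import re
from collections import Counter

def ngrams(text, n_max=2):
    # Unigrams plus bigrams (for n_max=2) as features.
    tokens = re.findall(r"[a-z']+", text.lower())
    feats = list(tokens)
    for n in range(2, n_max + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats

class NaiveBayes:
    def __init__(self):
        self.counts = {}       # label -> Counter of feature counts
        self.docs = Counter()  # label -> number of training examples
        self.vocab = set()

    def fit(self, examples):
        for text, label in examples:
            feats = ngrams(text)
            self.counts.setdefault(label, Counter()).update(feats)
            self.docs[label] += 1
            self.vocab.update(feats)
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        best_label, best_score = None, -math.inf
        for label, feat_counts in self.counts.items():
            score = math.log(self.docs[label] / total_docs)  # log prior
            denom = sum(feat_counts.values()) + len(self.vocab)
            for f in ngrams(text):
                # Add-one smoothing; unseen features count as zero.
                score += math.log((feat_counts[f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented training data, for illustration only.
TRAIN = [
    ("Remote: No", "onsite"),
    ("No remote please", "onsite"),
    ("Remote work will not be considered", "onsite"),
    ("Onsite only in San Francisco", "onsite"),
    ("Remote OK", "remote"),
    ("Remote friendly team", "remote"),
    ("Fully remote position", "remote"),
    ("Remote welcome worldwide", "remote"),
]

model = NaiveBayes().fit(TRAIN)
```

With this toy data, `model.predict("No remote")` comes out as `"onsite"` and `model.predict("Remote OK, US timezones")` as `"remote"`. Eight examples prove nothing about accuracy on real posts, of course.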

All these strategies are interesting, but I'm afraid we are over-engineering the problem here. The pretty simplistic strategy I'm using now is basically just pattern matching, and so far I've had only 4 misplaced posts out of the 840 for April alone: that is < 0.5%. And it's blazing fast! I can rebuild the entire db in less than 30 seconds.

Given these numbers, I believe pretty much anything more complicated than that would be total overkill... Good food for thought though!
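
For reference, a sketch of what that kind of pattern matching might look like, using the four phrases mentioned earlier in the thread. This is not the site's actual code: the function name, the case-insensitive matching, and the "mentions remote at all" check are assumptions.

```python
import re

# The four phrases the thread says are filtered out.
NO_REMOTE_PHRASES = ["NO REMOTE", "REMOTE no", "Remote not", "No Remote"]
# Assumption: match case-insensitively, so the list has some overlap.
NO_REMOTE_RE = re.compile(
    "|".join(re.escape(p) for p in NO_REMOTE_PHRASES), re.IGNORECASE
)

def is_remote_post(text):
    # A post counts as remote if it mentions "remote" and none of the
    # known "no remote" phrases appear in it.
    mentions_remote = re.search(r"\bremote\b", text, re.IGNORECASE) is not None
    return mentions_remote and NO_REMOTE_RE.search(text) is None

# The error rate quoted above: 4 misplaced posts out of 840.
misplaced, total = 4, 840
error_rate = misplaced / total  # roughly 0.0048, i.e. just under 0.5%
```

A post like "Remote work isn’t an option" slips through `is_remote_post` as remote, which is exactly the kind of miss that makes up the 4 out of 840.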

I just manually curate in these cases. HN hiring threads don't ever exceed a level where 0.5% manual review would be onerous.
I think you will need 100% manual review to find that 0.5%.
In my experience with data quality management, manual translation of these edge cases is not pleasant. Yet it can be very valuable. It's a bit like "online learning" in machine learning - each time an error is found, you provide the correct answer. Yes, you might end up with a long array of phrases/regexes to check against. However, it scales just right for the amount of data you have and provides high quality results.
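
A small sketch of that feedback loop, assuming plain substring matching on normalized text. The class and method names are hypothetical; the point is only that each manually found error becomes one more entry in the list.

```python
def normalize(text):
    # Fold case and curly apostrophes so phrases compare consistently.
    return text.lower().replace("’", "'")

class CuratedFilter:
    def __init__(self, phrases):
        self.phrases = [normalize(p) for p in phrases]

    def is_no_remote(self, text):
        t = normalize(text)
        return any(p in t for p in self.phrases)

    def add_phrase(self, phrase):
        # Called whenever manual review finds a misclassified post.
        p = normalize(phrase)
        if p not in self.phrases:
            self.phrases.append(p)

curated = CuratedFilter(["no remote", "remote no", "remote not"])
post = "Remote work isn’t an option"
before = curated.is_no_remote(post)  # the phrase is not known yet
curated.add_phrase("remote work isn't an option")
after = curated.is_no_remote(post)   # now caught
```

The list only ever grows by the number of errors a human actually found, so at a few misses per month it stays easy to maintain.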
> REMOTE no

"REMOTE no problem!" :) Just kidding. Great job.