| HN Mirror


	by manlio 4088 days ago
	> I've pretty much determined that some degree of manual review will be needed

	You're spot on with everything. I did a lot of manual review, and the site already filters out "NO REMOTE", "REMOTE no", "Remote not" and "No Remote" entries. I did spot the "Remote work isn’t an option" post, but I decided not to write that kind of completely ad-hoc filtering rule; it's just ugly.
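For illustration, the four phrasings listed above could be caught with a single case-insensitive regex. This is only a sketch: the pattern and the `says_no_remote` name are assumptions, not the site's actual rule.

```python
import re

# Assumed reconstruction of the filter: covers "NO REMOTE", "REMOTE no",
# "Remote not" and "No Remote" in any letter case.
NO_REMOTE = re.compile(r"\b(?:no\s+remote|remote\s+(?:no|not))\b", re.IGNORECASE)

def says_no_remote(post: str) -> bool:
    """Return True if the post contains one of the explicit 'no remote' phrasings."""
    return NO_REMOTE.search(post) is not None

print(says_no_remote("Acme | Berlin | ONSITE | NO REMOTE"))  # True
print(says_no_remote("Remote work isn’t an option"))         # False
```

The second call shows the kind of post that slips through: nothing in the sentence matches a fixed phrase, so catching it would need its own ad-hoc rule.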

4 comments

louthy 4088 days ago

You could break the text up into sentences [1] and do sentiment analysis [2] on the sentences with 'remote' in. Then flag based on that.

[1] https://opennlp.apache.org/documentation/1.5.3/manual/opennl...

[2] http://nlp.stanford.edu/sentiment/
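A rough sketch of that pipeline, using only the standard library. The regex splitter is a naive stand-in for a real sentence detector such as OpenNLP, and `score` is a hypothetical callable standing in for whatever sentiment model gets plugged in (assumed here to return a negative number for negative sentiment).

```python
import re
from typing import Callable, List

def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def remote_sentences(text: str) -> List[str]:
    """Keep only the sentences that mention 'remote'."""
    return [s for s in split_sentences(text)
            if re.search(r"\bremote\b", s, re.IGNORECASE)]

def flag_post(text: str, score: Callable[[str], float], threshold: float = 0.0) -> bool:
    """Flag the post if any 'remote' sentence scores below the threshold.

    `score` is a placeholder for a real sentiment model, not a library call.
    """
    return any(score(s) < threshold for s in remote_sentences(text))

post = "We are hiring in Berlin. Remote work isn't an option. Apply at jobs@example.com!"
print(remote_sentences(post))  # ["Remote work isn't an option."]
```

Only the sentences returned by `remote_sentences` would need to go through the (much slower) sentiment step.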

bbcbasic 4088 days ago

Wikify it.

Let users log in and change the remote/non-remote status (and other attributes).

Have some kind of trust system (could be linked to HN points or whatever).

(Even better if the YC guys made a custom job board where you fill in a form with all the details so there is no inconsistency.)

davidw 4088 days ago

Or you could hire people to do it via oDesk or Mechanical Turk. Not so interesting technically, but it's a job people are good at.

fudged71 4088 days ago

Hire people for cheap to help people be hired for $$$, with no reward for the upsell. Brilliant! :)

nl 4088 days ago

Sentiment analysis probably isn't the right option here, though it may work.

I think a combination of dependency parsing[1] and regex is the way to go.

regex examples: "Remote: No", "No remote please"

Dependency parsing examples: "Remote work isn’t an option", "Remote work will not be considered"

[1] look for negation in the parse tree using something like http://demo.ark.cs.cmu.edu/parse?sentence=Remote%20work%20is...
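To make the idea concrete without pulling in a parser, here is a crude stand-in: it looks for a negation word within a few tokens of "remote". This is explicitly not dependency parsing, just a token-window heuristic, and the `remote_is_negated` name and window size are assumptions for the sketch.

```python
import re

NEGATION_WORDS = {"no", "not", "never", "cannot"}

def tokenize(sentence: str):
    # Normalise the typographic apostrophe so "isn’t" and "isn't" match alike.
    return re.findall(r"[a-z']+", sentence.lower().replace("’", "'"))

def remote_is_negated(sentence: str, window: int = 4) -> bool:
    """True if a negation word appears within `window` tokens of 'remote'."""
    tokens = tokenize(sentence)
    for i, tok in enumerate(tokens):
        if tok == "remote":
            nearby = tokens[max(0, i - window): i + window + 1]
            if any(t in NEGATION_WORDS or t.endswith("n't") for t in nearby):
                return True
    return False

print(remote_is_negated("Remote work isn’t an option"))         # True
print(remote_is_negated("Remote work will not be considered"))  # True
print(remote_is_negated("Remote OK, no visa sponsorship"))      # True, but wrong
```

The last call is why the parse tree matters: a window cannot tell that "no" attaches to "visa sponsorship" rather than to "remote", whereas a dependency parse can.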

WalterGR 4088 days ago

Sentence segmentation and sentiment analysis may be overkill.

N-grams + Naive Bayes is potentially Good Enough.
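A minimal sketch of what that could look like: unigrams plus bigrams fed into a multinomial Naive Bayes with Laplace smoothing, standard library only. The class, the labels, and the four training posts are all made up for illustration; a real run would train on hand-labelled posts from the hiring threads.

```python
import math
import re
from collections import Counter

def ngrams(text, n_max=2):
    """Lowercased unigrams and bigrams."""
    words = re.findall(r"[a-z']+", text.lower().replace("’", "'"))
    grams = list(words)
    for n in range(2, n_max + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams

class NaiveBayes:
    def __init__(self):
        self.counts = {}       # label -> Counter of n-grams
        self.docs = Counter()  # label -> number of training posts
        self.vocab = set()

    def train(self, text, label):
        grams = ngrams(text)
        self.counts.setdefault(label, Counter()).update(grams)
        self.docs[label] += 1
        self.vocab.update(grams)

    def predict(self, text):
        total_docs = sum(self.docs.values())
        best, best_score = None, -math.inf
        for label, counter in self.counts.items():
            score = math.log(self.docs[label] / total_docs)  # class prior
            denom = sum(counter.values()) + len(self.vocab)
            for g in ngrams(text):
                score += math.log((counter[g] + 1) / denom)  # Laplace smoothing
            if score > best_score:
                best, best_score = label, score
        return best

# Made-up training posts, only to show the shape of the data.
nb = NaiveBayes()
nb.train("Remote work is not an option", "onsite")
nb.train("No remote, onsite only in London", "onsite")
nb.train("Remote OK, we are a distributed team", "remote")
nb.train("Fully remote, work from anywhere", "remote")

print(nb.predict("Remote is not an option for this role"))  # onsite
```

The bigrams are what carry the signal here: "is not", "not an" and "an option" only ever appear in the onsite examples, while the unigram "remote" appears in both classes.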

manlio 4088 days ago

All these strategies are interesting, but I'm afraid we are over-engineering the problem here. The pretty simplistic strategy I'm using now is basically just pattern matching, and so far I've had only 4 misplaced posts out of the 840 for April alone: that is < 0.5%. And it's blazing fast! I can rebuild the entire db in less than 30 seconds.

Given these numbers I believe pretty much anything more complicated than that would be total overkill... Good food for thought though!
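For reference, the arithmetic behind the "< 0.5%" figure, using only the two numbers quoted above (the variable names are just for the sketch):

```python
misplaced, total = 4, 840  # misplaced posts and total posts quoted for April
error_rate = misplaced / total
print(f"{error_rate:.2%}")  # 0.48%
```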

jaggederest 4088 days ago

I just manually curate in these cases. HN hiring threads don't ever exceed a level where 0.5% manual review would be onerous.

Someone 4088 days ago

I think you will need 100% manual review to find those 0.5%

bmh100 4087 days ago

In my experience with data quality management, manual translation of these edge cases is not pleasant. Yet it can be very valuable. It's a bit like "online learning" in machine learning - each time an error is found, you provide the correct answer. Yes, you might end up with a long array of phrases/regexes to check against. However, it scales just right for the amount of data you have and provides high quality results.
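One way to picture that workflow: a list of patterns that grows by one entry each time a reviewer finds a misclassified post. This is a sketch under assumptions; `RuleList`, its method names, and the starting patterns are hypothetical.

```python
import re

class RuleList:
    """A growing list of 'no remote' patterns, extended whenever a miss is found."""

    def __init__(self, patterns):
        self.rules = [re.compile(p, re.IGNORECASE) for p in patterns]

    def is_no_remote(self, post: str) -> bool:
        return any(r.search(post) for r in self.rules)

    def correct(self, pattern: str) -> None:
        """Record a reviewer's fix by adding a pattern for the missed phrasing."""
        self.rules.append(re.compile(pattern, re.IGNORECASE))

rules = RuleList([r"\bno\s+remote\b", r"\bremote\s+not?\b"])
post = "Remote work isn’t an option"

print(rules.is_no_remote(post))  # False: the edge case is missed
rules.correct(r"\bremote\s+work\s+isn[’']t\s+an\s+option\b")
print(rules.is_no_remote(post))  # True after the correction
```

Each correction is permanent, so the same phrasing never needs manual review twice.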

khoury 4088 days ago

> REMOTE no

"REMOTE no problem!" :) Just kidding. Great job.
