Show HN: Rules-based labelling tool for NLP

	Show HN: Rules-based labelling tool for NLP (github.com)
	55 points by dataqa 1736 days ago

5 comments

dataqa 1736 days ago

Hey HN!

After working in ML for more than a decade, I became frustrated over time with the lack of tools to create baselines using simple rules and heuristics. It is well known that most business problems can achieve decent baselines using only heuristics. So this is why I have just open-sourced DataQA, a rules-based labelling tool for NLP:

  - Quick labelling: You can create complex rules using regular expressions to help you label your text faster.
  - Search engine: DataQA also ships with a search engine (a local Elasticsearch database) so you can search your documents.
  - Easy installation: You only need to install a single Python package!
  - Easy use: Upload your data as CSV files.
  - Privacy: No data ever leaves your computer.
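
To give a feel for the "quick labelling" idea, here is a minimal sketch of regex rules assigning document labels in plain Python. It is an assumption-laden illustration, not DataQA's rule format or API: the `RULES` list, the label names and `label_document` are all hypothetical.

```python
import re

# Hypothetical rules as (label, pattern) pairs; NOT DataQA's own rule format.
RULES = [
    ("complaint", re.compile(r"\b(refund|broken|not working)\b", re.IGNORECASE)),
    ("praise", re.compile(r"\b(love|great|excellent)\b", re.IGNORECASE)),
]


def label_document(text):
    """Return the label of the first matching rule, or None to abstain."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return None


docs = [
    "The charger arrived broken and I want a refund.",
    "Great battery life, I love it.",
    "Delivery took three days.",
]
labels = [label_document(d) for d in docs]
print(labels)
```

Documents that no rule matches are left unlabelled (`None`), which is usually what you want from a heuristic baseline: the rules abstain rather than guess.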

I'm hoping to get some feedback, and I'm open to hearing feature requests or ideas for extensions. I will be around to answer questions.

teruakohatu 1735 days ago

Looks great. I can't try it right now, but looking at the documentation I would suggest an alternative to CSV upload.

For larger documents, CSV can be annoying: both line breaks and commas need to be escaped. Pointing the application to a folder containing a corpus of text files is much easier.
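
A small sketch of the difference, using only the Python standard library. The document text, the `doc-1` identifier and the `read_corpus` helper are made up for illustration and say nothing about how DataQA itself ingests data.

```python
import csv
import io
import tempfile
from pathlib import Path

doc = 'First line, with a comma.\nSecond line with a "quote".'

# CSV route: the field must be quoted so the comma and line break survive.
buffer = io.StringIO()
csv.writer(buffer).writerow(["doc-1", doc])
encoded = buffer.getvalue()
decoded = next(csv.reader(io.StringIO(encoded, newline="")))


# Folder route: one plain-text file per document, no escaping needed.
def read_corpus(folder):
    """Map file name to file contents for every .txt file in the folder."""
    paths = sorted(Path(folder).glob("*.txt"))
    return {p.name: p.read_text(encoding="utf-8") for p in paths}


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "doc-1.txt").write_text(doc, encoding="utf-8")
    corpus = read_corpus(tmp)
```

The CSV round trip only works if whatever produced the file quoted the field correctly; the folder route avoids that step entirely.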

dataqa 1735 days ago

Thanks for the tip! The tool can deal with larger documents, so you're right that using a folder would be better in that case.

steve_g 1734 days ago

It looks like this tool is intended to label _documents_ using rules/heuristics. That seems useful.

My desired use case is to label words or phrases (named entity recognition) - specifically for chemicals. It seems like this tool isn't designed for that. Am I understanding correctly?

dataqa 1734 days ago

There is a tutorial on the site where you use rules to extract mentions of side effects from forum posts: https://dataqa.ai/docs/tutorials/medical_side_effects/ner_me.... You can use this tool for NER.
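
As a rough illustration of what rule-based entity extraction can look like, here is a plain-Python sketch that returns character spans for matched terms. It is not taken from the tutorial and does not use DataQA's API; the term list, the `SIDE_EFFECT` label and `extract_entities` are hypothetical.

```python
import re

# Hypothetical term list for illustration; not from the DataQA tutorial.
SIDE_EFFECT_TERMS = ["nausea", "headache", "dry mouth"]

# Try longer terms first so multi-word terms win over shorter overlaps.
_terms = sorted(SIDE_EFFECT_TERMS, key=len, reverse=True)
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in _terms) + r")\b",
    re.IGNORECASE,
)


def extract_entities(text, label="SIDE_EFFECT"):
    """Return (start, end, matched_text, label) for every rule match."""
    return [(m.start(), m.end(), m.group(0), label) for m in PATTERN.finditer(text)]


post = "Started the new pills on Monday; mild nausea and a dry mouth since."
entities = extract_entities(post)
for start, end, text, label in entities:
    print(start, end, text, label)
```

The same shape would apply to chemical names: swap in a different term list or pattern and keep the span output.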

sbdmmg 1735 days ago

Hi! Interesting project, congrats! How does it compare to https://calmcode.io/human-learn/introduction.html ?

cantdutchthis 1735 days ago

I'm the maintainer of human-learn. While I cannot speak on behalf of the maintainer of DataQA, it does seem like this tool is more specific to the entity detection use-case. I imagine it has better support for tools that deal with text.

Human-Learn, on the other hand, is more focussed on tabular data and the scikit-learn stack. Since scikit-learn doesn't have a convenient pipeline for entity detection, I would certainly recommend exploring other tools than human-learn for this use-case.

I've not used DataQA before, but figured it'd be relevant to share my input.

dataqa 1735 days ago

Thanks for sharing! It looks very interesting. From a brief check, human-learn does not seem to be UI-based like dataqa (although you can use it in a notebook), it does not offer a search engine, and it is probably one level of abstraction below dataqa. You can do some of what dataqa does, but you would need to write code. Some of the rules offered by dataqa rely on complex operations with regular expressions and are not so easy to program yourself.

cantdutchthis 1734 days ago

Oh, yeah, for sure the target audience is python devs with human-learn. There are user-interfac-y things but those are accessed from a Jupyter notebook.

schleck8 1734 days ago

Awesome!
