Hacker News
by psimm 1825 days ago
I'm in the market for a tool like this. At the moment I'm using Prodigy, but I'm interested in other options. Features that I (or rather my employer) would be willing to pay for:

  1 team functionality with multiple user accounts

  2 easy to use workflow for double annotation where each text is annotated by exactly two annotators. The software should make sure that a text is never shown to more than 2 annotators and never shown to the same annotator twice

  3 make it easy to review the 2 versions and resolve conflicts

  4 a smarter alternative to manual review would be a warning system that identifies annotations that may contain errors (because a model trained on the other data predicts a different result) and automatically flags them for review by another annotator

  5 stats on the annotators: speed, accuracy, statistics on how frequently they assign different labels to detect potential misunderstandings of the annotation schema

  6 GUI with overview of all annotation datasets, with stats like % finished annotating (with stages for double annotation and review), the types of annotation done, frequencies of labels to detect imbalances

  7 functions to mass-edit the annotations, like renaming or removing an entity type

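To make point 2 concrete, here is a rough Python sketch of the assignment rule I have in mind. The class `DoubleAnnotationQueue` and its method names are made up for illustration; they are not part of Prodigy or any other tool, and a real implementation would also need persistence and handling for skipped or abandoned texts.

```python
from collections import defaultdict


class DoubleAnnotationQueue:
    """Hypothetical queue: each text goes to at most `per_text` distinct annotators."""

    def __init__(self, text_ids, per_text=2):
        self.text_ids = list(text_ids)
        self.per_text = per_text
        # text_id -> set of annotators who have already been served this text
        self.seen_by = defaultdict(set)

    def next_for(self, annotator):
        """Return the next text for this annotator, or None if nothing is left."""
        for text_id in self.text_ids:
            viewers = self.seen_by[text_id]
            # Skip texts this annotator has seen and texts that are already full.
            if annotator not in viewers and len(viewers) < self.per_text:
                viewers.add(annotator)
                return text_id
        return None

    def fully_annotated(self):
        """Texts that have reached the required number of annotators."""
        return [t for t in self.text_ids if len(self.seen_by[t]) == self.per_text]
```

Calling `next_for("ann")` repeatedly walks through the texts once for that annotator; a third annotator is only offered texts that fewer than two people have seen.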
Another thing I'd be interested in is some integration with a third party annotation provider. There are companies that offer annotation as a service and it's also available on Google Cloud and AWS. Having that integrated into an annotation tool would make it very easy to get large amounts of well annotated training material.

But finally, and much more importantly: The workflow for annotators has to be perfected first, so they can work as efficiently and consistently as possible. Getting this right is more important to me than any of the other features I listed.


I appreciate the insight, that's super helpful.

> team functionality with multiple user accounts

Mind if I ask what sort of team features you make use of with Prodigy? Are there any aspects you feel are lacking? Initial thoughts are that it'd be helpful for teams to be able to set group annotation goals, share docs / annotations / configs, view ongoing sessions, assign annotators to sessions, and view stats on each annotator (as per point 5).

> The software should make sure that a text is never shown to more than 2 annotators and never shown to the same annotator twice

For this I plan to let teams set two thresholds: the number of documents that should overlap and the number of annotators each text should be shown to. In some situations it could be useful to show a percentage of the documents to every annotator, to help determine the inter-annotator agreement across the entire team.
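As a rough illustration of the team-wide overlap idea, assuming a simple batch split rather than a live feed: the function name `split_with_overlap` and its parameters are hypothetical and only meant to show the shape of the setting, not how the tool actually does it.

```python
import random


def split_with_overlap(doc_ids, annotators, overlap_fraction, seed=0):
    """Give a shared sample to every annotator and spread the rest round-robin.

    Returns (assignments, shared), where assignments maps annotator -> doc list.
    """
    docs = list(doc_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(docs)

    n_shared = round(len(docs) * overlap_fraction)
    shared, rest = docs[:n_shared], docs[n_shared:]

    # Every annotator starts with the shared sample used to measure agreement.
    assignments = {name: list(shared) for name in annotators}
    # The remaining documents are each shown to exactly one annotator.
    for i, doc in enumerate(rest):
        assignments[annotators[i % len(annotators)]].append(doc)
    return assignments, shared
```

With 10 documents, two annotators and `overlap_fraction=0.2`, each annotator gets the 2 shared documents plus 4 of their own.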

> The workflow for annotators has to be perfected first

Totally agree. My biggest concern is building out the above on top of an inefficient workflow. That's one of the primary driving forces behind the current rewrite of the tool.

Love the smart flagging, mass-edit, and integrated provider ideas!

I use these team features in Prodigy: I start annotation sessions with different session_id values and with the feed_overlap setting. I run Prodigy on an EC2 instance that annotators connect to.

The Prodigy team is working on a new version called Prodigy Scale with more team features. I'm looking forward to that release! For now it feels like a hack to use Prodigy in a team.

Inter-annotator agreement is key! You could consider making that highly visible in your tool. It's something that every team should measure and strive to maximize.
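For two annotators, one common way to put a number on this is Cohen's kappa, which corrects raw agreement for the agreement you'd expect by chance. A minimal sketch, assuming one label per text and both label lists in the same text order (the function name is mine, and this is just one possible metric):

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same texts in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)
    # Observed agreement: share of texts where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of both annotators' label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A value of 1 means perfect agreement and 0 means no better than chance. With more than two annotators per text you'd need a different statistic, such as Fleiss' kappa or Krippendorff's alpha.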

For developers who use spaCy in production (like me), I imagine it would be very hard for your tool to come out ahead of Prodigy. But there could be an opportunity with price-sensitive hobby users or devs who use a different NLP library.