by brendano 4258 days ago
Part of it is genuine differences between online conversational language and standard written English, like emoticons, Twitter-specific discourse markers, and hard-to-segment compounds or clitic constructions (see the Gimpel et al. 2011 and Owoputi et al. 2013 papers linked on the page, and the annotation guidelines document). Part of it is just that it's easier for humans to annotate with a coarse-grained POS tagset, and we didn't have many resources for annotation when we did it.

These things also intersect. For example, you'd have to figure out how dialectal English verbal auxiliaries like "finna", or the second word (or so) inside "imma", map to PTB tags. It's possible, but it takes more work: you have to think through the descriptive linguistics and what you want to use the tags for. Someday I'd like to update the whole thing to a more PTB-like POS tagset, if it can be done well. Chris Manning's whitepaper on issues in PTB POS data convinced us (well, it convinced me, at least) that it might be a good idea to focus on making high-quality tag annotations. (http://nlp.stanford.edu/pubs/CICLing2011-manning-tagging.pdf)