Thanks for the link. I found the regional breakdowns interesting.
That being said, I think this research emphasizes that AAVE is anything but standardized. That's not meant as a pejorative statement: it's just acknowledging that, like most languages in history, AAVE has not gone through a process of codification and standardization to formalize it.
Sure, but formal rules are not equivalent to a prescriptive grammar. AAVE has formal, consistent rules, described by linguists. You can make mistakes in AAVE just as in standard English (see: "African American Vernacular English Is Not Standard English with Mistakes" https://web.stanford.edu/~zwicky/aave-is-not-se-with-mistake... )
> like most languages in history, AAVE has not gone through a process of codification and standardization to formalize it
I'm not sure what you're getting at - prescriptive grammars of the codified form you're describing are the products of their political and economic circumstances. There isn't some Hegelian trajectory of linguistic validity, where all variants aspire towards legalism.
I'm looking for formal rules. I'm not saying AAVE doesn't exist, I'm saying its rules (such as they are) are not formalized. Tons of rule books exist for Standard English and part of learning it is to memorize the correct rules. I don't believe any formal equivalent exists for AAVE.
"Formal rules", in the context you've chosen to speak in, are defined by this upstream comment:
> NLP is not very good with standard English yet and usually doesn't generalize from topic to topic. Dialects and other languages - especially those without formal rules - will come when we can deal with standard English.
The rules you're talking about, that get printed in books and studied, are not linguistic rules. Crucially, this means they are not widely observed in printed standard English, which in turn means they can't be relevant to training a language model to understand printed standard English.
The "formality" you seem to want to talk about has no place in this discussion. It is not relevant to any language. gordonguthrie is correct to point out that the assumption lqdc13 is trying to make is false. You are wrong to contradict him using a meaning of "formal rules" that you brought to the conversation yourself. It had a meaning -- a completely unrelated meaning -- before you showed up.
> Crucially, this means they are not widely observed in printed standard English, which in turn means they can't be relevant to training a language model to understand printed standard English.
I agree that they're not widely observed in written English, but they are consistently observed in the WSJ, which was the origin of this entire debate.
As lqdc13 pointed out, NLP still isn't even consistently good at understanding standard English. One could reasonably posit that that's due to the inherent ambiguity and inconsistency of most writing, and that by focusing on a narrower, standardized document corpus (the WSJ) you could get better initial results. What, exactly, is controversial about that? Do you really think that the language of the WSJ is no more consistent and formalized than the language of Twitter users?
What, then, are your parameters for formal? It has to be in a book, codified and signed off on? If so, aren't these kind of antithetical to AAVE on the face of it?
> It has to be in a book, codified and signed off on?
The point is that for rules to be formal, they must be formalized and codified somehow. This could be online or in a book, but there must be some clear delineation between when the rules are followed and when they are broken. Otherwise what's the meaning of "formal" rules?
> If so, aren't these kind of antithetical to AAVE on the face of it?
Yes. My argument is that AAVE, almost by definition, is an informal dialect without formalized rules.
This is getting rather off topic, but I think it relates to the original point of why NLP might start with Standard English even if you are not biased. A large corpus of Standard English text (such as from the WSJ) will generally be very internally consistent precisely because it follows a set of formal rules codified into a style guide. As there is no such equivalent for AAVE, even gathering a large and internally consistent corpus of AAVE text seems prohibitively difficult. That being said, I do hope researchers are working on gathering text from Twitter to build up new training sets.