While it's true this is mostly justification for further investigation, correctly categorizing 12 out of 12 people into 2 categories has a one-sided p-value of 1/2^12 ≈ 0.000244, which would easily allow you to reject the null hypothesis of random categorization. The stronger the effect, the fewer samples you need.
We consider n=12 generally underpowered only because many real-world effects are way weaker than the ability this woman demonstrated.
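To put rough numbers on the "stronger effect, fewer samples" point, here is a minimal sketch of an exact power calculation for a one-sided binomial test with n=12. The 0.05 significance level, the function names, and the list of "true accuracy" values are my own assumptions for illustration, not anything from the original study.

```python
from math import comb

def upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_count(n, alpha=0.05):
    """Smallest number correct whose chance under pure guessing is <= alpha."""
    for k in range(n + 1):
        if upper_tail(n, k, 0.5) <= alpha:
            return k
    return None  # no outcome is rare enough at this n

def power(n, true_accuracy, alpha=0.05):
    """Chance of reaching the critical count if the real hit rate is true_accuracy."""
    k = critical_count(n, alpha)
    if k is None:
        return 0.0
    return upper_tail(n, k, true_accuracy)

# Hypothetical hit rates, from barely better than chance to perfect.
for acc in (0.6, 0.75, 0.9, 1.0):
    print(f"true accuracy {acc:.2f}: power {power(12, acc):.3f}")
```

With 12 trials, a weak effect is missed most of the time and a near-perfect one is almost always caught, which is the whole reason n=12 is "underpowered" for typical effects but not for this one.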
What makes this result meaningless? The probability of her guessing all 12 correctly at random is 0.5^12 ≈ 0.0002. So it is far more statistically significant than many published results.
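For anyone who wants to check the arithmetic, here is a small sketch of the exact calculation, assuming each of the 12 guesses is independent with a 50% chance of being right under the null. The function name is just one I made up for this example.

```python
from math import comb

def binomial_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): chance of at least k correct by guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of getting all 12 of 12 right by pure guessing.
p_value = binomial_tail(12, 12)
print(p_value)  # 0.000244140625, i.e. 1/4096
```

The same function gives the tail probability for less perfect scores, for example `binomial_tail(12, 11)` for "11 or more correct".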
Meaningless to draw large-scale conclusions from. It's a "this is something we should look more closely at," not a "send this person around the country STAT."