You're right, and it's the main gap right now. We test against xterm-headless with thousands of property-based test iterations, but that's one terminal. I've been developing against a handful of others (Windows Terminal, macOS Terminal, Ghostty), and real terminals disagree on more than you'd expect. Cross-terminal conformance testing is what I'm focused on next.
Do you have any ideas for automatically testing how well different features render across different terminals and OSes?