Makes me wonder whether compute kernels would be a better solution when you're trying to emulate all these rasterizer quirks while staying fast enough to be playable.
You need to render a few thousand polygons to a 256x192 buffer. DeSmuME has a software renderer, and it's more than fast enough. The trick is being compatible.