Yeah, discard used to be slow because it flushes pipelines or messes with branch prediction, I don't remember which. I just assumed they "fixed" that by now.
No, it's neither of those. The cost is launching useless threads, plus all the downstream effects of launching them. For example, if you have blending on, that blocks the ROP unit, which has to wait for the threads of a given pixel in order. If you have depth write on, that moves the write to late-Z.
More vertices are not a big problem. Doubling your vertex count is not a big deal, since most GPUs process vertices in groups of 32 or more, and whether multiple instances get packed into the same group depends on the GPU vendor.
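To put rough numbers on the vertex point, here is a back-of-the-envelope sketch in Python. It assumes a deliberately simplified model: vertices are shaded in fixed-size groups of 32, and instances are either packed into shared groups or each start a fresh group. The function name `vertex_groups` and the `pack_instances` flag are made up for illustration, and the model ignores vertex reuse, index caching and everything else real hardware does.

```python
import math

def vertex_groups(vertices_per_instance, instances, group_size=32, pack_instances=True):
    """Count vertex-shader groups launched under a simplified model.

    Hypothetical helper: assumes fixed-size groups and no vertex reuse.
    """
    if pack_instances:
        # Instances share groups, so only the total vertex count matters.
        return math.ceil(vertices_per_instance * instances / group_size)
    # Each instance starts its own group; small meshes leave most lanes idle.
    return instances * math.ceil(vertices_per_instance / group_size)

# 1000 instances of a 4-vertex quad versus an 8-vertex version of it.
for verts in (4, 8):
    packed = vertex_groups(verts, 1000, pack_instances=True)
    unpacked = vertex_groups(verts, 1000, pack_instances=False)
    print(f"{verts} verts: packed={packed} groups, unpacked={unpacked} groups")
```

Under this model, going from 4 to 8 vertices per instance costs nothing extra when instances are not packed (each instance already occupies a full group), while with packing the group count doubles but starts from a much smaller number.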