For me the classic beginner's mistake is treating an 'if' the same on both. On a CPU an 'if' costs you about a clock cycle (leaving branch prediction aside); on a GPU it costs you the sync between cores. In the best case the cores fall back into step over many loop iterations. More often some cores drift further and further out of sync, and eventually all the others have to wait for the last one. If they don't wait, memory accesses become blocking, and that can cost all cores thousands of cycles.
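To make the cost of losing sync a bit more concrete, here is a toy cost model in Python. It is not real GPU code. It assumes a group of 32 lanes that run in lockstep (the warp size on NVIDIA hardware), and that when lanes disagree on an 'if' the group runs both sides one after the other while the lanes not on that side sit idle. The names `warp_cycles`, `then_cost` and `else_cost`, and the cycle counts, are made up for illustration.

```python
WARP_SIZE = 32  # assumed number of lanes that execute in lockstep


def warp_cycles(takes_branch, then_cost, else_cost):
    """Cycles a lockstep group spends on one if/else in this toy model.

    takes_branch: one bool per lane, True if that lane takes the 'then' side.
    """
    cycles = 0
    if any(takes_branch):        # at least one lane needs the 'then' side
        cycles += then_cost
    if not all(takes_branch):    # at least one lane needs the 'else' side
        cycles += else_cost
    return cycles


# Every lane agrees: only one side of the 'if' is executed.
uniform = [True] * WARP_SIZE

# Even lanes take 'then', odd lanes take 'else': both sides are executed.
divergent = [lane % 2 == 0 for lane in range(WARP_SIZE)]

uniform_cycles = warp_cycles(uniform, then_cost=10, else_cost=10)
divergent_cycles = warp_cycles(divergent, then_cost=10, else_cost=10)
print(uniform_cycles, divergent_cycles)  # prints: 10 20
```

In this model a single lane that disagrees is enough to make the whole group pay for both sides, which is the "everyone waits for the last one" effect described above. Real hardware adds scheduling and memory effects that this sketch ignores.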