That's not quite what I'm talking about. If you look at Gries's 2006 paper (the less formal one, not his formal proof of the algorithm), each node has 2 bits for state and two child pointers.
State 0: Node is white
State 1: Node is grey and child[0] points to parent
State 2: Node is grey and child[1] points to parent
State 3: Node is black and both child pointers are restored
To generalize this algorithm to nodes with k children, you need to either pack ceil(log2(k+1)) bits into object header and tag bits in the first few fields of the object, or else have a white/not-white bit in the object header and have one tag bit in each object field, indicating which pointer has been reversed to point to the parent node. The latter is less complex, but results in O(k^2) time complexity in the repeated scanning of the object to find the currently reversed pointer each time you recurse back up to the node while traversing the graph. To really maximize speed, I suspect the O(k^2) algorithm is used in practice for small k, and the more complex counter spread across tag bits of multiple pointers is used for nodes with larger k.