Yes, I wondered whether "referring" had some special meaning, since the way they seem to use it suggests the word "reference" would normally be more appropriate there (unless a special meaning warrants the different word).
I'm just inferring myself, but I believe it refers to user queries that discuss things in the foreground/background or at a specific location in the provided image (such as top right, behind the tree, etc.).
It sounds like the "region inputs" are raster or vector inputs. So I'm imagining highlighting a region of the photo with my finger and having it tell me "that's the Duomo in Florence."