They do say the dataset is for pretraining. If your application needs to identify humans, then your fine-tuning dataset would include humans.
This seems totally appropriate for pretraining on natural images IMO.
Wouldn't your pretrained model lack neurons that respond to human features, though? That seems like a poor choice if your ultimate goal is working with human images.