Exactly. Memes are funny because they make meta references that are culturally relevant or simply attach absurd bottom lines. It's highly unlikely a deep neural network can model anything like that.
Considering most deep learning results are interpreted as absurd/bizarre, I don't think the machine will have much difficulty intentionally or unintentionally emulating meme culture.
I think the image needs to be an input somehow. I imagine running an image classifier (e.g., YOLO9000) to extract “pretrained” features and making those values inputs into a modified LSTM could allow learning to synthesize text and perception. I’d suggest learning new image embeddings (training a neural network to extract image features from scratch), but it’d be difficult to get enough images/enough different images.