It's actually not that easy for an AI agent: it would have to do some actual coding to reverse engineer the font file, or take screenshots at different variable font settings to home in on the "focused" version of the font. All of that being said, the intention (beyond just having fun creating it) was to make it AI "unfriendly", so AI bots doing broad, quick reads of it are going to be left with gobbledegook encoded characters.
Most LLMs can engage with text in picture form just as well as with text in token form. In fact, my initial research on this (later corroborated by actual published papers) indicates that this is a cheap way to save on tokens.
Oh, interesting, and good to know about the token savings with this technique. My test with Claude had it use vision and then programmatically try different variable font axis values (mimicking the user scrub interaction) until it was able to OCR the text.
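For what it's worth, that scrub-and-OCR loop can be sketched roughly like this. Everything here is an assumption for illustration, not what Claude actually ran: `read_at` stands in for whatever renders the page at a given axis value and OCRs the screenshot, `fake_read_at` is a stub that pretends the text is only legible near 400, and the word-list score is just one possible way to decide when the text has come into focus.

```python
from typing import Callable, Iterable, Tuple

# Tiny stand-in vocabulary (hypothetical); a real check would use a proper word list.
COMMON_WORDS = {"the", "a", "is", "of", "and", "to", "in", "this", "text", "hidden"}


def legibility(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that are known words."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in COMMON_WORDS for w in words) / len(words)


def find_focus(
    read_at: Callable[[float], str],
    axis_values: Iterable[float],
) -> Tuple[float, str, float]:
    """Sweep the axis and keep the value whose OCR output looks most like words."""
    best = None
    for value in axis_values:
        text = read_at(value)
        score = legibility(text)
        if best is None or score > best[2]:
            best = (value, text, score)
    if best is None:
        raise ValueError("axis_values was empty")
    return best


def fake_read_at(value: float) -> str:
    """Hypothetical stand-in for render + OCR: legible only near axis value 400."""
    if abs(value - 400) <= 25:
        return "this is the hidden text"
    return "x9$ q#l zz@ p0~ kk%"


# Mimic scrubbing the slider in steps of 50.
focus_value, text, score = find_focus(fake_read_at, range(100, 901, 50))
print(focus_value, text, score)
```

In a real run, `read_at` would set the font's axis, take a screenshot, and pass it to an OCR engine; a coarse sweep followed by a finer one around the best value would cut down on the number of screenshots.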
I mean, I can't know for sure, but I'm pretty sure that by the time the upper layers of the network are reached, the lower layers have already transformed the image tiles into proper position-encoded embeddings of the tokens in the words in the image.
As they said in the comment you replied to: "Note that a sufficiently prompted AI agent can definitely read this, so it's not meant to be cryptographically sound, more just unfriendly to the common AI reader!"