They likely went with this form factor because it allows for the camera to look forward and use that context in its responses. A watch wouldn't easily be able to do this.
You could make a smartwatch with a camera facing up from the screen. The user could bring their palm to their chest (so the watch face points outward) to activate it. Then the camera can see forward and the microphone is close to the user.
And then you could do video calls on the same device too.
This has the added benefit of giving some recognizable sign that someone is using the camera. Despite proclamations that "the public" is ready to accept being on camera all the time, I'm not convinced that's true when someone is wearing an overt device that is pointed at you all the time and possibly recording, but you're never quite sure.
I for one have no particular desire to be part of your "context" (nor the company's training data set) without knowing it.
But how practical is it really? Let's say it's winter and you have your AI Pin on your winter jacket. Then you get inside and, naturally, take off your jacket. Then you take off your AI Pin and somehow put both parts of it into and onto your sweater? This sounds very cumbersome. A smartwatch just stays on your wrist. You can even take a shower or go swimming with it if you want to. And a smartphone has a screen you can use in many situations – sitting, standing, lying down, with the phone in your hand, lying on a table, attached to a stand. None of this is possible with the AI Pin. It is meant to be attached to your clothing, or you can't use its projector. How do you read your emails? How do you read a book? How do you frame a shot? How do you scroll through TikTok? These are all things people do today with devices that cost way less than $700. And many, many people love to do these things.
If so, they made a big bet. Vision LLMs literally only came out this year. Before that, parsing images to get a coherent response was pretty resource intensive and not really reliable at all. Designing the entire device around image capture for context seems like a very risky approach, so I doubt that was their main reason.