| HN Mirror

The idea of "agency" I have in mind is simply the option to take action at any point in time.

I think the contradiction you see is that the model would have to form a completion to the external input it receives. I'm suggesting that the model would have many inputs: one would be the typical input stream, just as LLMs see, but another would be its own internal recent vectors, akin to a recent stream of thought. A "mode" is not built in to the model; at each token point, it can output whatever vector it wants, and one choice is to output the special "<listening>" token, which means it's not talking. So the "mode" idea is a hoped-for emergent behavior.

Some more details on using two input streams:

All of the input vectors (internal + external), taken together, are available to work with. It may help to think in terms of the typical transformer architecture, where tokens mostly become a set of vectors, and the original order of the words are attached as positional information. In other words, transformers don't really see a list of words, but a set of vectors, and the position info of each token becomes a tag attached to each vector.

So it's not so hard to merge together two input streams. They can become one big set of vectors, still tagged with position information, but now also tagged as either "internal" or "external" for the source.