The prompt isn't relevant to this question, though. The quality of output can be improved with better input, but in this case I am curious about the underlying mechanics in the model that lead to such behavior.
I had a prompt saved that would give full sources per sentence of the response. It was useful for one purpose, then became annoying and time-consuming. I was diagnosing hallucinations and training data issues.
Maybe crafting something to give a full APA or MLA citation and works cited page per response could help.
Perhaps... but "prompt engineering" right now seems like throwing paint at a wall until the black box evaluates to the relative "truth" you were looking for. It's like a stochastic wrench you turn until it serves your intensive porpoises.
Tokens? The way I see these things operating (in my head) is as a hyper-dimensional merge sort that loses its context/bounded domain during evaluation, leading to something less than the sum of its parts, because the weights between tokens correlate linguistic/phonetic relationships, which lose their causal relationship to the real world.