I wish the page were just the prompt they used to generate the article. I like LLMs as much as the next person, but we don't really need two intermediate LLM layers (expand and summarise) between your brain and mine.
Edit: the author's comment below is dead, so I'll reply here: The tape and the general effort are great; it's the overused LLM-style intro above them that grates. LLM writing is now like the Bootstrap of old: so overused that it's tedious to read.
fair enough, here are the actual fixes from the codebase with the tape examples they target:
arithmetic (Q119): benjamin buys 5 books at $20, 3 at $30, 2 at $45. model writes "$245" first line then self-corrects to $280. fix: model writes a python expression, subprocess evals it, answer comes back deterministic.
```python
# excerpt from inside the arithmetic handler (the enclosing def is not shown)
code_response = generate_response(messages, temperature=0.2)
code = _extract_python_code(code_response)
ok, out = _run_python_sandboxed(code, timeout=8)
if ok:
    return _wrap_computed_answer(user_message, out)
return None  # fallback to raw generation
```
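For anyone wondering what the "subprocess evals it" step might look like: here is a minimal sketch, not the article's `_run_python_sandboxed`. The name `run_python_subprocess` is made up, and a child process with a timeout is an assumption about the approach. It is process isolation only, not a real security sandbox.

```python
import subprocess
import sys


def run_python_subprocess(code, timeout=8.0):
    """Run `code` in a separate interpreter and return (ok, output_or_error).

    Hypothetical stand-in for a sandboxed runner: it only isolates the
    process and enforces a timeout, it does not restrict what the code does.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout.strip()


# The Q119 expression: 5 books at $20, 3 at $30, 2 at $45.
ok, out = run_python_subprocess("print(5 * 20 + 3 * 30 + 2 * 45)")
print(ok, out)  # True 280
```

The point of the design is that the model only has to get the expression right; the arithmetic itself never goes through sampling.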
logic (Q104): "david has three sisters, each has one brother. how many brothers does david have?" model writes "that brother is david" in its reasoning then ships "one brother." correct answer: zero. fix: model writes Z3 constraints or python enumeration, solver returns the deterministic answer.
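The python-enumeration route for Q104 could look roughly like this. It is a sketch under my own assumptions, not the code from the article: `davids_possible_brother_counts` and the search bound are invented for illustration. It tries every candidate number of brothers for David and keeps the ones where each sister ends up with exactly one brother.

```python
def davids_possible_brother_counts(max_brothers=10):
    """Enumerate how many brothers David can have.

    Family model: 3 sisters, David, plus `extra` other boys.
    Every boy in the family is a brother to each sister, so each
    sister has 1 + extra brothers. The puzzle says that number is 1.
    """
    solutions = []
    for extra in range(max_brothers + 1):
        boys_in_family = 1 + extra          # David plus his brothers
        brothers_per_sister = boys_in_family
        if brothers_per_sister == 1:        # "each has one brother"
            solutions.append(extra)         # extra == David's brother count
    return solutions


print(davids_possible_brother_counts())  # [0]
```

The only consistent family has David as the single boy, which is the "zero" answer the model reasoned its way to and then failed to ship.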
persona break (Q93): doctor roleplay, patient mentions pregnancy. model drops character: "I am an AI, not a licensed medical professional." fix: regex scan, regen once with stronger persona anchor.
```python
_IDENTITY_LEAK_PHRASES = [
    "don't have a body", "not a person", "not human",
    "as a language model", "as an ai", "i'm a program",
]
if any(phrase in response.lower() for phrase in _IDENTITY_LEAK_PHRASES):
    messages[-1]["content"][0]["text"] += (
        "\nCRITICAL: Stay in character. Never reference your nature."
    )
    response = generate_response(messages, *params)
```
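One thing worth noting about the snippet above: it does plain substring matching, and the Q93 quote as written ("I am an AI, not a licensed medical professional") contains none of the six phrases shown, so the real list is presumably longer. A word-boundary regex version might look like the sketch below. The pattern list and the names `IDENTITY_LEAK_RE` and `leaks_identity` are my assumptions, not the author's.

```python
import re

# Hypothetical pattern set. \b keeps "as an ai" from matching inside
# "as an airline pilot", which a bare substring check would flag.
IDENTITY_LEAK_RE = re.compile(
    r"\b(?:as an ai|i am an ai|i'm an ai|as a language model"
    r"|not a person|not human)\b",
    re.IGNORECASE,
)


def leaks_identity(response):
    """Return True if the response appears to break persona."""
    return IDENTITY_LEAK_RE.search(response) is not None


print(leaks_identity("I am an AI, not a licensed medical professional."))  # True
print(leaks_identity("As an airline pilot, I fly often."))                 # False
```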
self-correction artifacts (Q111, Q114, Q119): model writes "Wait, let me recheck" or "Corrected Answer:" inline. right answer, messy output. fix: regex for correction markers, strip the draft, ship the clean tail.
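```python
import re

# CORRECTION_MARKERS: list of regex patterns, defined elsewhere in the codebase
def strip_corrections(response):
    for marker in CORRECTION_MARKERS:
        match = re.search(marker, response)
        if match:
            # keep only the text after the first marker that matches
            return response[match.end():].strip()
    return response
```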
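A runnable variant of the stripping step, for illustration only. The marker patterns here are guesses based on the two phrases quoted above, since the real `CORRECTION_MARKERS` list is not shown. This version also differs from the original in one way: it cuts at the last marker found anywhere in the text, so a response that corrects itself twice still ships only the final tail.

```python
import re

# Hypothetical markers, derived from the phrases quoted in the comment.
CORRECTION_MARKERS = [
    r"Corrected Answer:\s*",
    r"Wait, let me recheck[.:]?\s*",
]


def strip_to_last_correction(response):
    """Return the text after the last correction marker, or the input unchanged."""
    last_end = None
    for marker in CORRECTION_MARKERS:
        for match in re.finditer(marker, response, flags=re.IGNORECASE):
            if last_end is None or match.end() > last_end:
                last_end = match.end()
    if last_end is None:
        return response
    return response[last_end:].strip()


print(strip_to_last_correction("Total: $245. Corrected Answer: $280"))  # $280
```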
constraint drift (Q87, Q99): Q87, "four-word sentences": nailed 5/17 then drifted. Q99, "<10 lines": shipped 20-line poems twice. fix: draft, verify each constraint against the original prompt, refine only the failures. three passes.
```python
def execute_rewrite_with_verify(user_message):
    # draft_msgs, verify_msgs and refine_msgs are presumably built from
    # user_message; their construction is not shown in this excerpt
    draft = generate_response(draft_msgs)      # pass 1: draft
    verdict = generate_response(verify_msgs)   # pass 2: check each requirement
    if "PASS" in verdict:
        return draft
    refined = generate_response(refine_msgs)   # pass 3: fix only failures
    return refined
```
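The verify pass above is itself a model call. For the two constraints named here (four-word sentences, under 10 lines), a deterministic check is also possible. The sketch below is a swapped-in alternative and not what the article does; the function names and the naive sentence splitter are my assumptions.

```python
import re


def four_word_violations(text):
    """Return the sentences that do not have exactly four words.

    Splits naively on ., ! or ? followed by whitespace, so abbreviations
    like "Dr. Smith" would be split incorrectly.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip())]
    return [s for s in sentences if s and len(s.split()) != 4]


def under_line_limit(text, limit=10):
    """Return True if the text has fewer than `limit` non-blank lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return len(lines) < limit


story = "The house stood silent. Nobody ever went inside there. Dust covered every shelf."
print(four_word_violations(story))              # ['Nobody ever went inside there.']
print(under_line_limit("\n".join(["x"] * 20)))  # False
```

A check like this could feed the refine pass an exact list of failing sentences instead of a free-text verdict.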
every one of these maps to a specific question in the tape. the full production code with all implementations is in the article. everything is open: seqpu.com/CPUsArentDead