I can see you've put real thought into your critique, and while I disagree with several of your conclusions, I appreciate the seriousness of the discussion. Hopefully this is a good-faith discussion, and we can keep it that way.

Let me start with the Motte-and-Bailey point, since that seems to be the crux of your argument.

For anyone unfamiliar, a motte-and-bailey fallacy is when someone makes a bold or controversial claim, then retreats to a weaker, safer claim under pressure while pretending the two were always the same. That is not what's happening here.

The confusion begins with a misreading of the title, which, in hindsight, I agree should have been clearer, so that the work itself was being critiqued rather than the semantics. (The paper is clear on this distinction.)

“Post-Transformer Inference” does not mean no transformer, nor does it mean replacement of transformers. It refers to where inference is performed in the pipeline. The transformer remains fully intact and unchanged. It's used exactly as intended: to extract representations. The contribution begins after that point.

The paper is explicit about this throughout:

The transformer is fully used and not replaced.

The compressed heads are task-specific and not general LLM substitutes.

The 224× compression applies to task-specific inference paths, NOT to the base model weights.

There's no shift in scope, no retreat, and no weaker fallback claim. The boundary is fixed and stated clearly.
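To make that boundary concrete, here is a minimal sketch of the general pattern described above: a frozen backbone produces a representation, and a small task-specific head runs after it. This is not the paper's code. `frozen_backbone`, `TaskHead`, `classify`, and the sizes used are hypothetical placeholders chosen only for illustration, and the random "embedding" merely stands in for a real transformer call.

```python
import random

# Illustrative sizes only; these are NOT figures from the paper.
HIDDEN_DIM = 8    # width of the representation handed to the head
NUM_CHOICES = 4   # a four-choice task such as HellaSwag


def frozen_backbone(text: str) -> list[float]:
    """Hypothetical stand-in for the unchanged transformer.

    A real pipeline would run the base model here and return its hidden
    representation. Nothing in this function is trained or modified.
    """
    rng = random.Random(text)  # string seeds are deterministic
    return [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_DIM)]


class TaskHead:
    """Hypothetical small task-specific head applied after the backbone."""

    def __init__(self, in_dim: int, out_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # One weight row per output label.
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)
        ]

    def num_params(self) -> int:
        """Parameter count of the head only; backbone weights are separate."""
        return sum(len(row) for row in self.weights)

    def predict(self, representation: list[float]) -> int:
        """Return the index of the highest-scoring label."""
        scores = [
            sum(w * x for w, x in zip(row, representation))
            for row in self.weights
        ]
        return max(range(len(scores)), key=scores.__getitem__)


def classify(text: str, head: TaskHead) -> int:
    representation = frozen_backbone(text)  # transformer stage, untouched
    return head.predict(representation)     # post-transformer stage


if __name__ == "__main__":
    head = TaskHead(HIDDEN_DIM, NUM_CHOICES)
    print("head parameters:", head.num_params())
    print("predicted choice:", classify("example context", head))
```

In a layout like this, any compression figure would be a statement about the head's inference path (`TaskHead` here), while the backbone's weights stay exactly as they were.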

On HellaSwag and the “4 classes” point, this is simply a category error. HellaSwag is a four-choice benchmark by definition. Advertising four classes describes the label space of the task, not the capacity of the model. Compression here refers to internal representations and compute required for inference, not to the number of output labels. Those are different layers of the system.

The same applies to “CUDA-compatible drop-in.” That phrase refers to integration, not equivalence. It means this work can plug into existing CUDA-based pipelines without requiring teams to rewrite or replace their infrastructure. It absolutely does not claim semantic equivalence to CUDA kernels, nor does it claim GPU replacement. The goal is to extract value without forcing anyone to rebuild their stack. That distinction is intentional and explicit.

You also cited the LessWrong essay, which I'm very familiar with and broadly agree with in spirit. It's a valid warning about vague, unfalsifiable, or scope-shifting claims in LLM-assisted research. That critique applies when claims move or evidence is absent. Here, the claims are narrow, fixed, and empirically evaluated, with code and benchmarks available. Disagree with the results if you want, but that essay just isn't describing this situation at all.

As for the flagging, that's easy. There's nothing mysterious about it. Work that challenges familiar abstractions often gets flagged first for language, not for results. Titles that suggest a different inference boundary tend to trigger skepticism before the experiments are actually read. That doesn't mean the work is incorrect, and it would be wrong to assume so.

Flagging isn't peer review. Real critique points to broken assumptions, flawed metrics, or reproducibility failures.

Again, I will freely admit the title was designed to be punchy, and while it's technically accurate, I can see now how it invites semantic confusion. That is totally fair feedback, and I will refine that framing going forward. That doesn't make the results wrong, nor does it make this a motte-and-bailey.

If you want to talk about the data, the methodology, or where this work is heading next, I'm more than happy to do that. I suspect some of the disagreement here is less about intent and more about where you think the boundary of the system is. Once that clicks, the rest tends to fall into place.