Hacker News
by syntaxfree 747 days ago
> delve
3 comments

Had the exact same thought after reading the abstract… FWIW, delve only appears in the abstract. Having not read the rest of the paper yet, I might give the authors the benefit of the doubt that they used an LLM to summarize their findings for the abstract, but didn't abuse an LLM in writing the entire paper.
Putting aside the possibility that they just happened to use the word “delve,” IMO we still have to figure out the convention for this sort of thing. I don’t particularly value the time scientists spend writing the prose around their ideas; the ideas themselves are the valuable part.

One possibility, for example, could be that journals allow AI-written submissions but also require and distribute the prompts. Then we could just read the prompts and be spared stuff like the passive voice dance.

They probably abused a compiler to generate their program instead of writing it in assembly.

Soon AI will turn a chicken scratch of notes into a wonderful email. And then turn it back automatically for the end reader.

We put too much emphasis on the look rather than the substance. People are afraid to send out an email with 2 words, "Meeting Friday," and instead pad it out with pleasantries and detail, context and importance, but none of that really matters.

"Meeting Friday" is not enough information to have me attend the meeting. So I'm not sure what this analogy was supposed to illustrate.
Depends on who it is from I guess.
It's not enough information no matter who it is. If it's someone with enough political, social, or institutional capital you might overlook the annoyance, but it still only tells you when. It doesn't say the what, the where, or the who, all of which have consequences for what I need to do to be prepared.
Exactly what you demonstrated.

‘Meeting Friday’ was the message. You completely ignored the rest. It was just extra padding (intentionally so). Maybe 2 words is too short. But can you honestly tell me that the majority of emails you receive are succinct and to the point? Or do you simply skim them for highlights and extract what is relevant to you?

That’s really the takeaway I was trying to get at. People equate quantity with quality far too often. We send way more content than we need to out of fear that someone will equate less with bad.

A compiler yields deterministic results though.
Regardless of the nitty-gritty “determinism” questions, why does this matter?
LLMs are also deterministic.
No, in most cases the same input will yield a different output.
No, LLMs are deterministic. What you are describing is a randomized seed, which is another input to the LLM. Some interfaces expose this input, and some do not.
Only if you have a non-zero temperature. You have to program in nondeterminism, because otherwise they are 100% deterministic.
Only because most tools provide a randomized seed alongside the input, but you don't have to do that.
in a deterministic way based on seeds
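
For anyone following the seed/temperature back-and-forth, here is a minimal sketch of the sampling step being argued about. It is a toy, not any particular library's API: `toy_logits`, `sample_token`, and `generate` are made-up names, and the logits are hard-coded instead of coming from a model. It only shows that greedy decoding ignores the seed and that seeded sampling repeats itself; it says nothing about hardware-level effects in real inference stacks.

```python
import math
import random


def softmax(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Pick one token index. Temperature 0 is treated as greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def generate(logits_per_step, temperature, seed):
    """Decode one token per step; the seed is just another input."""
    rng = random.Random(seed)
    return [sample_token(step, temperature, rng) for step in logits_per_step]


# Hypothetical per-step logits standing in for a model's outputs.
toy_logits = [[2.0, 1.0, 0.5], [0.1, 0.2, 3.0], [1.0, 1.1, 0.9]]

greedy_a = generate(toy_logits, temperature=0, seed=1)
greedy_b = generate(toy_logits, temperature=0, seed=2)
sampled_a = generate(toy_logits, temperature=1.0, seed=42)
sampled_b = generate(toy_logits, temperature=1.0, seed=42)
```

Here `greedy_a` and `greedy_b` match regardless of seed, and `sampled_a` and `sampled_b` match because the seed is the same. Outputs only differ between runs when the caller supplies a different seed at non-zero temperature.
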
?
Common marker word for LLM-generated text.
A single word is insufficient evidence to conclude that an LLM was used. "Delve" may be low frequency in naturalistic text but there are many words in an article and the chance that some of them will be low-frequency is high. I also checked in my bibliography and found that "delve" is actually not super rare in academic papers including those written before LLMs.
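
A rough back-of-the-envelope version of that argument, as a sketch: if each word in a text independently has a small chance of coming from some "rare" class, the chance that at least one rare word shows up grows quickly with length. The independence assumption and the numbers below are purely illustrative, not measured frequencies.

```python
def prob_at_least_one(per_word_prob, n_words):
    """P(at least one rare word) assuming independent draws per word."""
    return 1.0 - (1.0 - per_word_prob) ** n_words


# Hypothetical numbers: a 1-in-1000 chance per word, in a 5000-word paper.
p_paper = prob_at_least_one(1e-3, 5000)
```

With these made-up inputs `p_paper` comes out at roughly 0.99, so seeing one unusual word in a long paper is close to expected and tells you very little on its own.
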
LLM paranoia reaching next levels...
They probably ran it through spell check too.

Can you believe the nerve of some people? Using tools to help write better?

LLM researchers going full circle
With a quick skim, the paper delivers on its promise. It's not a particularly long or difficult paper to follow.

> Causal tracing. The transformer could be viewed as a causal graph that propagates information from the input to the output through a grid of intermediate states, which allows for a variety of causal analyses on its internal computation

> [...] There are in total three steps:

> 1. The normal run records the model’s hidden state activations on a regular input [...]

> 2. In the perturbed run, a slightly perturbed input is fed to the model which changes the prediction, where again the hidden state activations are recorded. [...] Specifically, for the hidden state of interest, we replace the input token at the same position as the state to be a random alternative of the same type (e.g., r1 → r′1) that leads to a different target prediction (e.g., t → t′).

> 3. Intervention. During the normal run, we intervene the state of interest by replacing its activation with its activation in the perturbed run. We then run the remaining computations and measure if the target state (top-1 token through logit lens) is altered. The ratio of such alterations (between 0 and 1) quantitatively characterizes the causal strength between the state of interest and the target.
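
The three steps quoted above can be sketched on a toy model. This is not the paper's code: the two hand-written "layers", the vocabulary size `N`, and the function names are all assumptions, and the final state is read directly as the prediction in place of a logit lens. It only illustrates the normal run, perturbed run, and intervention loop, with causal strength as the fraction of altered predictions.

```python
import random

N = 7  # toy vocabulary size (hypothetical)


def layer0(state):
    h, r1, r2 = state
    # Position 1 now holds a toy "bridge" value computed from h and r1.
    return [h, (h + r1) % N, r2]


def layer1(state):
    h, b, r2 = state
    # Position 2 now holds the toy prediction computed from b and r2.
    return [h, b, (2 * b + r2) % N]


LAYERS = [layer0, layer1]


def run(inputs, patch=None):
    """Return the states after every layer.

    patch = (layer, position, value) overwrites one state right after it is
    computed, so every later layer sees the patched value.
    """
    states, current = [], list(inputs)
    for i, layer in enumerate(LAYERS):
        current = layer(current)
        if patch is not None and patch[0] == i:
            current[patch[1]] = patch[2]
        states.append(list(current))
    return states


def predict(states):
    """Read the prediction from the last position of the final layer."""
    return states[-1][2]


def causal_strength(layer, position, examples, rng):
    """Fraction of examples whose prediction changes when the state is patched."""
    altered = 0
    for inputs in examples:
        normal = run(inputs)  # step 1: normal run
        # Step 2: perturb the input token at the same position, keeping only
        # alternatives that lead to a different prediction.
        candidates = []
        for alt in range(N):
            perturbed_inputs = list(inputs)
            perturbed_inputs[position] = alt
            perturbed = run(perturbed_inputs)
            if predict(perturbed) != predict(normal):
                candidates.append(perturbed)
        perturbed = rng.choice(candidates)
        # Step 3: rerun the normal input with one state swapped in.
        patched = run(inputs, patch=(layer, position, perturbed[layer][position]))
        altered += predict(patched) != predict(normal)
    return altered / len(examples)


examples = [(h, r1, r2) for h in range(3) for r1 in range(3) for r2 in range(3)]
strength_bridge = causal_strength(0, 1, examples, random.Random(0))
strength_head = causal_strength(0, 0, examples, random.Random(0))
```

In this toy, the state at layer 0, position 1 feeds the prediction, so patching it always changes the output. The state at layer 0, position 0 is never read by the later layer, so patching it never does. That contrast is what the ratio is meant to expose.
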

> The generalizing circuit. [...] The discovered generalizing circuit (i.e., the causal computational pathways after grokking) is illustrated in Figure 4(a). Specifically, we locate a highly interpretable causal graph consisting of states in layer 0, 5, and 8, [...]. Layer 5 splits the circuit into lower and upper layers, where 1) the lower layers retrieve the first-hop fact (h, r1, b) from the input h, r1, store the bridge entity b in S[5, r1], and “delay” the processing of r2 to S[5, r2]; 2) the upper layers retrieve the second-hop fact (b, r2, t) from S[5, r1] and S[5, r2], and store the tail t to the output state S[8, r2].
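
To make the two-hop structure concrete, here is a small sketch contrasting a compositional lookup with direct memorization. The entities, relations, and function names are invented for illustration and have nothing to do with the paper's dataset; plain dictionaries stand in for whatever the model stores internally.

```python
# Hypothetical atomic facts: (head, relation) -> tail
ATOMIC_FACTS = {
    ("alice", "mother"): "beth",
    ("beth", "employer"): "acme",
    ("carol", "mother"): "dana",
    ("dana", "employer"): "globex",
}

# Hypothetical memorized two-hop examples seen during training.
MEMORIZED = {
    ("alice", "mother", "employer"): "acme",
}


def two_hop(h, r1, r2, facts=ATOMIC_FACTS):
    """Compose two lookups: (h, r1) -> bridge b, then (b, r2) -> tail t."""
    b = facts.get((h, r1))  # first hop: retrieve the bridge entity
    if b is None:
        return None
    return facts.get((b, r2))  # second hop: retrieve the tail


def memorized(h, r1, r2, table=MEMORIZED):
    """Answer only if the exact (h, r1, r2) combination was stored."""
    return table.get((h, r1, r2))


seen = ("alice", "mother", "employer")
unseen = ("carol", "mother", "employer")
```

Both functions agree on `seen`, but only `two_hop` answers `unseen`, since the memorized table has no entry for that combination even though both underlying facts are known.
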

> What happens during grokking? To understand the underlying mechanism behind grokking, we track the strengths of causal connections and results from logit lens across different model checkpoints during grokking (the “start” of grokking is the point when training performance saturates). We observe two notable amplifications (within the identified graph) that happen during grokking. The first is the causal connection between S[5, r1] and the final prediction t, which is very weak before grokking and grows significantly during grokking. The second is the r2 component of S[5, r2] via logit lens, for which we plot its mean reciprocal rank (MRR). Additionally, we find that the state S[5, r1] has a large component of the bridge entity b throughout grokking. These observations strongly suggest that the model is gradually forming the second hop in the upper layers (5-8) during grokking. This also indicates that, before grokking, the model is very likely mostly memorizing the examples in train_inferred by directly associating (h, r1, r2) with t, without going through the first hop
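
Mean reciprocal rank, mentioned in the excerpt above, is the average of 1/rank of the correct item, where rank 1 is the top-scored item. A minimal sketch with made-up scores follows. The tie handling (ties do not push the target down) is an assumption, and this is not the paper's evaluation code.

```python
def reciprocal_rank(scores, target_index):
    """1 / rank of the target, where rank 1 is the highest score."""
    target_score = scores[target_index]
    rank = 1 + sum(1 for s in scores if s > target_score)
    return 1.0 / rank


def mean_reciprocal_rank(score_lists, target_indices):
    """Average reciprocal rank over a batch of examples."""
    ranks = [reciprocal_rank(s, t) for s, t in zip(score_lists, target_indices)]
    return sum(ranks) / len(ranks)


# Hypothetical scores over a 3-item vocabulary for two examples.
batch_scores = [[0.1, 0.9, 0.3], [0.5, 0.2, 0.3]]
batch_targets = [1, 1]
mrr = mean_reciprocal_rank(batch_scores, batch_targets)
```

In the first example the target is ranked first (reciprocal rank 1), in the second it is ranked third (reciprocal rank 1/3), so `mrr` is 2/3. An MRR climbing toward 1 across checkpoints would mean the target component is becoming the top-ranked one.
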

> Why does grokking happen? These observations suggest a natural explanation of why grokking happens through the lens of circuit efficiency. Specifically, as illustrated above, there exist both a memorizing circuit Cmem and a generalizing circuit Cgen that can fit the training data [...]