by paul_mk1 842 days ago
Fun to see ternary weights making a comeback. This was hot back in 2016 with BinaryConnect and the TrueNorth chip from IBM Research (disclosure: I was one of the lead chip architects there).

The authors seem to have missed the history. They should at least cite BinaryConnect or straight-through estimators (not my work).

Helpful hint to the authors: you can get down to 0.68 bits/weight using a similar technique, and there is a good chance this will work for LLMs too.

https://arxiv.org/abs/1606.01981

This was a passion project of mine in my last few months at IBM Research :).

I am convinced there is a deep connection between understanding why backprop is unreasonably effective and the result that you can train low-precision DNNs. For those not familiar, the technique is to compute the loss wrt the low-precision parameters (e.g. projected to ternary) but apply the gradient to a high-precision copy of the parameters (known as the straight-through estimator). This is a biased estimator and there is no theoretical underpinning for why this should work, but in practice it works well.
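
For anyone who wants to see the mechanics, here is a minimal sketch of that idea in plain Python, not code from any of the papers mentioned. The toy linear model, the 0.5 threshold, the learning rate, and the helper names (`ternarize`, `make_data`, `train_ste`) are all assumptions chosen for illustration.

```python
import random

def ternarize(w, threshold=0.5):
    """Project a real-valued weight onto {-1, 0, +1} (threshold is an assumption)."""
    if w > threshold:
        return 1.0
    if w < -threshold:
        return -1.0
    return 0.0

def make_data(target, n_samples, seed=0):
    """Toy regression data generated by a ternary 'teacher' weight vector."""
    rng = random.Random(seed)
    xs = [[rng.uniform(-1, 1) for _ in target] for _ in range(n_samples)]
    ys = [sum(t * xi for t, xi in zip(target, x)) for x in xs]
    return xs, ys

def train_ste(xs, ys, n_features, lr=0.05, epochs=200, seed=1):
    rng = random.Random(seed)
    # High-precision copy of the parameters; this is what the optimizer updates.
    latent = [rng.uniform(-0.1, 0.1) for _ in range(n_features)]
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # Forward pass uses only the low-precision (ternary) weights.
            q = [ternarize(w) for w in latent]
            err = sum(qi * xi for qi, xi in zip(q, x)) - y
            # Gradient of 0.5 * err**2 wrt the ternary weights is err * x.
            # Straight-through: apply it unchanged to the high-precision copy,
            # as if the projection had derivative 1.
            for i in range(n_features):
                latent[i] -= lr * err * x[i]
    return latent, [ternarize(w) for w in latent]

target = [1.0, 0.0, -1.0, 1.0]
xs, ys = make_data(target, n_samples=50)
latent, quantized = train_ste(xs, ys, n_features=len(target))
print(quantized)
```

Only the ternary weights ever touch the data; the high-precision copy just accumulates small gradient steps until a weight crosses the threshold and flips.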

My best guess is that it is encouraging the network to choose good underlying subnetworks to solve the problem, similar to Lottery Ticket Hypothesis. With ternary weights it is just about who connects to who (ie a graph), and not about the individual weight values anymore.

8 comments

Thank you. Others on this thread have addressed the citation-trail issues you raise. I just want to tell you how helpful I find your comment about why ternary weights ought to work at all without degrading performance:

> My best guess is that it is encouraging the network to choose good underlying subnetworks to solve the problem, similar to Lottery Ticket Hypothesis. With ternary weights it is just about who connects to who (ie a graph), and not about the individual weight values anymore.

Your guess sounds and feels right to me, even if currently there's no way to express it formally, with the rigor it deserves.

Thank you again for your comment!

IIRC, Hamming's book "Digital Filters" (1989) has a section on FFTs with only the sign of the coefficient being used. It performed surprisingly well.
What is the sign of a complex number? Do you mean the phase?
AFAICT, both the real and imaginary components are from (-1, 0, +1) only. No single sign, but only 8 directions and the center.
You mean Fast Hadamard Transform?
They train using the straight-through estimator, but it is cited in the previous BitNet paper. What happened to the TrueNorth chip? I think investing in specialized hardware for AI is a good bet.
Nice to know there is a trail to relevant citations. I missed the BitNet paper and need to catch up.

Btw, the TrueNorth project evolved into the "NorthPole" chip by the same group, which was recently in the press. From afar NorthPole looks like an interesting design point and leverages on-chip memory (SRAM), so it's targeting speed and efficiency at the expense of memory density (so perhaps like Groq in some respects). Tbh I haven't followed the field closely after leaving the group.

It's really interesting to see that the breadcrumb trail goes back that far.

So what are the most important insights in this paper compared to what was previously done?

I assume there's more context to the story and it's not just that no one thought to apply the concepts to LLMs until now?

I don't think there is anything conceptually new in this work, other than that it is applied to LLMs.

But in fairness, getting these techniques to work at scale is no small feat. In my experience, quantization-aware training at these low bit depths was always finicky and required a very careful hand. I'd be interested to know if it has become easier to do, now that there are so many more parameters in LLMs.

In any case full kudos to the authors and I'm glad to see people continuing this work.

You can probably apply the same techniques that 'Deep neural networks are robust to weight binarization and other non-linear distortions' used to reach 0.68 bits/weight to get your ternary weights below one bit, so you can claim they are still one-bit networks.
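
A rough illustration of how a ternary weight can cost less than one bit: if most weights are zero, the entropy of the weight distribution drops below 1 bit, so an entropy coder could in principle store them that compactly. This is only a sketch; the 10/80/10 split is a made-up example, and whether this matches how the paper arrives at 0.68 is an assumption.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform ternary weights: each of -1, 0, +1 equally likely.
dense = entropy_bits([1 / 3, 1 / 3, 1 / 3])

# Hypothetical sparse ternary weights: 80% zeros, 10% each of -1 and +1.
sparse = entropy_bits([0.1, 0.8, 0.1])

print(f"uniform ternary: {dense:.3f} bits/weight")
print(f"sparse ternary:  {sparse:.3f} bits/weight")
```

Uniform ternary costs log2(3), about 1.585 bits/weight, while the sparse example comes out near 0.92.
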
Could the reason that 3 states are more efficient than 2 states in this case be that 3 is closer to 2.718... (Euler's number) than 2 is?
Why not have some layers/nodes/systems be 2 states and have others be 3... couldn't you get arbitrarily close to Euler's number that way?
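
The idea behind this question is usually called radix economy: the cost of representing numbers in base b scales as b / ln(b), which is minimized at b = e. A quick sketch of that arithmetic is below; whether it has anything to do with why ternary weights train well is an open question, not something the code shows.

```python
import math

def radix_economy(base):
    """Relative cost of representing numbers in a given base: base / ln(base)."""
    return base / math.log(base)

for b in (2, 3, math.e, 4):
    print(f"base {b:.3f}: cost {radix_economy(b):.4f}")
```

Base 3 scores slightly better than base 2, and base 4 ties with base 2.
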
As an aside, I'm curious: what was it like to work at IBM Research, especially as a legacy industrial research org?
They cite straight-through estimators in the previous work, with many of the same authors, on (actual binary) BitNet.