| HN Mirror



	by LuxBennu 142 days ago
	The title is misleading — there's no trained 100B model, just an inference framework that claims to handle one. But the engineering is worth paying attention to. I run quantized 70B models locally (M2 Max 96GB, llama.cpp + LiteLLM), and memory bandwidth is always the bottleneck. The 1.58-bit approach is interesting because ternary weights turn matmuls into additions — a fundamentally different compute profile on commodity CPUs. If 5-7 tok/s on a single CPU for 100B-class models is reproducible, that's a real milestone for on-device inference. Framework is ready. Now we need someone to actually train the model.
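A rough back-of-the-envelope sketch of the bandwidth argument, assuming decoding is purely memory-bound and every weight is read exactly once per generated token. The 400 GB/s figure is Apple's published bandwidth for the M2 Max; the 100 GB/s "commodity CPU" figure and the function name are hypothetical, chosen only for illustration.

```python
def max_tokens_per_second(params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed if each weight is streamed once per token."""
    # 1e9 params * bits / 8 bits-per-byte gives the weight size in GB.
    model_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / model_gb


# 70B model at 4 bits per weight on an M2 Max (400 GB/s): roughly 11 tok/s at best.
print(max_tokens_per_second(70, 4, 400))

# 100B ternary model (1.58 bits per weight) on an assumed 100 GB/s CPU memory bus.
print(max_tokens_per_second(100, 1.58, 100))
```

Real throughput will be lower than this bound, since it ignores the KV cache, activations, and compute time.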

14 comments

embedding-shape 142 days ago

> Framework is ready. Now we need someone to actually train the model.

If Microslop aren't gonna train the model themselves to prove their own thesis, why would others? They've had 2 years (I think?) to prove BitNet in at least some way, are you really saying they haven't tried so far?

Personally that makes it slightly worrisome to just take what they say at face value, why wouldn't they train and publish a model themselves if this actually led to worthwhile results?

throwaw12 142 days ago

Because this is Microsoft: experimenting and failing is not encouraged; taking less risky bets and getting promoted is. Also, no customer asked them for a 1-bit model, hence the PM didn't prioritize it.

But that doesn't mean the idea is worthless.

You could have said the same about Transformers: Google released it but didn't move forward, and it turned out to be a great idea.

embedding-shape 142 days ago

> You could have said the same about Transformers: Google released it but didn't move forward

I don't think you can. Google looked at the research results and continued researching Transformers and related technologies, because they saw the value in it, particularly for translation. The original paper covers what direction to take; give it a read, it's relatively approachable for a machine learning paper :)

Sure, it took OpenAI to make it into an "assistant" that answered questions, but it's not like Google was completely sleeping on the Transformer, they just had other research directions to go into first.

> But that doesn't mean the idea is worthless.

I agree, it isn't; I hope that wasn't how my message read :) But ideas that don't actually pan out in reality are slightly less useful than ideas that do pan out once put into practice. The root commenter seems to be saying "This is a great idea, it's all ready, the only missing piece is for someone to do the training and it'll pan out!", which I'm a bit skeptical about, since it's been two years since they introduced the idea.

zozbot234 142 days ago

What OpenAI did was train increasingly large transformer model instances, which was sensible because transformers allowed for a scaling up of training compared to earlier models. The resulting instances (GPT) showed good understanding of natural language syntax and generation of mostly sensible text (which was unprecedented at the time), so they made ChatGPT by adding new stages of supervised fine-tuning and RLHF to their pretrained text-prediction models.

mattalex 142 days ago

There were plenty of models the size of GPT-3 in industry.

The core insight necessary for ChatGPT was not scaling (that was already widely accepted): the insight was that instead of fine-tuning for each individual task, you can fine-tune once for the meta-task of instruction following, which brings a problem specification directly into the data stream.

joquarky 142 days ago

I miss having completion models like davinci-003: what they lacked in simplicity they made up for in performance, once you worked out how to get what you wanted out of them.

It was fun to come up with creative ways to get it to answer your question or generate data by setting up a completion scenario.

I guess "chat" became the universal completion scenario. But I still feel like it could be "smarter" without the RLHF layer of distortion.

Schlagbohrer 142 days ago

Google had been working on a big LLM but they wanted to resolve all the safety concerns before releasing it. It was only when OpenAI went "YOLO! Check this out!" that Google then internally said, "Damn the safety concerns, full speed ahead!" and now we find ourselves in this breakneck race in which all safety concerns have been sidelined.

gardnr 142 days ago

Scaling seemed like the important idea that everyone was chasing. OpenAI used to be a lot more safety minded because it was in their non profit charter, now they’ve gone for-profit and weaponized their tech for the USA military. Pretty wild turnaround. Saying OpenAI was cavalier with safety in the early days is inaccurate. It was a skill issue. Remember Bard? Google was slow.

joquarky 142 days ago

They thought people might prefer quality and safety.

wongarsu 142 days ago

On the one hand, not publishing any new models for an architecture in almost a year seems like forever given how things are moving right now. On the other hand I don't think that's very conclusive on whether they've given up on it or have other higher priority research directions to go into first either

vineyardmike 141 days ago

> You could have said the same about Transformers: Google released it but didn't move forward, and it turned out to be a great idea

Google released Transformers as research because they invented it while improving Google Translate. They had been running it for customers for years.

Beyond that, they had publicly-used transformer based LMs ("mums") integrated into search before GPT-3 (pre-chat mode) was even trained. They were shipping transformer models generating text for years before the ChatGPT moment. Literally available on the Google SERP page is probably the widest deployment technology can have today.

Transformers are also used widely in ASR technologies, like Google Assistant, which of course was available to hundreds of millions of users.

Finally, they had private-to-employees experimental LLMs available, as well as various research initiatives released (Meena, LaMDA, PaLM, BERT, etc.) and other experiments; they just didn't productize everything (but see earlier points). They even experimented with scaling (see "Chinchilla scaling laws").

GorbachevyChase 142 days ago

The most benign answer would be that they don’t want to further support an emerging competitor to OpenAI, which they have significant business ties to. I think the more likely answer which you hinted at is that the utility of the model falls apart as scale increases. They see the approach as a dead end so they are throwing the scraps out to the stray dogs.

riskable 142 days ago

Not to mention Microsoft's investments in Nvidia and other GPU-adjacent/dependent companies!

A successful ternary model would basically erase all that value overnight. In fact, the entire stock market could crash!

Think about it: This is Microsoft we're talking about! They're a convicted monopolist that has a history of manipulating the market for IT goods and services. I wouldn't put it past them to refuse to invest in training a ternary model or going so far as to buy up ternary startups just to shut them down.

Want to make some easy money: Start a business training a ternary model and make an offer to Microsoft. I bet they'll buy you out for at least a few million even if you don't have a product yet!

hnlmorg 142 days ago

If that were true then they simply wouldn’t have published this research to begin with.

Occam’s Razor suggests this simply doesn’t yield as good results as the status quo

observationist 142 days ago

So is it finally time for a Beowulf cluster to do something amazing?

joquarky 142 days ago

Maybe! And it looks like Windows 11 will finally make this the year of the Linux desktop...

grepfru_it 142 days ago

I'll believe that when CowboyNeaLLM is released

embeddnet 142 days ago

Rest assured, all the big players (OpenAI, Google, DeepSeek, etc.) have run countless experiments with 4, 3, 2, 1.58, and 1 bits, and various sparse factors and shapes. This barrel has been scraped to the bottom.

Aerroon 142 days ago

I have doubts about this. Perhaps the closed models have, but I wouldn't be so sure for the open ones.

GLM 5, for example, is running 16-bit weights natively. This makes their 755B model 1.5 TB in size. It also makes their 40B active parameters ~80 GB.

Compare this to Kimi K2.5. 1T model, but it's 4-bit weights (int4), which makes the model ~560 GB. Their 32B active parameters are ~16 GB.

Sure, GLM 5 is the stronger model, but is that price worth paying with 2-3x longer generation times? What about 2-3x more memory required?

I think this barrel's bottom really hasn't been scraped.
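The sizes quoted above follow from parameters times bits per weight. A minimal sketch of that arithmetic, assuming every weight is stored at the stated precision (the helper name is made up for illustration):

```python
def weights_gb(params_billion, bits_per_weight):
    """Storage for the weights alone, in GB (1e9 bytes)."""
    return params_billion * bits_per_weight / 8


glm_total = weights_gb(755, 16)    # 1510 GB, i.e. about 1.5 TB
glm_active = weights_gb(40, 16)    # 80 GB touched per token
kimi_total = weights_gb(1000, 4)   # 500 GB if every weight were int4
kimi_active = weights_gb(32, 4)    # 16 GB touched per token

print(glm_total, glm_active, kimi_total, kimi_active)
print(glm_active / kimi_active)    # ratio of bytes read per token
```

Pure int4 for 1T parameters gives 500 GB; the ~560 GB figure presumably reflects some tensors kept at higher precision, which is an assumption on my part rather than something stated in the comment.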

gregman1 142 days ago

Cannot agree more!

deepsquirrelnet 142 days ago

The title being misleading is important as well, because this has landed on the front page, and a trained 100B model is the only thing that would have been notable about this submission.

The "new" banner on Hugging Face points to weights that were uploaded 11 months ago, and it's 2B params. Work on this in the repo is 2 years old.

The amount of publicity compared to the anemic delivery for BitNet is impressive.

wongarsu 142 days ago

I've also always thought that it's an interesting opportunity for custom hardware. Two-bit addition is incredibly cheap in hardware, especially compared to anything involving floating point. You could make huge vector instructions on the cheap, then connect it to the fastest memory you can buy, and you have a capable inference chip.

You'd still need full GPUs for training, but for inference the hardware would be orders of magnitude simpler than what Nvidia is making

monocasa 142 days ago

These are trits, which provide their own efficiencies.

Interestingly, a trit x float multiplier is cheaper than a trit x integer multiplier in hardware if you're willing to ignore things like NaNs.

0 and 1 are trivial, just a mux for identity and zero. But because floats are sign-magnitude, multiply by -1 is just an inverter for the sign bit, whereas for integers you need a bitwise inverter and a full incrementer.
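A small software sketch of that asymmetry, assuming IEEE 754 float32 and 32-bit two's-complement integers. The function names are hypothetical, and this only mirrors the bit-level operations, not an actual hardware design.

```python
import struct


def trit_times_float32(trit, x):
    """Multiply a float32 by a trit in {-1, 0, 1} using only a mux and a sign flip."""
    if trit == 0:
        return 0.0  # mux selects zero (this ignores NaN/inf inputs, as noted above)
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if trit == -1:
        bits ^= 0x80000000  # sign-magnitude: negation is a single bit flip
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def negate_int32(bits):
    """Two's-complement negation: invert every bit, then add one (needs a carry chain)."""
    return (~bits + 1) & 0xFFFFFFFF


print(trit_times_float32(-1, 1.5))
print(hex(negate_int32(5)))
```

The integer path needs the `+ 1`, which in hardware means a carry propagating across all 32 bits; the float path touches one bit.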

buo 142 days ago

Do you know a good reference to learn more about this (quantizing weights to 1.58 bits, and trit arithmetic)?

fc417fc802 142 days ago

There's lots of literature on quantizing weights (including trits and binary) going back 15+ years. Nothing to hand right now but it's all on arxiv.

The relevant trit arithmetic should be on display in the linked repo (I haven't checked). Or try working it out for the uncompressed 2 bit form with a pen and paper. It's quite trivial. Try starting with a couple bitfields (inputs and weights), a couple masks, and see if you can figure it out without any help.

regularfry 142 days ago

You only need GPUs if you assume the training is gradient descent. GAs or anything else that can handle nonlinearities would be fine, and possibly fast enough to be interesting.

riidom 142 days ago

Text is misleading too. 5-7 tok/sec is not reading speed, it's a tad slower. For me, at least, and I am an experienced reader, not especially schooled in quick-reading though.

I happened to "live" on 7.0-7.5 tok/sec output speed for a while, and it is an annoying experience. It is the equivalent of walking behind someone slightly slower on a sidewalk. I dealt with this by deliberately looking away for a minute until output was "buffered" and only then started reading.

For any local setup I'd try to reach for 10 tok/sec. Sacrifice some kv cache and shove a few more layers on your GPU, it's worth it.

WithinReason 142 days ago

> a fundamentally different compute profile on commodity CPU

In what way? On modern processors, a Fused Multiply-Add (FMA) instruction generally has the exact same execution throughput as a basic addition instruction

ismailmaj 142 days ago

You drop the memory throughput requirements because of the packed representation of bits so an FMA can become the bottleneck, and you bypass the problem of needing to upscale the bits to whatever FP the FMA instruction needs.

Typically, for 1-bit matmul you can get away with XORs and popcounts, which should have a better throughput profile than FMA when taking into account the SIMD nature of the inputs/outputs.
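For the pure 1-bit case, a minimal sketch of the XOR/popcount trick, assuming both vectors hold values in {-1, +1} and are packed one bit per element (bit set means +1). Python integers stand in for SIMD registers here, and the helper names are made up for illustration.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def pack_binary(values):
    """Pack a list of +1/-1 values into an integer, bit i set when values[i] == +1."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n."""
    # Differing bits contribute -1, matching bits contribute +1.
    differing = popcount((a_bits ^ b_bits) & ((1 << n) - 1))
    return n - 2 * differing


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
print(binary_dot(pack_binary(a), pack_binary(b), len(a)))
```

One XOR and one popcount cover as many elements as fit in a register, which is where the throughput argument comes from.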

WithinReason 142 days ago

yes but this is not 1 bit matmul, it's 1.58 bits with expensive unpacking

ismailmaj 142 days ago

The title and the repo use "1-bit" when they mean 1.58-bit ternary values; it doesn't change any of my arguments (still XORs and popcounts).

WithinReason 142 days ago

How do you do ternary matmul with popcnt on 1.58 bit packed data?

ismailmaj 142 days ago

Assuming 2 bits per value (the first bit is the sign, the second bit says whether the value is nonzero):

actv = A[_:1] & B[_:1]

sign = A[_:0] ^ B[_:0]

dot = pop_count(actv & !sign) - pop_count(actv & sign)

It can probably be made more efficient by taking a column-first format.

Since we are in CPU land, we mostly deal with dot products sized to fit the cache. I'm not assuming a tiled matmul instruction, which would be unlikely to support this odd 1-bit format anyway.
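The three lines of pseudocode above can be sketched as runnable Python, assuming both operands are ternary and each is stored as two separate bit planes (a sign plane and a nonzero plane). Python integers stand in for wide registers, and all names are hypothetical.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def pack_ternary(values):
    """Split a list of values in {-1, 0, 1} into (sign_plane, nonzero_plane)."""
    sign, nonzero = 0, 0
    for i, v in enumerate(values):
        if v != 0:
            nonzero |= 1 << i
        if v < 0:
            sign |= 1 << i
    return sign, nonzero


def ternary_dot(a, b):
    """Dot product of two packed ternary vectors using AND, XOR and popcount."""
    a_sign, a_nonzero = a
    b_sign, b_nonzero = b
    active = a_nonzero & b_nonzero   # positions where both values are nonzero
    negative = a_sign ^ b_sign       # positions where the signs differ
    return popcount(active & ~negative) - popcount(active & negative)


a = [1, -1, 0, 1, -1]
b = [-1, -1, 1, 0, 1]
print(ternary_dot(pack_ternary(a), pack_ternary(b)))
```

This uses 2 bits per value, so it trades some of the 1.58-bit density for cheap unpacking; it says nothing about how the linked repo actually lays out its weights.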

ActivePattern 142 days ago

The win is in how many weights you process per instruction and how much data you load.

So it's not that individual ops are faster — it's that the packed representation lets each instruction do more useful work, and you're moving far less data from memory to do it.

actionfromafar 142 days ago

Perhaps the BitNet encoding is more information-dense per byte? CPUs have slow buses, so it would eke more use out of the available bandwidth?

DrBazza 142 days ago

> memory bandwidth is always the bottleneck

I'm hoping that today's complaints are tomorrow's innovations. Think back to when a 1 MB hard drive was $100,000, or when Gates said 640 KB is enough.

Perhaps someone in the (chip) industry can comment on what RAM manufacturers are doing at the moment: better, faster, larger? Or is there not much headroom left, and it's down to motherboard manufacturers and volume?

fc417fc802 142 days ago

Chip speed has increased faster than memory speed for a long time now, leaving DRAM behind. GDDR was good for a while but is no longer sufficient. HBM is what's used now.

The last logical step of this process would be figuring out how to mix the CPU transistors with the RAM capacitors on the same chip as opposed to merely stacking separate chips on the same package.

A related stopgap is the AI startup (forget which) making accelerators on giant chips full of SRAM. Not a cost effective approach outside of ML.

azeirah 142 days ago

Cerebras?

Aerroon 142 days ago

We have faster memory, it's just all used in data center cards you can't buy (and can't afford to buy).

AMD actually used HBM2 memory in their Radeon VII card back in 2019 (!!) for $700. It had 16 GB of HBM2 memory with 1 TB/s throughput.

The RTX 5080, in comparison, also has 16 GB of VRAM, but was released in 2025 and has 960 GB/s throughput. The RTX 5090 does have an edge at 1.8 TB/s bandwidth and 32 GB of VRAM, but it also costs several times more. Imagine if GPUs had gone down the path of the Radeon VII.

That being said, the data center cards from both are monstrous.

The Nvidia B200 has 180 GB of VRAM (2x90GB) offering 8.2 TB/s bandwidth (4.1 TB/s x2) released in 2024. It just costs as much as a car, but that doesn't matter, because afaik you can't even buy them individually. I think you need to buy a server system from Nvidia or Dell that will come with like 8 of these and cost you like $600k.

AMD has the Mi series. Eg AMD MI325x. 288 GB of VRAM doing 10 TB/s bandwidth and released in 2024. Same story as Nvidia: buy from an OEM that will sell you a full system with 8x of these (and if you do get your hands on one of these you need a special motherboard for them since they don't do PCIe). Supposedly a lot cheaper than Nvidia, but still probably $250k.

These are not even the latest and greatest for either company. The B300 and Mi355x are even better.

It's a shame about the socket for the MI series GPUs (and the Nvidia ones too). The MI200 and MI250X would be pretty cool to get second-hand. They are 64 GB and 128 GB VRAM GPUs, but since they use the OAM socket you need the special motherboard to run them. They're from 2021, so in a few years' time they will likely be replaced, but as a regular joe you likely can't use them.

The systems exist, you just can't have them, but you can rent them in the cloud at about $2-4 per hour per GPU.

bigyabai 142 days ago

For larger contexts, the bottleneck is probably token prefill instead of memory bandwidth. Supposedly prefill is faster on the M5+ GPUs, but still a big hurdle for pre-M5 chips.

joquarky 142 days ago

It might be advantageous to have a different memory structure altogether, bespoke to the specific task.

rustyhancock 142 days ago

Yes. I had to read it over twice, it does strike me as odd that there wasn't a base model to work with.

But it seems the biggest model available is 10B? Somewhat unusual and does make me wonder just how challenging it will be to train any model in the 100B order of magnitude.

wongarsu 142 days ago

Approximately as challenging as training a regular 100B model from scratch. Maybe a bit more challenging because there's less experience with it

The key insight of the BitNet paper was that using their custom BitLinear layer instead of normal Linear layers (as well as some more training and architecture changes) led to much, much better results than quantizing an existing model down to 1.58 bits. So you end up making a full training run in bf16 precision using the specially adapted model architecture.
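For a feel of what "1.58-bit weights" means numerically, here is a minimal sketch of the absmean rounding step as I understand it from the b1.58 paper: scale by the mean absolute weight, round, and clip to {-1, 0, 1}. It covers only that rounding step in plain Python; it is not the BitLinear layer, the activation quantization, or the training procedure, and the function name is hypothetical.

```python
def absmean_ternarize(weights, eps=1e-8):
    """Map full-precision weights to {-1, 0, 1} plus one shared scale factor."""
    scale = sum(abs(w) for w in weights) / len(weights)  # mean absolute value
    ternary = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return ternary, scale


latent = [0.9, -0.05, 0.4, -1.2]
ternary, scale = absmean_ternarize(latent)
print(ternary, scale)

# Approximate reconstruction used at inference time: ternary value times the scale.
print([t * scale for t in ternary])
```

During training the full-precision weights are kept and re-quantized on every forward pass, which is why the run still costs about as much as a regular one.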

naasking 142 days ago

What's unusual about it? It seems pretty standard to train small models to validate an approach, and then show that training scales with model size to 8B to 14B parameter models, which is what they did.

cat_plus_plus 142 days ago

There are 1 bit average GGUFs of large models, not perfect quality but they will hold a conversation. These days, there is also quantized finetuning to heal the damage.

august11 142 days ago

In their demo they're running a 3B model.

webXL 142 days ago

It comes from (intentionally?) misleading docs: https://github.com/microsoft/BitNet/issues/391

(only suggesting that it's intentional because it's been there so long)

verdverm 142 days ago

That issue appears to be the one that's wrong. From the technical report

> We evaluated bitnet.cpp in terms of both inference speed and energy cost. Comprehensive tests were conducted on models with various parameter sizes, ranging from 125M to 100B. specific configurations for each model are detailed in the Appendix A.

webXL 142 days ago

Thanks for pointing that out. I'll ask the issue creator if they've considered that. Would be nice if the maintainer would handle that (sigh) and link to the actual models used for testing (double sigh).

verdverm 142 days ago

From what I gather, there are no models; this is a framework for running 1-bit models, but none have been trained at that size. They are mainly demonstrating the possibility.

verdverm 142 days ago

I also don't expect those with poor MCPs to have any better CLIs or APIs, most of the big companies we want them for are not investing in DX/AX. I suspect i.e. that Intuit, if they had great APIs et al, would see it as a threat to their business.

Boy would I love to give my agent access to my Quickbooks. They pushed out an incomplete MCP and haven't touched it since.

https://github.com/intuit/quickbooks-online-mcp-server

cubefox 142 days ago

LLM account

Springtime 142 days ago

Hmm, the user joined in 2019 but had no submissions or comments until just 40 minutes ago (at least judging by the lack of a second page?) and all the comments are on AI related submissions. Benefit of doubt is it'd have to be a very dedicated lurker or dormant account they remembered they had.

Edit: oh, just recalled dang restricted Show HNs the other day to only non-new users (possibly with some other thresholds). I wonder if word got out and some are filling accounts with activity.

verdverm 142 days ago

There has been a shift among the AI accounts: they use Show HN less now. This started before dang's comment, I assume because they saw the earlier posts about the increase in quantity / decrease in quality.

I suspect that they are trying to fake engagement prior to making their first "show" post as well.

LuxBennu 142 days ago

Fair enough — I've been lurking since 2019 and picked a bad day to start commenting on everything at once. Not a bot, just overeager. I'll pace myself.

Springtime 142 days ago

Your account posted dense, opinionated and structured paragraphs mere minutes apart—sometimes the same minute—for multiple story submissions. Even with my own sometimes lengthy replies this would be infeasible to both instantly have structured opinions and type them out in time. Two of your posts were posted the same minute, with a combined word count of 146.

It feels like it'd take someone superhuman to come across different stories, have such opinions and type and submit both of these in that timeframe or queuing up comments to post rapid-fire.

Also conspicuous, as another pointed out, is that every single comment of yours uses an em dash. I occasionally use them myself (hey look, they're in this reply), but not in every single comment. Idk, if I was being seriously accused of botting I'd put more reasoning into my response about it.

Karrot_Kream 142 days ago

Lol. I know at least a few high-karma accounts that post at the same frequency, but they post about anti-AI and anti-tech topics instead, on the big tech social media sites where anti-tech opinions dominate. I guess this exempts them from scrutiny? I love these witch hunts.

Springtime 142 days ago

They post 146 words per minute across multiple different submissions with similarly structured posts? I know there are users who post frequently in other communities I'm familiar with but not in that kind of timeframe with such paragraph density or structure.

bottlepalm 142 days ago

It's scary, without the em dashes, and the rapid fire commenting of the account - who would ever realize this is a bot? Two easy to fix things, and after that it'd be very difficult to tell that this is a bot.

It's not a question of if there are other bots out there, but only what % of comments on HN right now and elsewhere are bot generated. That number is only going to increase if nothing is done.

152334H 142 days ago

Looks like gradual disempowerment is already happening - the minority of humans who are capable of spotting AI content are losing the struggle for attention on all major social networks

Jowsey 142 days ago

Agreed. This is becoming an issue, see also: https://news.ycombinator.com/item?id=47259308

orbital-decay 142 days ago

Funny enough I now involuntarily take RTFA as a slight slop signal, because all these accounts dutifully read the article before commenting, unlike most HNers who often respond to headlines.

vova_hn2 142 days ago

First they claimed that if you use em dashes you are not human

And I did not speak out

Because I was not using em dashes

Then they claimed that if you're crammar is to gud you r not hmuan

And I did not spek aut

Because mi gramar sukcs

Then they claimed that if you actually read the article that you are trying to discuss you are not human...

K0balt 142 days ago

I’ve been rounded up for things I wrote two decades ago because of my em dashes lol. The pitchfork mentality gives me little hope for how things are going to go once we have hive mind AGI robots pervasive in society.

vova_hn2 142 days ago

If I was operating a bot farm, at this point I would probably add some bots that go around and accuse legit human users (or just random users) of being bots.

The resulting confusion and frustration will make it much harder for most people to separate signal from noise.

SoftTalker 142 days ago

I once spent some time learning the proper usage of em-dashes, en-dashes, and hyphens, and tried to be conscientious about using them properly in my writing. Little did I know it would be wasted effort in the LLM era, when competent writing actually became a negative.

Not only are we losing the ability to communicate clearly without the assistance of computers, those who can are being punished for it.

pxndx 142 days ago

There's obviously an xkcd about this: https://xkcd.com/810/

yorwba 142 days ago

Not all of them do: https://news.ycombinator.com/item?id=47335156 There are evidently lots of people experimenting with different botting setups. Some do better at blending in than others.

PeterHolzwarth 142 days ago

Interesting - the account you mention, and the GP, are both doing replies that are themselves all about the same length, and also the same length between the two accounts. I get what you mean.

xdennis 142 days ago

> Funny enough I now involuntarily take RTFA

Residential Treatment Facility for Adults? Red Tail Flight Academy?

orbital-decay 142 days ago

Reading the fine article

cubefox 142 days ago

Yeah. It correctly pointed out that the editorialized HN title is wrong, there is no 100B model.

nkohari 142 days ago

I would love to understand the thought process behind this. I'm sure it's a fun experiment, to see if it's possible and so on... but what tangible benefit could there be to burning tokens to spam comments on every post?

cyanydeez 142 days ago

Check out the new QWEN coder model.

Also, aren't there different affinities for 8-bit vs. 4-bit inference?

RandomTeaParty 142 days ago

> The 1.58-bit approach

can we stop already with these decimals and just call it "1 trit" which it exactly is?
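For reference, the 1.58 is just log2(3), the information content of one three-valued digit. A short sketch of where the number comes from, plus one possible dense packing (five trits per byte, since 3^5 = 243 fits in 256). The packing scheme and function names are illustrative assumptions, not a description of what any particular framework does.

```python
import math

bits_per_trit = math.log2(3)
print(bits_per_trit)  # about 1.585


def pack5(trits):
    """Pack five values from {-1, 0, 1} into one byte as a base-3 number."""
    byte = 0
    for t in reversed(trits):
        byte = byte * 3 + (t + 1)  # shift each trit into the digit range 0..2
    return byte


def unpack5(byte):
    """Inverse of pack5: recover the five trits from one byte."""
    trits = []
    for _ in range(5):
        trits.append(byte % 3 - 1)
        byte //= 3
    return trits


example = [1, -1, 0, 1, -1]
print(pack5(example), unpack5(pack5(example)))
```

Five trits per byte works out to 1.6 bits per weight, close to the 1.585 limit, but unpacking needs division or a lookup table, unlike a plain 2-bit layout.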

hsbauauvhabzb 142 days ago

Yeah because THAT won’t confuse the average reader.

butILoveLife 142 days ago

> I run quantized 70B models locally (M2 Max 96GB, llama.cpp + LiteLLM), and memory bandwidth is always the bottleneck.

I imagine you got 96gb because you thought you'd be running models locally? Did you not know the phrase Unified Memory is marketing speak?