by NickGerleman 1169 days ago
Worth pointing out, there has been quite a bit of contention around this change, both technical, and some accusations of plagiarism/miscrediting here. https://github.com/ggerganov/llama.cpp/pull/711
7 comments

I don't get the "plagiarism/miscrediting" accusations. This was in the original PR (https://github.com/ggerganov/llama.cpp/pull/613):

> This PR was written in collaboration with @slaren. This PR is also rebased on PR #586 so please do not squash merge! Use either merge or rebase.

jart made sure that the other user got credit, in addition to making sure that their name was properly attributed in the commit log. Given all this, it feels like the drama--shouldn't exist? Like, if there's an issue with attribution, it's not because of bad faith, and I feel like a good-faith conversation could have just resolved this, instead of bringing in trolls.

That's not the original PR. jart was working on a malloc() approach that didn't work and slaren wrote all the code actually doing mmap, which jart then rebased in a random new PR, changed to support an unnecessary version change, magic numbers, a conversion tool, and WIN32 support when that was already working in the draft PR. https://archive.ph/Uva8c

This is the original PR: https://github.com/ggerganov/llama.cpp/pull/586.

Jart's archived comments:

"my changes"

"Here's how folks in the community have been reacting to my work."

"I just wrote a change that's going to let your LLaMA models load instantly..."

https://archive.ph/PyPFZ

"I'm the author"

https://archive.ph/qFrcY

"Author here..."

"Tragedy of the commons...We're talking to a group of people who live inside scientific papers and jupyer notebooks."

"My change helps inference go faster."

"The point of my change..."

"I stated my change offered a 2x improvement in memory usage."

https://archive.ph/k34V2

"I can only take credit for a 2x recrease in RAM usage."

https://archive.ph/MBPN0

"I just wrote a change that's going to let your LLaMA models load instantly, thanks to custom malloc() and the power of mmap()"

https://archive.ph/yrMwh

slaren replied to jart on HN asking her why she was doing and saying those things, and she didn't bother to reply to him, despite replying to others in that subthread within minutes. https://archive.ph/zCfiJ

Hmm, based on what you've quoted here and knowing nothing else but a few messages on AI Twitter I would invest in jart.

This is BillG-style product skill -- there is a ton of work that goes into representing a piece of software as something important and valuable that people should buy into.

Jart is a pretty exceptional engineer, even if she wrote this patch single-handedly it would hardly be a footnote in her list of professional accomplishments. This is the author of Cosmopolitan libc, redbean and APE we're talking about, after all.

That being said, it's important to attribute work properly. It can be easy to mix things up (e.g. "my patch" is excusable) but repeatedly insisting on authorship when you're not the author of the change just seems disingenuous. I'm sure it was in good faith, but since they didn't address the issue or clear anything up, it's come to this.

Dramatic, and hardly the conclusion people wanted to the story of a free performance improvement. It's not entirely contrived though, and I think the maintainer handled this exceptionally well given the circumstances.

> This is the author of Cosmopolitan libc, redbean and APE we're talking about, after all.

Is this? If she so easily misrepresented slaren's work as hers in this case, what other work isn't actually attributable to jart?

I'm all for detracting from suspicious authors, but it's unlikely Justine just steals their code wholecloth. She's been an active community member for a while, and wrote a lot of impressive software before LLMs and script kiddies democratized the whole process.

In this specific instance, jart had a communication error that she failed to clarify, and so things compounded from there. The part that she didn't author is clearly defined in Git, and the most-plausible explanation is an honest mistake. Assuming ill-intent requires you to ignore the original context of the disagreement and focus on the outrage, which pretty much says it all.

That being said, I'd love to hear what evidence you have to the contrary. Maybe you've got a link to an FTP server from 2001 with the Blinkenlights source code on it, I can't say for sure. A fraud probably doesn't write in-depth patch breakdowns on their personal blog for fun, though.

> > This PR was written in collaboration with @slaren. This PR is also rebased on PR #586 so please do not squash merge! Use either merge or rebase.

I read that PR (didn't click any links) and here on HN posted a "Great work" to jart. The reason I did that is precisely because those final lines in the PR came across as an upright acknowledgement that some people helped out. I also got the impression that jart was a co-owner of the project with all the "we"s that were thrown around.

If I was writing that PR, it would be something like "this PR consolidates slaren's mmap approach with additional work done for ... by myself". After hearing about the drama, actually reading slaren's PR, and reviewing jart's comments in issues and the PR and the hn show and tell, I am now convinced this is someone who wants to steal other people's thunder. Heck, even this front page article is yet another PR stunt. I suspect "faster fork of llama.cpp" posts will follow.

Georgi Gerganov remains for me the hacker hero here as far as LLMs are concerned -- mmap is kiddie stuff to be frank, but anyone who gets whisper and llama to work on my laptop with a handful of files (many thanks to you sir) has my technical respect. And I think he has made the right call regarding the project.

I think that Georgi regrets opening the project up so widely to PRs; he was probably happier running it on his own.
Also worth pointing out that you can follow the thread’s link to Rentry, which links to a 4chan (?) archived thread, where you can see anons getting worked up over jart being a trans internet celebrity. And unless you’re playing dumb, you have to admit they were looking for an excuse to troll jart. Unless you seriously want me to believe they were all that mad about… mmap
I can understand these folks struggling with what mmap is actually doing. But this isn't a new discussion about the qualities of MMAP versus file based IO etc. Although, many of the comments stated are quite wrong.

Related work on this problem:

1. https://www.mongodb.com/blog/post/getting-storage-engines-re... - talks about developments on MongoDB's backend to use mmap.

2. https://www.pdl.cmu.edu/PDL-FTP/Database/p13-crotty.pdf - talks about some of the cons of mmap, some of which I think are not as prevalent due to the existence of low latency, high throughput storage devices.

3. https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/_my_direct_up... - less relevant but related.

I feel significantly dumber for reading that merge request.

The one thing to understand is that the performance implications of mmap are subtle and only work when you have much more RAM than the files you're mapping in.

> only work when you have much more RAM than the files you're mapping in.

Really depends on what you're doing, like memory access patterns. I've definitely seen scenarios when mapping hundreds of gigabytes of data on dozens of gigabytes of ram where mmap has been an almost absurd performance boost over traditional I/O, both immediately but also asymptotically as all the most frequently accessed data ends up in cache and the least accessed data is paged out.

I don't disagree with the subtlety part though. It's very difficult to reason about I/O performance in general. Modern systems are like an onion of hidden performance optimization tricks and caching layers (both in software and hardware).
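To make the "hot data stays cached, cold data gets paged out" point concrete, here is a toy replay of page accesses through a strict LRU cache. This is only a sketch: real kernels use more elaborate replacement policies, and the workloads and sizes below are made up for illustration.

```python
from collections import OrderedDict

def lru_hit_rate(accesses, capacity):
    """Replay page accesses through a fixed-size LRU cache; return the hit fraction."""
    cache = OrderedDict()  # page number -> None, least recently used first
    hits = 0
    for page in accesses:
        if page in cache:
            hits += 1
            cache.move_to_end(page)        # mark as most recently used
        else:
            cache[page] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used page
    return hits / len(accesses) if accesses else 0.0

# Hypothetical workloads: 100 pages of data, room for only 10 pages in "RAM".
hot_set = [i % 5 for i in range(1000)]       # 5 hot pages touched over and over
full_sweep = [i % 100 for i in range(1000)]  # every page touched in order, repeatedly

print("hot set   :", lru_hit_rate(hot_set, 10))
print("full sweep:", lru_hit_rate(full_sweep, 10))
```

The hot-set workload hits the cache almost every time, while a repeated sweep over more pages than fit gets no benefit from an LRU cache at all, which is why the access pattern matters more than the raw file-to-RAM ratio.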

Yeah and on top of that, different systems (software and hardware combos) are different, so I can see the performance of this depending on the implementation of mmap on the system and the implementation of caches and virtual memory on the architecture. When I've debugged stuff like this, it's either been for myself in which case I know what combo I'm running on or it's been for work where we know which combinations we target and we run regression tests to observe perf implications.
> least accessed data is paged out

Aren't all the weights touched in every pass?

Speaking in general.
In this case, the main benefit is from multiple invocations of the same program.

Using mmap, you avoid doing any work at all the 2nd time you load the file.
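For anyone who hasn't used it, this is roughly what "loading" through a mapping looks like, sketched with Python's standard `mmap` module rather than the C calls llama.cpp uses. The file here is a throwaway stand-in, not a real model file.

```python
import mmap
import os
import tempfile

def read_slice(path, offset, length):
    """Map the file read-only and return only the requested bytes."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Creating the mapping reads nothing by itself. Slicing copies just
            # these bytes out, and the kernel only has to fault in the pages
            # that back them (plus whatever readahead it chooses to do).
            return mm[offset:offset + length]

# Hypothetical stand-in for a large weights file: 1 MiB of repeating bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 4096)
    path = tmp.name

chunk = read_slice(path, 512, 4)
print(chunk)
os.remove(path)
```

If a previous run already pulled those pages into the page cache, the next process that maps the same file can use them in place instead of reading and copying the file again.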

Yes- I have 35 years experience with UNIX and used to use mmapping with BLAST, a sequence search tool, as well as my own codes.

I'll repeat myself: mmap is subtle. If what you mmap is larger than your host RAM, only some of the pages will be loaded at any time, and depending on access patterns, can lead to significant paging.

What do you mean by work? The underlying page cache will keep much of the data actually cached if it's recent. Even databases like PostgreSQL use this to their advantage (https://github.com/postgres/postgres/blob/master/src/backend...).
Copying the file backed pages to heap memory and possibly having to swap them out.
I may have parsed your statement incorrectly, but I'm assuming you are talking about the copy of data when using either mmap or File IO (memcpy versus write). Whether you do File IO or mmap, there's going to be a copy. With files, the copy occurs within kernel space with data being copied into the pages in the buffer cache, with mmap the copy occurs in userspace with data being copied into the address space. Swapping can occur in the buffer cache or mmap, this is why so many databases implement their own buffer cache to ensure specific data isn't flushed, leaving them in an inconsistent state.

An advantage of copying in userspace is the ability to use more performant instructions to perform the memcopy, which the kernel does not typically have access to (https://www.mongodb.com/blog/post/getting-storage-engines-re...)

> With files, the copy occurs within kernel space with data being copied into the pages in the buffer cache, with mmap the copy occurs in userspace with data being copied into the address space.

There is no copy with mmap, the page is either unwritable or CoW. There's always a copy with read(). (But read() can still be faster and more memory efficient nevertheless.)
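The copy-on-write case is easy to see from userspace. A small sketch using Python's `mmap` with `ACCESS_COPY` (a private, copy-on-write mapping); the temporary file is just a stand-in:

```python
import mmap
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"original data")
    path = tmp.name

with open(path, "rb") as f:
    # ACCESS_COPY: writes go to private copies of the touched pages,
    # never to the file or to other processes mapping it.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as mm:
        mm[0:8] = b"MODIFIED"
        private_view = mm[:]

with open(path, "rb") as f:
    on_disk = f.read()
os.remove(path)

print(private_view)  # what this process saw through its mapping
print(on_disk)       # what is actually in the file
```

Until the write happens, the mapping shares the page cache's pages; only the page that gets written to is duplicated.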

> An advantage of copying in userspace is the ability to use more performant instructions to perform the memcopy, which the kernel does not typically have access to (https://www.mongodb.com/blog/post/getting-storage-engines-re...)

Darwin kernel does though.

I believe Linux uses the builtin old memcpy instructions on Intel, just to force CPU vendors to keep them usable.

Unfortunately Justine has attracted a peculiar fanbase+haterbase. As their numbers swell the collective intelligence and technical understanding diminishes.

So the discussions end up gravitating towards weird drama. I wish you wouldn't have linked this thread. There's going to be a bunch of stupid comments here as well about how great/awful jart is.

I'm not a fan or a hater, I didn't even know who this person was until this thread.

Does the change deserve a blog post or wild claims like "llama.cpp is 100x faster and uses half the memory!"? No. The original PR looks like a decent addition but the blog post reads as incredibly narcissistic (i.e. lots of language like "We spent several weeks volunteering" and "our project") uh whatever. It also breaks backwards compatibility when there's no technical reason it couldn't have been optional or put behind a feature flag, plus a ton of condescending language in the PR. Not really the kind of work I'd be proud of or would be advertising in a blog post.

Yes, exactly.

The claim that it uses half the memory was probably an honest mistake. The ensuing disappointment that it did not in fact halve memory usage and drama attracted trolls and white knights and is icky. The discussion around mmap I suppose is subtle and when emotion abounds can no longer be had. :/

> The original PR looks like a decent addition but the blog post reads as incredibly narcissistic

Better than most stuff I see in the corporate world.

Is this related to her advocacy for neoreactionary politics or is it just a transphobia thing?
I mean, there's also the part where she's wrong a lot.
Is mmap really that broken on Windows? Or is the poster just confused that the data stays in the page cache? But that’s what the page cache does - that memory will be used for other things if needed, but if the memory is not needed it might as well keep the old data in cache.
No, mmap on Windows is fine. A generous, charitable statement would be that the OP on that thread is very confused, but based on some comments elsewhere on this thread about jart attracting a chorus of haters, it seems more likely that they're just trolling.
There's a weird breed of programmer who only wants to see the free memory column in top be maximized. I bought all this RAM and I want to make sure none of it is used in case I want to use it later.
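If you want to see the difference between "free" and "not free but reclaimable" on a Linux box, something like this works. It is a rough sketch that only reads the `MemFree`, `Cached` and `MemAvailable` fields from `/proc/meminfo`; the helper names are made up and it prints nothing on other systems.

```python
import os

def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key:   value kB' lines into a dict of KiB values."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info

def memory_summary(info):
    """Pick out the fields relevant to the 'free' vs 'cached' confusion."""
    return {
        "free_kib": info.get("MemFree", 0),            # completely unused
        "page_cache_kib": info.get("Cached", 0),       # file data kept around, droppable
        "available_kib": info.get("MemAvailable", 0),  # kernel's estimate of usable memory
    }

if os.path.exists("/proc/meminfo"):  # Linux only
    with open("/proc/meminfo") as f:
        print(memory_summary(parse_meminfo(f.read())))
```

On a machine that has recently read large files, `available_kib` is typically much larger than `free_kib`, because cached file pages can be dropped when something else needs the memory.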
only thing this discussion has showed me is that more people need Computer Science degrees again

like, wow, mmap and paging. really guys?

I feel the same.

I maybe should not be surprised, given that we live in the era of Unity and Electron, but using mmap() to load large files should not be seen as rocket science.

And this is basically available on almost any platform with an MMU and a kernel.

Using memory mapped files is not always the right answer.

Memory mapped files have their disadvantages. The biggest disadvantage is that any disk read error (or yanking the USB drive) becomes an access violation exception (also known as a crash), just like you read from a bad pointer. You need to have robust exception handling, which is a taller order than just checking a return value.

Another disadvantage is that even when you have your pages mapped into memory, calling the page fault handler and getting your page has a cost of ~1200 CPU cycles on Windows just to do the User<->Kernel mode transition, plus the cost of actually performing the IO. "Just reading the file" skips many User<->Kernel mode transitions, so it's one per read call rather than one per page fault.
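Some back-of-the-envelope arithmetic for that second point. This is a sketch that takes the ~1200 cycles figure above at face value and ignores the actual IO cost; the file size, buffer size and page sizes are picked for illustration, and `pages_per_fault` is a made-up knob standing in for readahead or fault-around.

```python
CYCLES_PER_TRANSITION = 1200  # the estimate quoted above, not a measurement

def ceil_div(a, b):
    return (a + b - 1) // b

def page_fault_transitions(file_size, page_size=4096, pages_per_fault=1):
    """Faults needed to touch every page of a mapping once."""
    pages = ceil_div(file_size, page_size)
    return ceil_div(pages, pages_per_fault)

def read_call_transitions(file_size, buffer_size):
    """read() calls needed to pull the whole file through one buffer."""
    return ceil_div(file_size, buffer_size)

GIB = 1024 ** 3
MIB = 1024 ** 2
size = 4 * GIB  # hypothetical 4 GiB file

scenarios = {
    "mmap, 4 KiB pages": page_fault_transitions(size),
    "mmap, 2 MiB huge pages": page_fault_transitions(size, page_size=2 * MIB),
    "read(), 1 MiB buffer": read_call_transitions(size, MIB),
}
for name, n in scenarios.items():
    print(f"{name}: {n} transitions, ~{n * CYCLES_PER_TRANSITION} cycles")
```

With small pages and one fault per page, the mapping pays far more transitions than large buffered reads; larger pages or batching several pages per fault closes that gap.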

although it's true that many hardware problems exhibit as SIGBUS on memmapped memory, remember that this is an API and implementation written for high performance disk drives on important servers; for example, the ingres server on berkeley's research vax (IIRC mmap became used widely after one of the BSD 4.3 subreleases was released). IE, at the time, the idea of a drive that could be easily detached being used for production computing would have been crazy so I think crashing the app when a drive is removed is not completely insensible.
The fault will also raise a signal if there is an error reading the sector from the drive (what would be an EIO from read()). Lack of error handling in mmap isn't only a problem for removable media.
yes, that sounds like a good idea to me. Like I said: if you use mmap, the expectation is that the drive will not bork and if it does, it should terminate the application.
In addition to a drive being removed, it also happens for a network share over wifi when the connection is temporarily lost.
Wouldn't huge pages and readahead make number of page faults and context switches potentially smaller than with read()?
I think there just hasn't been a consumer application that is really resource constrained, for a long time now. Only things for enthusiasts have been. LLMs have product market fit, but running a useful one client side is resource constrained, but instead of it truly being a consumer hardware limitation, it just turns out they were never optimized to begin with - coming from the perceived "top AI/ML minds" at FAANGs, while some of the most basic optimizations are seemingly a lost art.

On the other hand, it's only been a few weeks, so maybe I should ignore this absurdity and just wait.

Probably a combination of (a) ML framework people not paying much attention to CPU inference due to already having GPUs/TPUs lying around for training - CPU inference is just for very quick experiments (b) research code has never been the best optimized for performance (c) ML people are not generally systems programmers, and a lot of systems programmers are afraid to mess with the ML code outside of low-level computation kernels (doesn't help that ML code is notoriously unreproducible).
It's indeed a very different world. This model was trained on thousands of GPUs. The weird file format corresponds to the train time sharding of the weights. And really nobody is doing CPU inference with all the GPU we have. And also the "CLI" use case seems contrived to me. If you plan to interact several times with the model and want to keep the weights in RAM, why don't you start a REPL or spin up a server?
> while some of the most basic optimizations are seemingly a lost art

mmap isn't relevant to anyone except CPU-using programmers because other hardware doesn't have virtual memory paging. Firmware programmers don't care, GPU programmers don't care.

AFAIK CUDA offers unified memory which basically works with virtual address space and page faulting in data from main memory. There is also IOMMU in general.
Many of us would like to get rid of the host CPU and have ML trainers that are just GPUs and drives and NICs all attached to a northbridge. The GPU has everything required to make disk requests over the bus, and ideally the drive can receive network messages that get plumbed straight to the drive (I'm only partially joking).
Word embeddings were big for their time (especially with subword embeddings like fastText). We mmaped word embeddings for similar reasons. But yeah, I was kinda surprised that one post about LLaMa.cpp mmap support talked about a 'fairly new technique'. mmap has been in a UNIX programmer's tool belt for literally decades.
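The embedding case looks something like this: fixed-width float32 rows in a flat file, with a lookup that only touches the pages holding the requested row. This is a sketch with a made-up file layout and a tiny `DIM`; it is not the fastText format.

```python
import mmap
import os
import struct
import tempfile

DIM = 4                          # hypothetical embedding width
ROW = struct.Struct(f"<{DIM}f")  # one row = DIM little-endian float32 values

def lookup(mm, index):
    """Decode row `index` straight out of the mapping."""
    return ROW.unpack_from(mm, index * ROW.size)

# Write a small stand-in embedding table: 1000 rows.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(1000):
        tmp.write(ROW.pack(float(i), i + 0.5, float(-i), 0.0))
    path = tmp.name

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    vec = lookup(mm, 42)
os.remove(path)

print(vec)
```

Nothing is parsed or loaded up front, so startup cost stays flat no matter how large the table is.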
> never optimized to begin with

I think the better read is that they're being adapted to new applications, constraints, and environments, all at once.

Why would Facebook care about optimizing LLAMA on a CPU for 1-2% better latency when it has a lot of A100s lying around?
> only thing this discussion has showed me is that more people need Computer Science degrees again

You have too much faith in unis. Mine did not teach me about mmap at all.

I'm in a grad program for Software Engineering. At my university, the only difference between the Comp Sci and Software Engineering degree is that comp sci requires an advanced algorithm class whereas software engineering has a capstone class where you have to work with a team to build an MVP that is unit tested, uses CI/CD, and obviously works.

I say this to highlight the parent comment. I'm essentially in a computer science program and we have learned absolutely 0 about paging or memory in any of my required courses. We practically don't touch OS anything in any of the classes. That's not to say the courses for that aren't offered but they aren't part of the core curriculum and over my time in my program, they've mostly not been offered due to lack of student interest.

I did learn how to use linked lists like a champion though!

I actually don't understand this. I think a lot of regular hn commentators are just avoiding these threads [given the questionable circumstances surrounding the related PRs and the "drama"].

We've had regular discussions on HN about various storage engines, how the latencies are cut down, etc. I share your surprise at hearing 'wow, mmap!' and all the debates in the issues as to what it actually does.

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

Part of the problem is that this is the domain of computer engineering and not computer science.

Self-respecting computer engineering curricula will cover MMUs, page tables, TLBs, hardware interrupts, and page caches; once you know about those, mmap is fairly simple to understand.

The fundamentals really haven’t changed much in the past 40 years.