by derefr 733 days ago
I think what they're presenting as novel isn't what they have (LoRAs), but where, when, and how they make them.

Rather than just pre-baking static LoRAs to ship with the base model (e.g. one global "rewrite this in a friendly style" LoRA, etc), Apple seem to have chosen a bounded set of behaviors they want to implement as LoRAs — one for each "mode" they want their base model to operate in — and then set up a pipeline where each LoRA gets fine-tuned per user, and re-fine-tuned any time the data dependencies that go into the training dataset for the given LoRA (e.g. mail, contacts, browsing history, photos, etc) would change.

In other words, Apple are using their LoRAs as the state-keepers for what will end up feeling to the user like semi-online Direct Preference Optimization. (Compare/contrast: what Character.AI does with their chatbot response ratings.)

---

I'm not as sure, from what they've said here, whether they're also implying that these models are being trained in the background on-device.

It could very well be possible: training something that's only LoRA-sized, on a vertically-integrated platform optimized for low-energy ML, that sits around awake but doing nothing for 8 hours a day, might be practical. (Normally it'd require a non-quantized copy of the model, though. Maybe they'll waste even more of your iPhone's disk space by having both quantized and non-quantized copies of the model, one for fast inference and the other for dog-slow training?)
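For a rough sense of what "LoRA-sized" means, here is a back-of-the-envelope sketch. The layer width and rank are illustrative assumptions, not figures from Apple; the only thing relied on is the standard LoRA shape, where each adapted weight matrix gets a rank-r pair of small matrices beside it.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in one dense weight matrix W of shape (d_out, d_in)."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA pair: A is (rank, d_in), B is (d_out, rank)."""
    return rank * d_in + d_out * rank


# Assumed, illustrative numbers: a 2048-wide square layer and a rank-16 adapter.
D_MODEL = 2048
RANK = 16

full = full_param_count(D_MODEL, D_MODEL)
lora = lora_param_count(D_MODEL, D_MODEL, RANK)
print(f"full layer: {full} params, LoRA pair: {lora} params, ratio: {full // lora}x")
```

With these assumed numbers, the adapter pair has 1/64 of the parameters of the layer it adapts, which is why training only the adapter is so much cheaper than training the layer itself.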

But I'm guessing they've chosen not to do this — as, even if it were practical, it would mean that any cloud-offloaded queries wouldn't have access to these models.

Instead, I'm guessing the LoRA training is triggered by the iCloud servers noticing you've pushed new data to them, and throwing a lifecycle notification into a message queue of which the LoRA training system is a consumer. The training system reduces over changes to bake out a new version of any affected training datasets; bakes out new LoRAs; and then basically dumps the resulting tensor files out into your iCloud Drive, where they end up synced to all your devices.
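To make that guess concrete, here is a minimal sketch of the "reduce over changes" step. Everything in it is hypothetical: the `DEPENDENCIES` table, the adapter names, and the `ChangeEvent` type are invented for illustration, and an in-process `queue.Queue` stands in for whatever message queue would really be used. Training and syncing are not modelled at all.

```python
from collections import defaultdict
from dataclasses import dataclass
from queue import Empty, Queue

# Hypothetical: which adapters' training datasets depend on which data source.
DEPENDENCIES = {
    "mail": {"summarize", "reply"},
    "photos": {"memoji"},
    "contacts": {"reply", "memoji"},
}


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical lifecycle notification: this user pushed new data of this kind."""
    user_id: str
    source: str


def drain(queue: Queue) -> list:
    """Pull every event currently waiting in the queue without blocking."""
    events = []
    while True:
        try:
            events.append(queue.get_nowait())
        except Empty:
            return events


def affected_adapters(events: list) -> dict:
    """Reduce a batch of change events to the set of stale adapters per user."""
    stale = defaultdict(set)
    for event in events:
        deps = DEPENDENCIES.get(event.source, set())
        if deps:
            stale[event.user_id] |= deps
    return dict(stale)


if __name__ == "__main__":
    q = Queue()
    q.put(ChangeEvent("u1", "mail"))
    q.put(ChangeEvent("u1", "photos"))
    q.put(ChangeEvent("u2", "contacts"))
    for user, adapters in affected_adapters(drain(q)).items():
        # A real system would rebuild datasets and retrain here; this only reports.
        print(user, sorted(adapters))
```

The point of batching is that several changes to the same data source collapse into one retrain per affected adapter, rather than one per change.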

4 comments

There is no way they would secretly train LoRAs in the background on their users' phones. The benefits are small compared to the many potential problems. They describe some LoRA training infrastructure, which is likely using the same capacity they used to train the base models.

> ...each LoRA gets fine-tuned per user...

Apple would not implement these sophisticated user-specific LoRA training techniques without mentioning them anywhere. No big player has done anything like this, and Apple would want the credit for this innovation.

I don't think the LoRAs are fine-tuned locally at all. It sounds like they use RAG to access data.

Consider a feature from earlier in the keynote: the thing Notes (and Math Notes) does now where it fixes up your handwriting into a facsimile of your handwriting, with the resulting letters then acting semantically as text (snapping to a baseline grid; being reflowable; being interpretable as math equations) but still having the kind of long-distance context-dependent variations that can't be accomplished by just generating a "handwriting font" with glyph variations selected by ligature.

They didn't say that this is an "AI thing", but I can't honestly see how else you'd do it other than by fine-tuning a vision model on the user's own handwriting.

I didn't see the presentation, but judging by your description, this is achievable using in-context learning.

For everything other than handwriting, I don't think the LoRAs are fine-tuned locally.

Well, here's another one: they promised that your local (non-iCloud) photos don't leave the device. Yet they will now, among many other things they mentioned doing with your photos, allow you to generate "Memoji" that look like the people in your photos. Which includes the non-iCloud photos.

I can't picture any way to use RAG to do that.

I can picture a way to do that that doesn't involve any model fine-tuning, but it'd be pretty ridiculous, and the results would probably not be very good either. (Load a static image2text LoRA tuned to describe the subjects of photos; run that once over each photo as it's imported/taken, and save the resulting descriptions. Later, whenever a photo is classified as a particular subject, load up a static LLM fine-tune that summarizes down all the descriptions of photos classified as subject X so far, into a single description of the platonic ideal of subject X's appearance. Finally, when asked for a "memoji", load up a static "memoji" diffusion LoRA, and prompt it with that subject-platonic-appearance description.)

But really, isn't it easier to just fine-tune a regular diffusion base-model — one that's been pre-trained on photos of people — by feeding it your photos and their corresponding metadata (incl. the names of subjects in each photo); and then load up that LoRA and the (static) memoji-style LoRA, and prompt the model with those same people's names plus the "memoji" DreamBooth-keyword?

(Okay, admittedly, you don't need to do this with a locally-trained LoRA. You could also do it by activating the static memoji-style LoRA, and then training to produce a textual-inversion embedding that locates the subject in the memoji LoRA's latent space. But the "hard part" of that is still the training, and it's just as costly!)

That's going to be something similar to IPAdapter FaceID: https://ipadapterfaceid.com Basically you use a facial structure representation that you'd use for face recognition (which of course Apple already compute on all your photos) together with some additional feature representations to guide the image generation. No need for additional fine-tuning. A similar approach could likely be used for handwriting generation.

I believe this could be achieved by providing a seed image to the diffusion model and generating memoji based on it. This way, fine-tuning isn't required.

Yup, this is pretty much it, and DALL-E and others can do this already.

I think you’re misunderstanding what they mean by adapting to use cases. See this passage:

> The adapter models can be dynamically loaded, temporarily cached in memory, and swapped — giving our foundation model the ability to specialize itself on the fly for the task at hand

This, along with other statements in the article about keeping the base model weights unchanged, says to me that they are simply swapping out adapters on a per-app or per-task basis. I highly doubt they will fine-tune adapters on user data, since they have taken a position against this. I wonder how successful this approach will be versus merging the adapters into the base model. I can see the benefits, but there are also downsides.
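As a rough illustration of "dynamically loaded, temporarily cached in memory, and swapped", here is a minimal sketch of a least-recently-used adapter cache. It is not Apple's implementation: the `AdapterCache` class, the `loader` callback, and the task names are all hypothetical, and the "adapter" is just a placeholder object rather than real weights.

```python
from collections import OrderedDict


class AdapterCache:
    """Keeps at most `capacity` adapters in memory, evicting the least recently used."""

    def __init__(self, loader, capacity=2):
        self._loader = loader          # hypothetical: fetches adapter weights for a task
        self._capacity = capacity
        self._cache = OrderedDict()    # task name -> adapter, oldest first

    def get(self, task):
        if task in self._cache:
            self._cache.move_to_end(task)    # mark as most recently used
            return self._cache[task]
        adapter = self._loader(task)         # cache miss: load from storage
        self._cache[task] = adapter
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used adapter
        return adapter

    def cached_tasks(self):
        return list(self._cache)


if __name__ == "__main__":
    loads = []

    def fake_loader(task):
        loads.append(task)
        return {"task": task}  # stand-in for real adapter tensors

    cache = AdapterCache(fake_loader, capacity=2)
    for task in ["summarize", "proofread", "summarize", "reply"]:
        cache.get(task)
    print(loads)                 # tasks that actually hit storage
    print(cache.cached_tasks())  # adapters still resident in memory
```

The base model is never touched here, which is the trade-off the comment above is pointing at: swapping keeps one shared set of base weights and makes switching tasks cheap, whereas merging an adapter into the base weights removes the extra adapter computation at inference time but gives up that cheap switching.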

Easel has been on iMessage for a bit now: https://apps.apple.com/us/app/easel-ai/id6448734086