

by buran77 687 days ago

The "Mistral Pixtral multimodal model" really rolls off the tongue.

> It’s unclear which image data Mistral might have used to develop Pixtral 12B.

The days of free web scraping, especially for the richer sources of material, are almost gone, with everything from technical (API restrictions) to legal (copyright) measures building deep moats. I also wonder what they trained it on. They're not Meta or Google with endless supplies of user content, or exclusive contracts with the Reddits of the internet.

5 comments

simonw 687 days ago

What do you mean by copyright measures? Has anything changed on that front in the last two years?

My hunch is that most AI labs are already sitting on a pretty sizable collection of scraped image data - and that data from two years ago will be almost as effective as data scraped today, at least as far as image training goes.

dartos 687 days ago

The issue with image models is that their style becomes identifiable and stale quite quickly, so you’ll need a fresh intake of different, newer styles every so often, and that’s going to be harder and harder to get.

GaggiX 687 days ago

The style becoming identifiable and stale has mostly to do with CFG and almost nothing to do with the dataset: the heavy use of CFG by most models trades diversity for coherence. You don't need a constant intake of new images and styles; it's like saying that an image created two years ago is stale because it doesn't follow a new style or something.

Also Pixtral is not a text-to-image model.

p0rkbelly 687 days ago

There is the problem of literal style though. The aesthetics of, say, clothes do evolve over time: not big changes year to year, but every 3-5 years? Sure. Just laughing at the thought of a model where any generated image is stuck in, say, 1990s grunge attire.

esafak 687 days ago

CFG for Classifier-Free Guidance?

GaggiX 687 days ago

Exactly, https://arxiv.org/abs/2207.12598

Jonathan Ho, one of the authors of the CFG paper, now works for Ideogram, and Ideogram 2 is one of the very few models (or perhaps the only one) where I don't see the artifacts caused by CFG; maybe he has achieved a breakthrough.
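
For readers unfamiliar with CFG, here is a minimal sketch of the guidance step the linked paper describes: the sampler mixes an unconditional and a conditional noise prediction, and a larger scale pushes the result further toward the conditional one. The function name `cfg_combine` and the plain-list representation are assumptions made for illustration. The scale used here is the common `s` form, which corresponds to `1 + w` in the paper's notation.

```python
from typing import List


def cfg_combine(eps_uncond: List[float], eps_cond: List[float], scale: float) -> List[float]:
    """Combine two noise predictions with classifier-free guidance.

    guided = eps_uncond + scale * (eps_cond - eps_uncond)

    scale = 0 returns the unconditional prediction, scale = 1 returns the
    conditional one, and scale > 1 extrapolates past the conditional one.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy two-element "predictions" (hypothetical values, not model output).
uncond = [0.0, 1.0]
cond = [1.0, 3.0]
guided = cfg_combine(uncond, cond, 2.0)
```

With a scale above 1 every sample is pushed in the same direction away from the unconditional prediction, which is one way to picture the diversity-for-coherence trade described upthread.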

Eisenstein 687 days ago

> Built on one of Mistral’s text models, Nemo 12B, the new model can answer questions about an arbitrary number of images of an arbitrary size given either URLs or images encoded using base64, the binary-to-text encoding scheme. Similar to other multimodal models such as Anthropic’s Claude family and OpenAI’s GPT-4o, Pixtral 12B should — at least in theory — be able to perform tasks like captioning images and counting the number of objects in a photo.

This is not a diffusion model -- it doesn't create images, it answers questions.
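
As a rough illustration of the "images encoded using base64" part of the quote, the sketch below turns raw image bytes into a base64 data URL using only the Python standard library. The helper name `to_data_url` is hypothetical, and whether a given model endpoint accepts exactly this data-URL form is an assumption; the article only says base64-encoded images are supported.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw bytes as a base64 data URL (hypothetical helper)."""
    # b64encode returns bytes; decode to ASCII so it can be embedded in text/JSON.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes stand in for a real image file's contents.
example = to_data_url(b"hello")
```

In practice the bytes would come from reading an image file in binary mode, and the MIME type should match the actual image format.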

namlem 687 days ago

Train LoRAs for models that can take them

dartos 687 days ago

The issue is getting the data on newer aesthetic styles.

The more platforms lock down access to their data, the harder it’ll be for models to stay up to date on art trends.

We just haven’t had image gen around long enough to witness a major style change like the skeuomorphic iPhone icons of old to the new modern flat ones.

whimsicalism 687 days ago

solvable without additional images

dartos 687 days ago

It’s literally not.

If an artist born today develops their own style that takes the world by storm in 20 years, the image generators of the time (for this thought experiment, imagine we’re using the same image gen techniques as today) would not know about it. They wouldn’t be able to replicate it until they get enough training data on that style.

bronco21016 687 days ago

At what point does an agent sitting at a browser collecting information differ from a human?

I have multiple ad-blockers running, how am I different from a bot scouring the “free” web? I get the idea of copyright and creators wanting to be paid for their content. However, I think there are plenty of human users out there not “paying” for “free” content either. Which one is a greater loss of revenue? A collection of over a million humans? Or 100 or so corporate bots?

a2128 687 days ago

Humans use Google Chrome from their home IP address that isn't on any blacklists, and they're always happy to make an account and download an app instead of accessing a website. Or at least that's what companies think humans are.

GaggiX 687 days ago

>The days of free web scraping especially for the richer sources of material are almost gone

I would say the opposite: it has never been easier to collect a huge amount of data, in particular if you have a target. You don't even need to write a line of code if you are good at explaining to Claude 3.5 Sonnet what you want to achieve and the details.

jazzyjackson 687 days ago

You don't need a contract with reddit to scrape it, you can just add `.json` to any url and you'll get the entire thread as one object.
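
A minimal sketch of the `.json` trick, using only the standard library to build the URL. The helper name `to_json_url` and the example thread URL are made up for illustration; the code does not make any request, and whether the endpoint still responds (or how heavily it is rate limited) is not something this sketch checks.

```python
from urllib.parse import urlsplit, urlunsplit


def to_json_url(thread_url: str) -> str:
    """Append `.json` to the path of a thread URL (hypothetical helper)."""
    parts = urlsplit(thread_url)
    # Drop a trailing slash so the result ends in `title.json`, not `title/.json`.
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    # Keep any query string (e.g. ?sort=top) and fragment intact.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


# Made-up thread URL, used only to show the transformation.
json_url = to_json_url("https://www.reddit.com/r/example/comments/abc123/some_title/")
```

Fetching `json_url` with any HTTP client would then return the thread as a single JSON document, subject to whatever limits the site applies.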

8n4vidtmkvmk 687 days ago

They have very heavy rate limits on their first-party API now. I can't even delete my own content, never mind scrape.

jazzyjackson 687 days ago

well, it's called "reddit" not "modify-via-API-it" :-)

htrp 687 days ago

there are torrents all over the internet of AI training data for images and video....

img2dataset also exists