by rom1504 1383 days ago
Hi, laion5b author here,

Nice tool!

You can also explore the dataset here: https://rom1504.github.io/clip-retrieval/

Thanks to approximate kNN, it's possible to query and explore this 5B-sample dataset with only 2TB of local storage. Anyone can download the kNN index and metadata to run it locally too.

Regarding duplicates, indeed it's an interesting topic!

Laion5b deduplicated samples by url+text, but not by image.
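To make the distinction concrete, here is a minimal sketch of url+text deduplication. The field names `url` and `text` and the function name are assumptions for illustration, not the actual LAION pipeline. Note that the same image hosted at a different URL survives, which is the gap described above.

```python
def dedup_by_url_and_text(samples):
    """Keep the first sample for each (url, text) pair, preserving order."""
    seen = set()
    unique = []
    for sample in samples:
        key = (sample["url"], sample["text"])  # hypothetical field names
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


samples = [
    {"url": "http://a.example/cat.jpg", "text": "a cat"},
    {"url": "http://a.example/cat.jpg", "text": "a cat"},  # exact repeat: dropped
    {"url": "http://b.example/cat.jpg", "text": "a cat"},  # same image, other URL: kept
]
unique = dedup_by_url_and_text(samples)
```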

To deduplicate by image you need to have an efficient way to compute whether image a and b are the same.

One idea is to compute a hash based on CLIP embeddings. A further idea would be to train a network that is actually good at dedup, and not only at similarity, by training on positive and negative pairs, e.g. with a triplet loss.
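As a rough illustration of both ideas (not the plan in the linked doc), here is a sketch under stated assumptions: the hash is a random-hyperplane sign hash, which is one common way to turn an embedding into bits, and the loss is the standard triplet formulation with squared Euclidean distance. All function names, the bit count, and the margin are hypothetical choices, and the embedding is a toy 4-dimensional vector rather than a real CLIP output.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (assumed scheme, fixed seed for repeatability)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def embedding_hash(embedding, hyperplanes):
    """One bit per hyperplane: 1 if the embedding lies on its non-negative side."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, embedding))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits; a small distance suggests near-duplicate embeddings."""
    return bin(a ^ b).count("1")


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by the margin."""
    return max(
        0.0,
        squared_distance(anchor, positive) - squared_distance(anchor, negative) + margin,
    )


planes = make_hyperplanes(dim=4, n_bits=16)
embedding = [0.3, -1.2, 0.8, 0.5]  # toy stand-in for a CLIP embedding
h_original = embedding_hash(embedding, planes)
h_rescaled = embedding_hash([2 * x for x in embedding], planes)
```

Because only the sign of each projection is kept, rescaling an embedding leaves its hash unchanged. Candidate duplicates could then be grouped by identical or low-Hamming-distance hashes before a more expensive check.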

Here's my plan on the topic https://docs.google.com/document/d/1AryWpV0dD_r9x82I_quUzBuR...

If anyone is interested in participating, I'd be happy to guide them. This is an open effort: just join the LAION Discord server and let's talk.

2 comments

You are probably very aware of it, but just to highlight the importance of this for people who aren't: data duplication degrades training and makes memorization (and therefore plagiarism, in the technical sense) more likely. For language models, this includes near-duplicates, which I'd guess would extend to images.

Quantifying Memorization Across Neural Language Models https://arxiv.org/abs/2202.07646

Deduplicating Training Data Makes Language Models Better https://arxiv.org/abs/2107.06499 https://twitter.com/arankomatsuzaki/status/14154721921003397... https://twitter.com/katherine1ee/status/1415496898241339400

I have been using the rom1504 clip retrieval tool[0] up until now, but the Datasette browser[1] seems much better for Stable Diffusion users.

When my prompt isn't working, I often want to check whether the concepts I use are even present in the dataset.

For example, inputting `Jony Ive` returns pictures of Jony Ive in Datasette and pictures of apples and dolls in clip retrieval.

(I know laion 5B is not the same as laion aesthetic 6+, but that's a lesser issue.)

[0] - https://rom1504.github.io/clip-retrieval/

[1] - https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/im...

This is due to the aesthetic scoring in the UI. Simply disable it if you want precise results rather than aesthetic ones.

It works for your example

I guess I'll disable it by default since it seems to confuse people

Done https://github.com/rom1504/clip-retrieval/commit/53e3383f58b...

Using CLIP for search is better than direct text indexing for a variety of reasons; here, for example, because it better matches what Stable Diffusion sees.

Still interesting to have a different view over the dataset!

If you want to scale this out, you could use Elasticsearch.

I see, thanks! I didn't realize that; I thought I'd want to keep aesthetic scoring enabled, since Stable Diffusion was trained on LAION-Aesthetics.

---

Also: There is a joke to be made at Jony's expense regarding the need to turn off aesthetic scoring to see his face.