| Hi, LAION-5B author here. Nice tool! You can also explore the dataset here: https://rom1504.github.io/clip-retrieval/ Thanks to approximate kNN, it's possible to query and explore this 5B-sample dataset with only 2TB of local storage; anyone can download the kNN index and metadata to run it locally too. Regarding duplicates, it's indeed an interesting topic! LAION-5B deduplicated samples by URL+text, but not by image. To deduplicate by image, you need an efficient way to compute whether images A and B are the same. One idea is to compute a hash based on CLIP embeddings. A further idea would be to train a network that is actually good at dedup, and not only at similarity, by training on positive and negative pairs, e.g. with a triplet loss. Here's my plan on the topic: https://docs.google.com/document/d/1AryWpV0dD_r9x82I_quUzBuR... If anyone is interested in participating, I'd be happy to guide them. This is an open effort; just join the LAION Discord server and let's talk. |
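For readers wondering what "a hash based on CLIP embeddings" could look like, here is a rough sketch using random-hyperplane hashing (SimHash). This is only one possible reading of the idea, not necessarily what the linked plan proposes. It assumes the embeddings are already computed and passed in as plain lists of floats; the function names (`make_hyperplanes`, `simhash`, `group_by_hash`) and the bit count are made up for illustration.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def simhash(embedding, hyperplanes):
    """Pack the sign of each projection into one integer, one bit per hyperplane.

    Embeddings pointing in nearly the same direction tend to share most bits,
    so near-duplicate images are likely (not guaranteed) to collide.
    """
    bits = 0
    for plane in hyperplanes:
        dot = sum(e * p for e, p in zip(embedding, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def group_by_hash(embeddings, hyperplanes):
    """Bucket embedding indices by hash; buckets with >1 entry are dedup candidates."""
    buckets = {}
    for idx, emb in enumerate(embeddings):
        buckets.setdefault(simhash(emb, hyperplanes), []).append(idx)
    return buckets


# Toy stand-ins for image embeddings (hypothetical, 4-dimensional).
planes = make_hyperplanes(dim=4, n_bits=16)
a = [0.2, -0.5, 0.1, 0.9]
a_scaled = [2 * x for x in a]   # same direction, so same hash
b = [-x for x in a]             # opposite direction, so different hash
candidates = group_by_hash([a, a_scaled, b], planes)
```

Buckets with more than one index are only candidates: an exact comparison (or a similarity threshold on the embeddings) would still be needed to confirm duplicates, and near-duplicates that fall on opposite sides of a hyperplane will be missed unless several hash tables are used.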
Quantifying Memorization Across Neural Language Models https://arxiv.org/abs/2202.07646
Deduplicating Training Data Makes Language Models Better https://arxiv.org/abs/2107.06499 https://twitter.com/arankomatsuzaki/status/14154721921003397... https://twitter.com/katherine1ee/status/1415496898241339400