|
|
|
|
|
by Tananon
240 days ago
|
|
True, I think that's also a great usecase! Though these algorithms likely won't scale to very large datasets (e.g. millions of samples), but for smaller datasets, like fine-tuning sets, I think this would work very well. I've worked on something similar in the past that works for larger datasets (semantic deduplication: https://github.com/MinishLab/semhash). |
|