Hacker News
by pona-a 63 days ago
I feel like normalization would be a nightmare. Consider all the mistranscriptions, OCR errors, and naming differences across library catalogs (case, parentheticals, etc.).

If we assume there's no reliable way to define a book, maybe locality-sensitive hashing (LSH) could help find records that are probably the same book.

The idea is pretty cool though.

1 comment

Good point. Normalization is deliberately scoped to 'what a human reads off the title page' rather than reconciling all possible metadata sources. LSH as a complementary fuzzy-matching layer for catalog reconciliation is exactly what the planned resolver at openusbn.org is designed to support: deterministic identifier as the anchor, probabilistic matching as the discovery tool.
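
To make the LSH idea concrete, here is a rough sketch of MinHash with banding over character trigrams of titles. This is not the openusbn.org resolver or its normalization rules; `normalize`, `candidate_pairs`, and the parameters `NUM_HASHES` and `BANDS` are hypothetical names and values picked only for illustration.

```python
import hashlib
import re
import unicodedata
from collections import defaultdict

NUM_HASHES = 32  # signature length (illustrative value)
BANDS = 16       # 16 bands of 2 rows each (illustrative value)


def normalize(title: str) -> str:
    """Hypothetical normalization: drop accents, parentheticals, case, punctuation."""
    t = unicodedata.normalize("NFKD", title)
    t = "".join(c for c in t if not unicodedata.combining(c))
    t = re.sub(r"\([^)]*\)", " ", t)           # remove parentheticals
    t = re.sub(r"[^a-z0-9]+", " ", t.lower())  # keep only letters and digits
    return " ".join(t.split())


def shingles(text: str, k: int = 3) -> set[str]:
    """Character k-grams; short strings yield a single shingle."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed: int, s: str) -> int:
    """Seeded 64-bit hash, deterministic across runs."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash(text: str) -> tuple[int, ...]:
    """MinHash signature: the minimum hash of the shingle set under each seed."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES))


def candidate_pairs(titles: list[str]) -> set[tuple[int, int]]:
    """Index pairs whose signatures agree on at least one whole band."""
    rows = NUM_HASHES // BANDS
    buckets: dict[tuple, list[int]] = defaultdict(list)
    for idx, title in enumerate(titles):
        sig = minhash(normalize(title))
        for b in range(BANDS):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs


titles = ["The Hobbit (Illustrated Edition)", "THE HOBBIT", "Moby-Dick; or, The Whale"]
pairs = candidate_pairs(titles)
```

Titles that normalize to the same string always land in the same buckets. Titles that differ by an OCR error or two only collide with some probability that rises with trigram overlap, which is why the output is a candidate list for review and not an identity claim.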