Hacker News
by PaulDavisThe1st 498 days ago
Wow.

When we started Amazon, this was precisely what I wanted to do, but using Library of Congress triple classifications instead of ISBN.

It turned out to be impossible because the data provider (a mixture of Baker & Taylor (book distributors) and Books In Print) munged the triple classification into a single string, so you could not find the boundaries reliably.

Had to abandon the idea before I even really got started on it, and it would certainly have been challenging to do this sort of "flythrough" in the 1994-1995 version of "the web".

Kudos!


What are you referring to as the LoC triple classification?

I've spent quite some time looking at both the LoC Classification and the LoC Subject Headings. Sadly the LoC don't make either freely available in a useful machine-readable form, though it's possible to play games with the PDF versions. I'd been impressed by a few aspects of this; one point that particularly sticks in my mind is that the state-law section of the Classification shows a very nonuniform density of classifications amongst states. If memory serves, NY and CA are by far the most complex, with PA a somewhat distant third, and many of the "flyover" states having almost absurdly simple classifications, often quite similar. I suspect that this reflects the underlying statutory, regulatory, and judicial / caselaw complexity.

Another interesting historical factoid is that the classification and its alphabetic top-level segmentation apparently spring directly from Thomas Jefferson's personal library, which formed the origin of the LoC itself.

For those interested, there's a lot of history of the development and enlargement of the Classification in the annual reports of the Librarian of Congress to Congress, which are available at Hathi Trust.

Classification: <https://www.loc.gov/catdir/cpso/lcco/>

Subject headings: <https://id.loc.gov/authorities/subjects.html>

Annual reports:

- Recent: <https://www.loc.gov/about/reports-and-budgets/annual-reports...>

- Historical archive to ~1866: <https://catalog.hathitrust.org/Record/000072049>

Never knew about LoC book Classification till now; based on what I read I'd call it a failed US-wide attempt to standardize US collections (not international ones). Neat as it is, it's not free to access ($; why??), it's not used outside US(/Canada) and it's not used as standard by US booksellers or libraries, and it's anglocentric as noted in [0] (an alternative being Harvard–Yenching Classification, for Chinese books). It's also disappointing to hear you say that the states vary greatly in applying that segmentation.

[0]: https://en.wikipedia.org/wiki/Library_of_Congress_Classifica...

The LoC classifications are, so far as I'm aware, works of the US government and so free from copyright restrictions on distribution, and to that extent it's legal to distribute them for free.

However the LoC doesn't provide machine-readable data for free so far as I'm aware.

You can acquire the entire Classification and Subject Headings as PDF files (also WordPerfect (!!!) and MS Word, possibly some other formats), though that needs some pretty complex parsing to convert to a structured data format.

(I've not tried the WP files, though those might be more amenable to conversion.)

As for "why", presumably some misguided government revenue-generating and/or business-self-interest legislation and/or regulation, e.g., library service providers who offer LoC Class/SH data, who prefer not to have free competition. (I'm speculating, I don't know this for a fact, though it seems fairly likely.)

But you can't access Classification Web without a $$$ subscription plan! (from $375 for Single User up to $1900 for 26+ Concurrent Users).

https://www.loc.gov/cds/

https://www.loc.gov/cds/classweb/

(Aaron Swartz would object. You can access US patent data for free, but not LoC Classification Web)

Pretty much my point.

I should look into the terms/conditions for that.

People won't care why it isn't freely accessible; it's not going to displace ISBN (or other non-US classifications) without that, not even inside the US, and certainly not outside it.

I'm actually genuinely surprised it isn't freely accessible; Aaron Swartz (RIP) might have gone to war over that, and might have won that war in the court of public opinion.

Hey has anyone in the govt trained an LLM on it? Given title, author, keywords, abstract, etc. predict which LoC triple classification (or Dewey Decimal Classification, or Harvard–Yenching Classification, or Chinese Library Classification, or New Classification Scheme for Chinese Libraries (NCL in Taiwan)) a book would have? That would be neat, and a good way to proliferate its use instead of ISBN. (But the US govt would still assert copyright over the LoC classification.)

Answering separately:

...I'd call it a failed US-wide attempt to standardize...

The LOC Classification is a system for organising a printed (and largely bound) corpus on physical shelves/stacks. That is, any given document can occupy one and only one location, and two or more documents cannot occupy the same location, if they comprise separate bound volumes or other similar formats (audio recordings, video recordings, maps, microfilm/microfiche, etc.).

For digital records, this constraint isn't as significant (your filesystem devs will want to ensure against multiple nonduplicate records having the same physical address, but database references and indices are less constrained).

The Subject Headings provide ways of describing works in standardised ways. Think of it as strongly similar to a tagging system, but with far more thought and history behind it. ("Folksonomy" is a term often applied to tagging systems, with some parts both of appreciation and frustration.)

Whereas a given work has one and only one call number (built on the LoC classification plus additional standardised elements such as Cutter Codes, author and publication date, etc.), works typically have multiple Subject Headings. Originally the SH's were used to create cross-references in physical card catalogues. Now they provide look-up affordances in computerised catalogues (e.g., WorldCat, or independently-derived catalogues at universities or the Library of Congress itself). You'll typically find a list of LoC SH's on the copyright page of a book along with the LoC call number.

Back to the Classification: there are many criticisms raised about LoC's effort, or others (e.g., Dewey Decimal, which incidentally is not free and is subject to copyright and possibly other IP, with some amusing case history). What critics often ignore is that classifications specifically address problems of item storage and retrieval, and as such, are governed by what is within the collection, who seeks to use that material, and how. In the case of state legal classifications, absent further experience with both that section of the classification (section K of the LoC Classification) and works within it, I strongly suspect that the complexity variation is a reflection of the underlying differences in state law (as noted above) and those wishing to reference it. That is, NY, CA, and PA probably have far greater complexity and far more demanding researchers, necessitating a corresponding complexity of their subsections of that classification, than do, say, Wyoming, North Dakota, and South Dakota (among the three smallest sections of state law by my rather faltering recollection).

Both the Dewey and LoC classifications have their peculiarities, particularly in such areas as history (LoC allocates two alphabetic letters, E and F, to "History of the Americas"), geography, religion, etc. In the case of Dewey, Religion (200) is divided into General Religion (200--209), Philosophy and Theory of Religion (210--219), then the 220s, 230s, 240s, 250s, 260s, 270s, and 280s to various aspects of Christianity. All other religions get stuffed into the 290s. Cringe.
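To put a number on that skew, here is a minimal Python sketch that encodes only the ranges listed in the paragraph above (not the official Dewey schedules). The function names `dewey_religion_bucket` and `christianity_share` are made up for illustration.

```python
def dewey_religion_bucket(number: float) -> str:
    """Map a Dewey number in the 200s to the coarse grouping described above.

    The groupings follow the comment's summary, not the official schedules.
    """
    if not 200 <= number < 300:
        raise ValueError("not in the 200s (Religion)")
    decade = int(number) // 10 * 10  # e.g. 294.3 -> 290
    if decade == 200:
        return "General religion"
    if decade == 210:
        return "Philosophy and theory of religion"
    if decade <= 280:
        return "Christianity"
    return "Other religions"


def christianity_share() -> float:
    """Fraction of the ten 'decades' in the 200s given to Christianity."""
    decades = range(200, 300, 10)
    christian = [d for d in decades if dewey_religion_bucket(d) == "Christianity"]
    return len(christian) / len(decades)
```

By this count, seven of the ten decades in the 200s go to Christianity and one to every other religion combined.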

Going through LoC's Geography and History sections one finds some interesting discontinuities particularly following 1914--1917, 1920, 1939--1945, and 1990. There are of course earlier discontinuities, but the Classification's general outline was specified in the early 1800s, and largely settled by the late 1800s / early 20th century. Both the Classification and Subject Headings note many revisions and deprecated / superseding terms. Some of that might attract the attention of the present Administration, come to think of it, which would be particularly unfortunate.

The fact that the LoC's Classification and SH both have evident and reasonably-well-functioning revision and expansion processes and procedures actually seems to me a major strength of both systems. It's not an absolute argument for their adoption, but it's one which suggests strong consideration, in addition to the extant wide usage, enormous corpus catalogued, and supporting infrastructure.

30 years ago, I knew barely any more about library science than I do now, and I know basically nothing now. The idea was that of a dewy eyed (pun intentional) idealist who wanted to build an online experience similar to wandering into the gardening section at <your favorite large bookstore> and then dialing down the water garden part and then the Japanese water garden part.

> the LOC Classification is a system for organising a printed (and largely bound) corpus on physical shelves/stacks. That is, any given document can occupy at most one and only one location, and two or more documents cannot occupy the same location,

The last part of this is not really true. The LoC classification does not identify a unique slot on any shelf, bin or other storage system. It identifies a zone or region of such a storage system where items with this classification could be placed. There can be as many books on Japanese water gardening as authors care to produce - this has no impact on the classification of those books. The only result of the numbers increasing is that some instances of a storage system that utilized this classification (e.g. some bookstores) would need to physically grow.

The Classification doesn't establish unique positions, no, but it serves as the backbone on which those unique call numbers are generated. First the subject and sub-subject classifications, then specific identifiers generally based on title, author, and publication date.
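As a rough illustration of that layering, here is a minimal Python sketch that splits a call-number string into a class part and the item-specific parts. It assumes the common "letters, number, Cutter(s), year" layout; real call numbers have more variants (volume numbers, dates within the class part, and so on) that this does not handle. The `CALL_NUMBER` pattern, the `CallNumber` type, and `parse_call_number` are hypothetical names, and the sample strings in the usage note are illustrative only.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed layout: 1-3 class letters, a class number (optionally decimal),
# zero or more Cutter numbers (letter + digits), and an optional 4-digit year.
CALL_NUMBER = re.compile(
    r"""^\s*
    (?P<letters>[A-Z]{1,3})\s*
    (?P<number>\d+(?:\.\d+)?)\s*
    (?P<cutters>(?:\.?[A-Z]\d+\s*)*)
    (?P<year>\d{4})?\s*$""",
    re.VERBOSE,
)


class CallNumber(NamedTuple):
    class_letters: str      # subject class, e.g. "QA"
    class_number: str       # subdivision within the class, e.g. "76.73"
    cutters: List[str]      # item-level identifiers, e.g. ["P98", "L88"]
    year: Optional[str]     # publication year, if present


def parse_call_number(text: str) -> Optional[CallNumber]:
    """Return the parsed parts, or None if the string doesn't fit the assumed layout."""
    m = CALL_NUMBER.match(text.upper())
    if m is None:
        return None
    cutters = re.findall(r"[A-Z]\d+", m.group("cutters"))
    return CallNumber(m.group("letters"), m.group("number"), cutters, m.group("year"))
```

Under these assumptions, `parse_call_number("QA76.73.P98 L88 2015")` and `parse_call_number("QA76.73.P98 M37 2019")` share the same class letters and class number (the shelf "zone") and differ only in the second Cutter and the year, which is what makes each call number unique.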

But the detail of the Classification serves the needs and interests of librarians and readers in that you'll, for fairly obvious reasons, need more detail where there are more works, less where there are fewer, and of course changes to reality, as any good contributing editor to The Hitchhiker's Guide to the Galaxy (the reference, not D. Adams's charming account loosely linked to it) can tell you, play havoc with pre-ordained organisational schemes.

The LoC Classification is itself only one of these. There are other library classifications, as well as a number of interesting ontologies dating back to Aristotle and including both Bacons, Diderot, encyclopedists of various stripes, and more.

> What are you referring to as the LoC triple classification?

Lines of actually working code; lines of commented-out inactive code; lines of explanatory comments. HTH!

Naah, gotcha, the other "LoC"... But only got it on about the third occurrence.

> a mixture of Baker & Taylor (book distributors)

Having dealt with Baker & Taylor in the past, this doesn't surprise me in the least. It was one of the most technologically backwards companies I've ever dealt with. Purchase orders and reconciliations were still managed with paper, PDFs, and emails as of early 2020 (when I closed my account). I think at one point they even had me faxing documents in.

A bit tangential but one of my favorite early amzn stories is when a small group from Ingram (at the time, the other major US book distributor) came to visit us in person (they were not very far away ... by design).

It was clear that they were utterly gobsmacked that a team of 3 or 4 people could have done what we had done in the time in which we had done it. They had apparently contemplated getting into online retail directly, but saw two big problems: (a) legal and moral pushback from publishers who relied on Ingram just being a distributor, and (b) the technological challenge. I think at the time their IT staff numbered about 20 or so. They just couldn't believe what they were seeing.

Good times (there weren't very many of those for me in the first 14 months) :)