by smcin 500 days ago
Never knew about LoC book Classification till now; based on what I read I'd call it a failed US-wide attempt to standardize US collections (not international ones). Neat as it is, it's not free to access ($; why??), it's not used outside the US (/Canada), it's not used as a standard by US booksellers or libraries, and it's anglocentric as noted in [0] (an alternative being Harvard–Yenching Classification, for Chinese books). It's also disappointing that, as you say, the states vary greatly in applying that segmentation.

[0]: https://en.wikipedia.org/wiki/Library_of_Congress_Classifica...


The LoC classifications are, so far as I'm aware, free from copyright restrictions as works of the US government, and to that extent it's legal to distribute them for free.

However the LoC doesn't provide machine-readable data for free so far as I'm aware.

You can acquire the entire Classification and Subject Headings as PDF files (also WordPerfect (!!!) and MS Word, possibly some other formats), though that needs some pretty complex parsing to convert to a structured data format.

(I've not tried the WP files, though those might be more amenable to conversion.)

As for "why": presumably some misguided government revenue-generating and/or business-self-interest legislation and/or regulation, e.g., library service providers who offer LoC Class/SH data and who prefer not to have free competition. (I'm speculating; I don't know this for a fact, though it seems fairly likely.)

But you can't access Classification Web without a $$$ subscription plan! (from $375 for Single User up to $1900 for 26+ Concurrent Users).

https://www.loc.gov/cds/

https://www.loc.gov/cds/classweb/

(Aaron Swartz would object. You can access US patent data for free, but not LoC Classification Web)

Pretty much my point.

I should look into the terms/conditions for that.

People won't care why it isn't freely accessible; without that, it's not going to displace ISBN (or other non-US classifications), not even inside the US, and certainly not outside it.

I'm actually genuinely surprised it isn't freely accessible; Aaron Swartz (RIP) might have gone to war over that, and might have won that war in the court of public opinion.

Hey has anyone in the govt trained an LLM on it? Given title, author, keywords, abstract, etc. predict which LoC triple classification (or Dewey Decimal Classification, or Harvard–Yenching Classification, or Chinese Library Classification, or New Classification Scheme for Chinese Libraries (NCL in Taiwan)) a book would have? That would be neat, and a good way to proliferate its use instead of ISBN. (But the US govt would still assert copyright over the LoC classification.)

ISBN and LoC Classification serve totally different purposes.

ISBN identifies a specific publication, which may or may not be a distinct work. In practice, a given published work (say, identified by an author, title, publication date, and language) might have several ISBNs associated with it, for trade hardcover, trade paperback, library edition, large print, Braille, audio book, etc. On account of how ISBNs are issued, the principal organisation is by country and publisher. This also means that the same author/title/pubdate tuple may well have widely varying ISBNs for, say, US, Canadian, UK, Australian, NZ, and other countries' versions of the same English-language text.
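The one-work-to-many-ISBNs relationship can be sketched as a small data model. This is only an illustration: the `Work` and `Edition` types, the `index_editions` helper, and every value below are hypothetical, and the ISBN strings are placeholders rather than real identifiers.

```python
from collections import defaultdict
from typing import NamedTuple


class Work(NamedTuple):
    """The abstract work: what a reader thinks of as 'the book'."""
    author: str
    title: str
    pub_year: int
    language: str


class Edition(NamedTuple):
    """One published form of a work; each gets its own ISBN."""
    isbn: str      # placeholder strings below, not real ISBNs
    fmt: str       # e.g. "hardcover", "paperback", "audio"
    country: str


def index_editions(records):
    """Map each work to every edition (and so every ISBN) issued for it."""
    by_work = defaultdict(list)
    for work, edition in records:
        by_work[work].append(edition)
    return dict(by_work)


# Made-up data, for illustration only.
work = Work("A. Author", "An Example Title", 2001, "en")
records = [
    (work, Edition("ISBN-PLACEHOLDER-1", "hardcover", "US")),
    (work, Edition("ISBN-PLACEHOLDER-2", "paperback", "US")),
    (work, Edition("ISBN-PLACEHOLDER-3", "paperback", "UK")),
]
index = index_editions(records)
print(len(index), "work,", len(index[work]), "ISBNs")
```

Nothing in an ISBN ties the three editions together; the link exists only because the records were keyed on the work.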

There are other similar identifiers such as the LoC's publication number (issued sequentially by year), the OCLC's identifier, or (for journal publications) DOI. Each of these simply identifies a distinct publication without providing any significant classification function.[1]

The LoC Classification, as the name suggests, organises a book within a subject-based ontology. Whilst different editions, formats, and/or national versions of a book might have distinct LoC Classifications, those will be tightly coupled and most of the sequence will be shared amongst those books. The LoC Classification can be used to identify substantively related material, e.g., books on economics, history, military science, religion, or whatever, in ways which ISBN simply cannot.
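A minimal Python sketch of that grouping property, assuming a simplified call-number shape (one to three class letters, a number, then Cutter and year parts). The regex, the names `parse_lcc` and `group_by_class`, and the sample call numbers are illustrative assumptions, not an official LoC format or real catalogue records.

```python
import re
from collections import defaultdict

# Simplified pattern: 1-3 class letters, a (possibly decimal) number,
# then whatever remains (Cutter numbers, year). Real call numbers vary more.
LCC_RE = re.compile(r"^\s*([A-Z]{1,3})\s*(\d+(?:\.\d+)?)\s*(.*)$")


def parse_lcc(call_number):
    """Split a call number into (class letters, class number, remainder)."""
    m = LCC_RE.match(call_number.upper())
    if m is None:
        raise ValueError(f"unrecognised call number: {call_number!r}")
    letters, number, rest = m.groups()
    return letters, float(number), rest.strip()


def group_by_class(call_numbers):
    """Group call numbers by their class letters (the broad subject area)."""
    groups = defaultdict(list)
    for cn in call_numbers:
        letters, _, _ = parse_lcc(cn)
        groups[letters].append(cn)
    return dict(groups)


# Made-up call numbers, for illustration only.
sample = ["HB171 .S52 1998", "HB171.5 .M36 2004", "SB458 .K45 2011"]
print(group_by_class(sample))
```

Items sharing a class prefix land in the same group, which is the kind of subject lookup an ISBN cannot support.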

As I've noted, the Classification is freely available, as PDFs, WordPerfect, and MS Word files, at the URLs I'd given previously. Those aren't particularly useful as machine-readable structured formats, however.

________________________________

Notes:

1. Weasel-word "significant" included as those identifiers provide some classification, but generally by year, publisher, publication, etc., and not specifically classifying the work itself.

Answering separately:

...I'd call it a failed US-wide attempt to standardize...

The LOC Classification is a system for organising a printed (and largely bound) corpus on physical shelves/stacks. That is, any given document can occupy at most one and only one location, and two or more documents cannot occupy the same location, if they comprise separate bound volumes or other similar formats (audio recordings, video recordings, maps, microfilm/microfiche, etc.).

For digital records, this constraint isn't as significant (your filesystem devs will want to ensure against multiple nonduplicate records having the same physical address, but database references and indices are less constrained).

The Subject Headings provide ways of describing works in standardised ways. Think of it as strongly similar to a tagging system, but with far more thought and history behind it. ("Folksonomy" is a term often applied to tagging systems, with some parts both of appreciation and frustration.)

Where a given work has one and only one call number fitting within the LoC classification, using additional standardised classifications such as Cutter Codes, author and publication date, etc., works typically have multiple Subject Headings. Originally the SH's were used to create cross-references in physical card catalogues. Now they provide look-up affordances in computerised catalogues (e.g., WorldCat, or independently-derived catalogues at universities or the Library of Congress itself). You'll typically find a list of LoC SH's on the copyright page of a book along with the LoC call number.

Back to the Classification: there are many criticisms raised about LoC's effort, or others (e.g., Dewey Decimal, which incidentally is not free and is subject to copyright and possibly other IP, with some amusing case history). What critics often ignore is that classifications specifically address problems of item storage and retrieval, and as such, are governed by what is within the collection, who seeks to use that material, and how. In the case of state legal classifications, absent further experience with both that section of the classification (section K of the LoC Classification) and works within it, I strongly suspect that the complexity variation is a reflection of the underlying differences in state law (as noted above) and those wishing to reference it. That is, NY, CA, and PA probably have far greater complexity and far more demanding researchers, necessitating a corresponding complexity of their subsections of that classification, than do, say, Wyoming, North Dakota, and South Dakota (among the three smallest sections of state law by my rather faltering recollection).

Both the Dewey and LoC classifications have their peculiarities, particularly in such areas as history (LoC allocates two letters, E and F, to "History of the Americas"), geography, religion, etc. In the case of Dewey, Religion (200) is divided into General Religion (200--209), Philosophy and Theory of Religion (210--219), then the 220s, 230s, 240s, 250s, 260s, 270s, and 280s go to various aspects of Christianity. All other religions get stuffed into the 290s. Cringe.

Going through LoC's Geography and History sections one finds some interesting discontinuities particularly following 1914--1917, 1920, 1939--1945, and 1990. There are of course earlier discontinuities, but the Classification's general outline was specified in the early 1800s, and largely settled by the late 1800s / early 20th century. Both the Classification and Subject Headings note many revisions and deprecated / superseding terms. Some of that might attract the attention of the present Administration, come to think of it, which would be particularly unfortunate.

The fact that the LoC's Classification and SH both have evident and reasonably-well-functioning revision and expansion processes and procedures actually seems to me a major strength of both systems. It's not an absolute argument for their adoption, but it's one which suggests strong consideration, in addition to the extant wide usage, enormous corpus catalogued, and supporting infrastructure.

30 years ago, I knew barely any more about library science than I do now, and I know basically nothing now. The idea was that of a dewy-eyed (pun intentional) idealist who wanted to build an online experience similar to wandering into the gardening section at <your favorite large bookstore> and then dialing down to the water garden part and then the Japanese water garden part.

> the LOC Classification is a system for organising a printed (and largely bound) corpus on physical shelves/stacks. That is, any given document can occupy at most one and only one location, and two or more documents cannot occupy the same location,

The last part of this is not really true. The LoC classification does not identify a unique slot on any shelf, bin or other storage system. It identifies a zone or region of such a storage system where items with this classification could be placed. There can be as many books on Japanese water gardening as authors care to produce - this has no impact on the classification of those books. The only result of the numbers increasing is that some instances of a storage system that utilized this classification (e.g. some bookstores) would need to physically grow.

The Classification doesn't establish unique positions, no, but it serves as the backbone on which those unique call numbers are generated. First the subject and sub-subject classifications, then specific identifiers generally based on title, author, and publication date.
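A rough sketch of how a full call number yields a unique shelf position within a class. It assumes a simplified ordering rule (class letters, then class number, then Cutter parts read as decimal fractions, then year); the function name `shelf_key` and the sample call numbers are made up for illustration, and real filing rules have more cases.

```python
import re


def shelf_key(call_number):
    """Build a sort key: class letters, class number, Cutter parts, year."""
    m = re.match(r"^([A-Z]{1,3})\s*(\d+(?:\.\d+)?)\s*(.*)$",
                 call_number.strip().upper())
    if m is None:
        raise ValueError(f"unrecognised call number: {call_number!r}")
    letters, number, rest = m.groups()
    cutters = []
    year = 0
    for token in rest.replace(".", " ").split():
        if token[0].isalpha():
            # Cutter digits are read as a decimal fraction,
            # so .K45 (0.45) files before .K5 (0.5).
            digits = token[1:]
            frac = float("0." + digits) if digits.isdigit() else 0.0
            cutters.append((token[0], frac))
        elif token.isdigit():
            year = int(token)
    return (letters, float(number), tuple(cutters), year)


# Made-up call numbers, for illustration only.
shelf = ["SB458 .K5 2011", "SB458 .K45 1999", "SB458 .A2 2020", "HB171 .S52 1998"]
for cn in sorted(shelf, key=shelf_key):
    print(cn)
```

The class part (`SB458`) marks out the region; the Cutter and year parts break ties among however many books end up in it.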

But the detail of the Classification serves the needs and interests of librarians and readers in that you'll, for fairly obvious reasons, need more detail where there are more works, less where there are fewer, and of course changes to reality, as any good contributing editor to The Hitchhiker's Guide to the Galaxy (the reference, not D. Adams's charming account loosely linked to it) can tell you, play havoc with pre-ordained organisational schemes.

The LoC Classification is itself only one of these. There are other library classifications, as well as a number of interesting ontologies dating back to Aristotle and including both Bacons, Diderot, encyclopedists of various stripes, and more.