by jayers 50 days ago
It's funny: publishing work offline in books and magazines is perhaps more anonymous in the age of AI.

I pasted in a number of passages from books on my bookshelf. Predictably, stuff that I read for my English degree in university is largely in the training data and easily identifiable. Stuff that is from regional authors, or that is slightly adjacent to the cultural mainstream, makes no impression.


To clarify, because a number of posts here sort of suggest the confusion:

The article here isn't about the LLM recognizing works that were in the training data, e.g., The Old Man and the Sea off the shelf. It's about pegging the author of novel texts, like, say, some letter written by Hemingway that gets discovered next week and was never before digitized.

Yes, that makes sense. However, unless there's a significant corpus of an author's work in the training data, it won't recognize them. One of the passages that I fed into Claude was from the book Leepike Ridge by ND Wilson. Wilson has written online and in print quite a bit, but Claude couldn't guess the author and guessed that it was a passage from a noir crime novel.

Wilson is a fairly idiosyncratic writer with a distinct style, yet even so Claude couldn't guess the author of a currently published book.

I suspect that what's going on here (like others are suggesting in this thread) is that Claude is in some way biased towards certain sets of authors by its training.

It is for now.

But I'm sure the scanning operations will start scouring the earth even harder for any books that are unaffected by slop and contain niche knowledge and text, so that their models have an edge over the ones trained only on pirate collections and the Internet.

I wonder if secondhand bookshops and deceased estates are seeing bulk buyers of their stock suddenly appearing. Maybe broke governments/municipalities will start selling them entire libraries and archives to ingest.