| Not necessarily. This also exposes a weakness in the NYT lawsuit. Imagine your training corpus contains the following: - bloga.com: "I read in the NYT that 'it rains cats and dogs twice per year'" - blogb.com: "according to the NYT, 'cats and dogs level rain occurs 2 times per year'" - newssite.com: "cats and dogs rain events happen twice per year, according to the New York Times" Now you chat with an LLM trained on this data and ask it, "How many times per year does it rain cats and dogs?" It answers: "According to the New York Times, it rains cats and dogs twice per year." The NYT's own pages were never in the training data, but the claim -is- quoted and attributed to the NYT across many sources included in Common Crawl, so that wording and attribution get a higher next-token probability. Zoom that out to full articles quoted throughout the web, and you get false positives: output that looks memorized from the NYT but was actually learned from third-party quotes. |
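To make the "higher probability association" point concrete, here is a toy sketch, not how a real LLM is trained. It builds a bigram count model from three paraphrased third-party snippets like the ones above. The snippet wording, `tokenize`, and `next_token_probs` are all made up for illustration, and no NYT page is in the corpus.

```python
from collections import Counter, defaultdict
import re

# Hypothetical third-party snippets (assumption: paraphrased from the example
# above). Note that no NYT-hosted text appears anywhere in this corpus.
corpus = [
    "I read in the NYT that it rains cats and dogs twice per year",
    "according to the NYT, cats and dogs level rain occurs 2 times per year",
    "cats and dogs rain events happen twice per year, according to the New York Times",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Count bigram transitions: how often each token follows a given token.
next_counts = defaultdict(Counter)
for doc in corpus:
    tokens = tokenize(doc)
    for prev, nxt in zip(tokens, tokens[1:]):
        next_counts[prev][nxt] += 1

def next_token_probs(prev):
    """Return the empirical distribution over tokens that follow `prev`."""
    counts = next_counts[prev]
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

probs_after_the = next_token_probs("the")
print(probs_after_the)
```

In this tiny corpus every continuation of "the" is the start of an NYT attribution ("nyt" or "new"), even though the model only ever saw blogs quoting the NYT. A real model is far more complex, but the same effect is what the comment is describing.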