| HN Mirror

What I'm wrangling with is this:

I agree that a particular sequence of words is copyrightable.

What I'm struggling with is that facts _about_ that corpus of text are not copyrightable. A simple fact could be that the word "bar" is the 5th word. The 6th word is "jazz". Etc.
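To make the "positional facts" idea concrete, here is a minimal sketch. The sentence is made up for illustration and is not taken from any real work. It only shows that a complete set of facts like "the 5th word is 'bar'" carries the same information as the word sequence itself; it says nothing about how the law would treat either one.

```python
# Hypothetical example text, invented for this illustration.
text = "we met at the bar jazz played all night"

# Each "fact" is a (position, word) pair, e.g. (5, "bar") means
# "the 5th word is 'bar'". Positions are 1-based to match the prose.
facts = [(i, w) for i, w in enumerate(text.split(), start=1)]

# Sorting the facts by position and joining the words rebuilds the
# original sequence exactly.
rebuilt = " ".join(w for _, w in sorted(facts))
```

Any single fact on its own reveals almost nothing about the text, but the full list is just the text in another encoding, which is part of what makes the question awkward.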

A model is trained from these "facts" across many source documents. It is thus itself a derived "fact", given a set of training inputs and parameters, so how could _that_ then be copyrighted?

Put another way - there's the original text and then... is it turtles all the way down, and none of it can be copyrighted because it's all math and calculations derived from that?