| It really is rather peculiar to me. They frame it like this (emphasis mine):

"With the preliminary publication of this dataset, we further seek to establish a community-led process to grow, improve, and use institutional data in ways that strengthen the knowledge ecosystem and assert the importance of ongoing stewardship of training data from the originating knowledge institutions themselves. To this end, we are experimenting to find the best way to release this data in a manner that facilitates collaboration. We encourage input on this process to guide the full publication of this and future dataset releases, beginning with the following decisions:

* At preliminary launch, we have published the metadata, including experimental metadata, in full for anyone to access and use.
* At preliminary launch, we have published the dataset including OCR-extracted text under a noncommercial license, and with a 'click-through' that requires users to accept this license, additional terms of use, and to share basic contact information with us so that we can engage the community in its early use.
* At preliminary launch, we have chosen to postpone the release of the raw scan images, though we will share them liberally with researchers and libraries who wish to review them. While we know AI developers and researchers are eager for more raw materials, we believe this minor friction
can help build the relationships and norms necessary to grow a collaborative community."

It is the fruit of their labour (well, the digitisation is), so it is up to them to license it as they see fit. But it feels odd to me that they want to retain this degree of control. In open source and in my own research field, the pattern we tend to follow is to release freely, observe, and then build relationships, rather than holding a "license gun" to the heads of potential collaborators.

Lastly, I have only skimmed the pre-print, but I noticed no commitment to a final license either, not even a direction for it. Thus, as a natural language processing researcher, I will steer clear of this dataset for the time being and hope the licensing situation improves. |