Depends on the licenses of said projects. It's pretty easy to whitelist MIT/Apache licensed code.
Even still, it's unclear whether an abstract machine learning model that "saw" copyrighted IP during the training process is infringing on said IP. A self-driving car sees many billboards during test rides.
It's basically an extension of "everything is a derivative work". I believe it's similar to how you cannot (and should not be able to) erase your employee's brains of the code they worked on.
When you include the source code in a project? Or distribute binaries created in whole or in part from the source code?
The licenses say exactly what the requirements are, and when you are obligated to follow those requirements. What do they say about training AI models on the code?
Think of this more along the lines of when you read a code description of how something works and then describe it to someone else. In the same way that you don't have to provide copies of the license(s) of whatever you read, you shouldn't have to do that here, because the model is simplying learning and then inferring.
It's not as simple as that though. What if I train an AI to output a book given its title. With a sufficiently expressive and overfit network it could actually memorise the training data exactly.
That means my network now includes an exact copy of those books (encoded in a weird way but that is irrelevant). Should I be able to distribute my weights and ignore the copyright on the original books? Of course not!
In this case it is very unlikely that the network weights store much data from each individual github project, but it's definitely a sliding scale.
Even still, it's unclear whether an abstract machine learning model that "saw" copyrighted IP during the training process is infringing on said IP. A self-driving car sees many billboards during test rides.