| The results will be "bad", which you have acknowledged as a possibility. Why do it then? LLMs trained "merely" on either wikipedia or reddit alone are probably going to be very limited in capability, since there's not enough well-rounded data (esp. for wikipedia). Of course you'll find differences. Reddit is going to contain more profanity, at the very least, so the reddit-trained LLM is going to swear and use slang more. Besides, there doesn't seem to be any point in generating gibberish and comparing the gibberish, unless that's a project you really want to do. Without knowing how IB scores students' research papers, I can't comment on whether this could earn a reasonable grade, but as I said, unless you really want to do it and somehow measure the reddit model understanding slang better and swearing more readily, I personally don't see a point in doing so, given that the results will likely, as you mentioned, be somewhat "bad". The thing about bleeding-edge research on LLMs is that nobody really knows what will happen until you actually try it out. FWIW, you generally don't have to do much proper "programming" to train models these days. There are many projects on GitHub with code to train SoTA models (which in turn are just hundreds or low thousands of lines of code). The main difficulty is getting the hardware, the OS and the dependencies to work correctly, getting high-quality training data (which you don't have to worry about for your project), and tuning the hyperparameters (if you're concerned with performance). So in terms of technical feasibility, yeah, it's doable, but I am kind of concerned that the most likely main result would be the reddit model knowing more internet slang and swearing than the wikipedia one, which doesn't seem to mesh well with a high school project :D |
So that's the perfect reason to do it. You made predictions about the differences, then said "don't know how this will come out", and that's the scientific method right there.
Another interesting thing to find out in this experiment is not only "what would be the differences between a reddit education and a wikipedia education" but also what the similarities would be. How would each model answer ethics questions? How would each answer history questions, etc.?
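If it helps, here is a rough sketch of how the "differences and similarities" could be turned into numbers. It assumes you have already collected each model's answers to the same prompts as plain lists of strings. The marker list and the function names (`marker_rate`, `vocab_overlap`) are made up for illustration, and a real study would need a properly vetted word list.

```python
import re

# Hypothetical marker list for illustration only; not a vetted lexicon.
SLANG_MARKERS = {"lol", "lmao", "tbh", "imo", "damn"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def marker_rate(samples, markers=SLANG_MARKERS):
    """Fraction of all tokens across the samples that are in the marker set."""
    tokens = [tok for s in samples for tok in tokenize(s)]
    if not tokens:
        return 0.0
    return sum(tok in markers for tok in tokens) / len(tokens)

def vocab_overlap(samples_a, samples_b):
    """Jaccard similarity between the vocabularies of two sets of samples."""
    vocab_a = {tok for s in samples_a for tok in tokenize(s)}
    vocab_b = {tok for s in samples_b for tok in tokenize(s)}
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0
```

You would call `marker_rate` once per model on its answers to get the "swears and uses slang more" side, and `vocab_overlap` on the two sets of answers for a crude similarity score. Neither says anything about whether the answers are any good, so ethics or history answers would still need a human rubric.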
OP, it sounds like an interesting project! But I'm not in a position to judge whether it's feasible to get useful results.