by lenzm 1731 days ago
I don't think this is very insightful. Using the first half of each book as training data and the second half as test data still trains the model specifically on these texts and authors. That's not quite as bad as testing on the training data, but it's not great.
1 comment

Hello! OP here. I agree that the task trains for this specific subset of English writing, which isn't ideal.

For this experiment, I was primarily interested in whether the approach would work at all. My assumption is that if we can optimise for these texts, we could optimise for more representative datasets too. Perhaps you think this is a weak assumption?

Do you think testing on a sample of totally different texts from different authors would be more convincing?
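
To make the two options concrete, here is a rough sketch of the difference between splitting within each book and holding out whole authors. This is not the code from the original experiment: the `books` structure (a list of `(author, sentences)` pairs) and the function names `split_within_books` and `split_by_author` are hypothetical, chosen only for illustration.

```python
import random


def split_within_books(books):
    """First half of each book goes to training, second half to testing.

    Every author appears on both sides of the split.
    """
    train, test = [], []
    for author, sentences in books:
        mid = len(sentences) // 2
        train.extend((author, s) for s in sentences[:mid])
        test.extend((author, s) for s in sentences[mid:])
    return train, test


def split_by_author(books, test_fraction=0.5, seed=0):
    """Hold out whole authors so no test author is seen during training."""
    authors = sorted({author for author, _ in books})
    rng = random.Random(seed)  # seeded so the split is reproducible
    rng.shuffle(authors)
    n_test = max(1, round(len(authors) * test_fraction))
    held_out = set(authors[:n_test])

    train, test = [], []
    for author, sentences in books:
        target = test if author in held_out else train
        target.extend((author, s) for s in sentences)
    return train, test
```

With the first function, the set of training authors and the set of test authors are identical; with the second, they are disjoint, which is the stricter test being asked about here.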