Hacker News
by 2swarovsky 4196 days ago
Seems like an interesting project, but I don't understand why the examples are split into "training" and "validation" sets: sometimes the regex doesn't correctly extract all the strings, and I suspect this is due to the dataset splitting.
1 comment

Splitting the learning set into training and validation sets is very important. The validation set is used to select the solutions that have really generalized (that is, captured the underlying pattern) instead of memorizing the examples. If you use all the examples for training, the algorithm can overfit: it produces a solution that performs very well on the training examples but poorly when you use it, for real, on unseen text. Scoring candidates on held-out validation examples leads to better solutions.
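
To make that concrete, here is a minimal Python sketch of the idea. It is not the project's actual code: the helper names (f1_score, split_examples, select_by_validation), the F1 metric, the 50/50 split and the toy "extract the year" examples are all assumptions made for illustration. Two candidate regexes both score perfectly on the training examples, but only one of them still works on the held-out validation examples.

```python
import random
import re


def f1_score(pattern, examples):
    """F1 of the strings extracted by `pattern` against the expected strings."""
    tp = fp = fn = 0
    for text, expected in examples:
        found = set(re.findall(pattern, text))
        wanted = set(expected)
        tp += len(found & wanted)   # correctly extracted
        fp += len(found - wanted)   # extracted but not wanted
        fn += len(wanted - found)   # wanted but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def split_examples(examples, validation_fraction=0.5, seed=0):
    """Shuffle a copy of the examples and cut it into (training, validation)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]


def select_by_validation(candidates, validation):
    """Pick the candidate with the best score on examples it was not fitted to."""
    return max(candidates, key=lambda p: f1_score(p, validation))


# Hypothetical toy data: each example is (text, strings that should be extracted).
TRAINING = [
    ("born in 1987", ["1987"]),
    ("released 1999 and 2004", ["1999", "2004"]),
]
VALIDATION = [
    ("founded in 2011", ["2011"]),
    ("from 1975 to 1980", ["1975", "1980"]),
]

OVERFIT = r"1987|1999|2004"   # memorizes the training strings
GENERAL = r"\b\d{4}\b"        # captures the actual pattern: a four-digit number

if __name__ == "__main__":
    for name, pattern in [("overfit", OVERFIT), ("general", GENERAL)]:
        print(name,
              "train:", f1_score(pattern, TRAINING),
              "validation:", f1_score(pattern, VALIDATION))
    print("selected:", select_by_validation([OVERFIT, GENERAL], VALIDATION))
```

If you scored the candidates on the training examples alone you could not tell them apart. The validation examples are what expose the memorized regex, and that is also why a regex picked this way may miss a few of the training strings: it was chosen for how it behaves on text it has not seen.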