by comfymatrix 2444 days ago
Out of curiosity, what stopped you from making several regexes, especially for the product numbers (ending in "-ND")? Was it mainly for edge cases or cases where it might have been slightly different?

Or was it due to data context, which I assume is more plausible? I ask because I maintain an old, fairly large, and growing project which contains about 12 different regexes and is used on messy unstructured data. I'm in the process of rewriting it into a more general framework using NER + RNNs or HMMs, but this seems like a very interesting approach.

1 comment

So regexes are still used for the weak classifiers; they are just not trusted. You say a thing ending in -ND has a 90% chance of being a part number, so you have regexes with wiggle room. Then you dictate hard knowledge, e.g. that table headers must be above data, and the MIP solver has the freedom to override the regex classifiers with knowledge from elsewhere. This works well if you have some really strong top-level structural knowledge.
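
To make that concrete, here is a rough sketch of the idea, not the actual implementation. Everything in it is an assumption for illustration: the probabilities, the header regex, the sample cells (made-up part numbers), and the function names. It also swaps the MIP solver for brute-force enumeration, which is only workable for a handful of cells but maximizes the same thing: total log-probability from the regex guesses, subject to the hard rule that headers sit above data.

```python
import itertools
import math
import re

# Weak, untrusted signals. The patterns and probabilities are illustrative guesses.
PART_RE = re.compile(r"-ND$")              # ends in "-ND", probably a part number
HEADER_RE = re.compile(r"^[A-Za-z ]+$")    # purely alphabetic, looks like a header

LABELS = ("header", "data")


def weak_scores(cell):
    """Return a soft P(label) guess for one cell instead of a hard decision."""
    if PART_RE.search(cell):
        p_data = 0.9
    elif HEADER_RE.match(cell):
        p_data = 0.2
    else:
        p_data = 0.5
    return {"header": 1.0 - p_data, "data": p_data}


def headers_above_data(labels):
    """Hard structural rule: no header may appear below a data cell."""
    seen_data = False
    for label in labels:
        if label == "data":
            seen_data = True
        elif seen_data:
            return False
    return True


def best_labelling(cells):
    """Brute-force stand-in for the MIP solve: try every labelling,
    drop the ones that break the hard rule, keep the most probable one."""
    scores = [weak_scores(c) for c in cells]
    best, best_score = None, -math.inf
    for labels in itertools.product(LABELS, repeat=len(cells)):
        if not headers_above_data(labels):
            continue
        score = sum(math.log(s[l]) for s, l in zip(scores, labels))
        if score > best_score:
            best, best_score = labels, score
    return best


# One column of a table, top to bottom (invented sample values).
column = ["Part Number", "296-1234-ND", "RESISTOR", "497-5678-ND"]
print(best_labelling(column))
```

On its own the regex thinks "RESISTOR" is 80% likely to be a header, but it sits below a cell that is very likely data, so the structural rule wins and it gets labelled as data. A real MIP formulation would presumably express the same thing with binary variables per cell and label plus linear constraints, which scales far better than enumeration.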