by PaulHoule 705 days ago
I was involved in an attempt to do this kind of thing with CNNs right around the time BERT came out. It was mostly successful, and we did great projects for companies in the beverages, telecom, aviation and consumer goods space.

It worked because it also had a conventional data-processing pipeline that revolved around JSON documents.

For (2) it seems a system like that should be able to generate a script in Python, a codesigned DSL or some other language to do the conversion.

One interesting thing about the product I worked on was that it functioned as a profiler by looking at one cell at a time, so if some field has "Gruff Rhys" or "范冰冰" it could tell that was probably somebody's name, all the better if it could also see that the field label is something like "Full Name" or "姓名". I'd contrast that with more conventional column-based profilers, which might notice that a certain field only has the values "true" and "false" throughout the whole column and would probably have some rule that determines it is a boolean field.
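
To make the contrast concrete, here is a rough sketch of the two styles. It is not the product's code: the product used a trained network, while `cell_looks_like_name` below is a crude hand-written stand-in, and every name in it (`NAME_LABEL_HINTS`, `column_is_boolean`, `cell_looks_like_name`) is made up for illustration.

```python
from typing import Iterable, Optional

BOOLEAN_TOKENS = {"true", "false"}
# Hypothetical label hints; a real system would learn these, not list them.
NAME_LABEL_HINTS = {"full name", "name", "姓名"}


def column_is_boolean(values: Iterable[str]) -> bool:
    """Column-based rule: every non-empty value must be 'true' or 'false'."""
    cleaned = [v.strip().lower() for v in values if v.strip()]
    return bool(cleaned) and all(v in BOOLEAN_TOKENS for v in cleaned)


def cell_looks_like_name(value: str, label: Optional[str] = None) -> bool:
    """Cell-based guess from a single value, optionally helped by its label.

    Assumption: a heuristic stands in for the learned model. A matching
    label settles it; otherwise accept 2-4 capitalized alphabetic words,
    or a run of 2-4 CJK characters.
    """
    text = value.strip()
    if not text:
        return False
    if label is not None and label.strip().lower() in NAME_LABEL_HINTS:
        return True
    words = text.split()
    if not all(w.isalpha() for w in words):
        return False
    if all("\u4e00" <= ch <= "\u9fff" for ch in text):
        return 2 <= len(text) <= 4
    return 2 <= len(words) <= 4 and all(w[0].isupper() for w in words)
```

The column rule needs the whole column before it can say anything, while the cell check can fire on a single value buried in a JSON document, which is what made the cell-at-a-time approach fit that kind of pipeline.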

One thing that system could do is recognize private data inside unstructured data. Where I work, for instance, we have

https://www.spirion.com/sensitive-data-discovery

which scans text and other files and warns if it sees a lot of personal data, such as an Excel spreadsheet full of names, addresses and phone numbers -- even if I just made them up as test data.
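
I have no idea how Spirion does it internally, but the general "warn when there is a lot of it" behaviour can be sketched with a couple of regexes and a threshold. Everything here is an assumption for illustration: the patterns (emails and US-style phone numbers only), the function names and the default threshold of 3.

```python
import re

# Deliberately simple patterns; real scanners cover far more formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def count_personal_data(text: str) -> int:
    """Count email-like and phone-like substrings in the text."""
    return len(EMAIL_RE.findall(text)) + len(PHONE_RE.findall(text))


def looks_sensitive(text: str, threshold: int = 3) -> bool:
    """Flag the text once the number of hits reaches the threshold."""
    return count_personal_data(text) >= threshold
```

A scanner like this only sees shapes, so a spreadsheet of invented names and phone numbers trips it just as readily as a real customer list.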