by wiricon 812 days ago
How well does simulated data work in this space? My first stab at doing this scalably would be as follows: given a new product, physically obtain a single instance of it (or ideally a 3D model, but that seems like a big ask from manufacturers at this stage), capture images of it from every conceivable angle under a variety of lighting conditions (seems like you could automate the capture pretty well with a robotic arm to rotate the object and some kind of lighting rig), get an instance mask for each image (using a human annotator, a 3D reconstruction method, or a FG-BG segmentation model), paste those instances onto random background images (e.g. from any large image dataset), add distractor objects and other augmentations, and finally train a model on the resulting dataset. It helps that many grocery items are relatively rigid (boxes, bottles, etc.). That said, I guess this would only work for items like boxes and bottles, which always look the same; you'd need a lot more variety for things like fruit and veg, which are non-rigid and vary a lot in appearance, and we'd need to account for changing packaging as well.
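A minimal sketch of the "paste instances on random backgrounds" step, just to make the idea concrete. Everything here is an assumption for illustration: the function names `paste_instance` and `synthesize` are made up, images are plain nested lists of RGB tuples so the sketch has no dependencies (in practice you'd use an array library), and the distractor objects and other augmentations are left out.

```python
import random


def paste_instance(background, instance, mask, top, left):
    """Paste the masked pixels of a product crop onto a background.

    background: H x W grid (list of rows) of RGB tuples
    instance:   h x w grid of RGB tuples (the product crop)
    mask:       h x w grid of bools, True where the product is
    Returns the composited image and a full-size instance mask.
    """
    H, W = len(background), len(background[0])
    h, w = len(mask), len(mask[0])
    if top < 0 or left < 0 or top + h > H or left + w > W:
        raise ValueError("instance must fit inside the background")

    image = [row[:] for row in background]  # copy so the background can be reused
    full_mask = [[False] * W for _ in range(H)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                image[top + y][left + x] = instance[y][x]
                full_mask[top + y][left + x] = True
    return image, full_mask


def synthesize(background, instance, mask, rng):
    """Paste at a random position; return image, mask, and bounding box.

    Assumes the mask has at least one True pixel.
    The box is (x_min, y_min, x_max, y_max) with exclusive max edges.
    """
    H, W = len(background), len(background[0])
    h, w = len(mask), len(mask[0])
    top = rng.randint(0, H - h)    # randint includes both endpoints
    left = rng.randint(0, W - w)
    image, full_mask = paste_instance(background, instance, mask, top, left)

    ys = [y for y in range(H) if any(full_mask[y])]
    xs = [x for x in range(W) if any(full_mask[y][x] for y in range(H))]
    box = (min(xs), min(ys), max(xs) + 1, max(ys) + 1)
    return image, full_mask, box
```

Calling `synthesize(background, instance, mask, random.Random(0))` in a loop over backgrounds and captured views would give you (image, mask, box) training triples. The labels come for free from the paste position, which is the whole appeal of the approach.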
1 comment

as mentioned in another comment, "scale" is not just horizontal, it's vertical as well. with millions of products (UPCs) across different visual tolerances it's hard to generalize. your annotation method is indeed more efficient than a multistep "go take a bunch of pictures and upload them to our servers for annotators" but is still costly in terms of stakeholder buy-in, R&D, hardware costs, and indeed labor. if you can scope your verticals such that you only have, say, 1000 products the problem becomes feasible, but once you start to scale to an actual grocery store or bodega with ever-shifting visual data requirements the approach stops scaling. add in the fact that every store moves merchandise at different rates or carries localized merchandise and the problem becomes even more complex.

the simulated data also becomes an issue of cost. we have to produce a realistic (at least according to the model) digital twin that doesn't interfere too much with real data, and quantifying how far the twin is from real data matters when the model has to tell Lay's from Lay's Low Sodium.

i'm not saying it's unsolvable. it's just a difficult problem