Hacker News
Projell.com – Simple APIs for synthetic data generation
6 points by sumitsrivastava 2446 days ago
Hi, I'm Sumit Srivastava, founder of Projell.com. We made this after dealing with data hell: low data availability, high data procurement costs, the huge time sink of data collection, and privacy concerns over user data.

This prompted me to build an easy way to generate synthetic data for machine learning models. It primarily uses GANs, but we pick whichever technique is most efficient for the specific use case.

Areas where we've found it useful are biomedical, drone imagery, satellite imagery, retail, and autonomous mobility.

As is already evident in the ImageNet challenge, the state of the art uses synthetic data to gain higher accuracy. [https://paperswithcode.com/sota/image-classification-on-imagenet]

For its autonomous vehicles, Google used millions of miles of real driving data and billions of miles of synthetic data. It is clear where the world is heading.

I would be happy to share the tools with everyone, since dealing with data is something we struggled with and we don't want anyone else to struggle with it. This is probably only the first step towards building something robust that can remove as many data hassles as possible, if not all of them.

3 comments

Hey, interesting idea! How do you plan to deal with sparse datasets? Suppose I have a biomedical dataset of 200 images; is there a minimum dataset requirement?
Hi, Farmify! This is a great question. I designed it with medical data (among other challenging verticals) in mind too. In fact, one of the medical companies we're working with has exactly the same problem as yours: sparse data. High costs and privacy concerns are keeping them from getting better ML models quickly. For them, we identify the anomalies that generalise and scale those anomalies up onto a background dataset. This helps us work around the limitations of the datasets they already have.

There is no minimum dataset requirement for now, but yeah, the more the merrier.
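
For illustration only: one simple way to read "scaling anomalies up to a background dataset" is copy-paste augmentation, where a few labelled anomaly patches are pasted at random positions onto many anomaly-free background images. The thread does not describe Projell's actual pipeline, so this is a minimal sketch under that assumption. The names `paste_patch` and `synthesize` are hypothetical, and images are plain 2D lists of pixel values to keep it dependency-free.

```python
import random


def paste_patch(background, patch, top, left):
    """Return a copy of `background` with `patch` pasted at (top, left)."""
    out = [row[:] for row in background]  # copy so the original stays intact
    for r, patch_row in enumerate(patch):
        for c, value in enumerate(patch_row):
            out[top + r][left + c] = value
    return out


def synthesize(backgrounds, patches, n_samples, seed=0):
    """Build `n_samples` synthetic images by pasting a random anomaly patch
    onto a random background (hypothetical helper, not Projell's method).

    Returns a list of (image, box) pairs, where box is
    (top, left, height, width) of the pasted anomaly.
    """
    rng = random.Random(seed)  # seeded for reproducible samples
    samples = []
    for _ in range(n_samples):
        bg = rng.choice(backgrounds)
        patch = rng.choice(patches)
        # Pick a position where the patch fits fully inside the background.
        top = rng.randint(0, len(bg) - len(patch))
        left = rng.randint(0, len(bg[0]) - len(patch[0]))
        image = paste_patch(bg, patch, top, left)
        samples.append((image, (top, left, len(patch), len(patch[0]))))
    return samples


# Usage: one 8x8 empty background and one 2x2 "anomaly" patch.
background = [[0] * 8 for _ in range(8)]
anomaly = [[9, 9], [9, 9]]
samples = synthesize([background], [anomaly], n_samples=5)
```

Each sample comes with the bounding box of the pasted anomaly, so the synthetic images are labelled for free. A real pipeline would also need blending and realistic variation (which is where generative models such as GANs come in); this sketch only shows the compositing step.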

Hi folks, OP here! Happy to answer any questions you might have.
Hi! What kind of data are you planning to generate? Is it images only?