Hacker News
Show HN: YoBulk AI – Open Source React SDK for data cleansing (yobulk.dev)
20 points by yochin 1200 days ago
Hey Everyone,

We are excited to show YoBulk AI (https://github.com/yobulkdev/yobulkdev), an open source React SDK for data cleansing (CSVs).

CSV files are among the most common formats for storing and exchanging data. They are often used in SaaS (Software as a Service) tools for data management, such as CRM (Customer Relationship Management) systems, marketing automation platforms, and data analytics software. Data cleaning is a crucial process in data management, especially when dealing with CSV files. Here are some of the problems that can arise during data cleaning:

- Inconsistent formatting: CSV files can have inconsistent formatting, which can make it difficult to process the data. For example, different columns might have different date formats, or text might be capitalized differently.
- Missing data: CSV files may contain missing data, which can be problematic when trying to perform analyses or generate reports. It is important to identify and fill in missing data as accurately as possible.
- Duplicate data: Duplicates can occur when data is entered or imported multiple times, resulting in inaccurate analysis and reporting. SaaS tools must identify and remove duplicate data to ensure accurate insights.
- Incorrect data: Sometimes the data in CSV files is simply incorrect. This can be due to human error, incorrect data entry, or issues with the data source. It is essential to identify and correct such errors to ensure data integrity.
- Non-standardized data: CSV files may contain non-standardized data, such as inconsistent or inaccurate labels, which can make it difficult to process and analyze data. It is essential to standardize data labels to avoid confusion and inaccuracies in reports.

At YoBulk we are trying to address the above problems using open source and AI (OpenAI at the moment). YoBulk allows developers to create embeddable CSV import buttons in their web applications, which they can preset with validation rules in a matter of just a few clicks.
It also allows business users to upload third-party CSVs and collaboratively validate and cleanse the data, all with our GPT-powered data mapping and data cleansing. Plus, YoBulk is completely free and open source. Please be aware that this is a Beta release, so we will periodically (biweekly) clear data that is not being used.
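To make two of the problems listed above concrete (missing data and duplicate data), here is a minimal TypeScript sketch. It is not YoBulk code: `parseSimpleCsv` and `auditRows` are hypothetical helpers, and the parser assumes a simple CSV with no quoted fields or embedded commas.

```typescript
// A tiny CSV audit that counts missing cells and duplicate rows.
// Hypothetical helpers for illustration only; not part of YoBulk.
type Row = Record<string, string>;

// Assumes a simple CSV: one header line, no quoted fields, no embedded commas.
function parseSimpleCsv(text: string): Row[] {
  const [headerLine = "", ...body] = text.trim().split(/\r?\n/);
  const columns = headerLine.split(",").map((h) => h.trim());
  return body.map((line) => {
    const cells = line.split(",").map((c) => c.trim());
    const row: Row = {};
    columns.forEach((column, i) => {
      row[column] = cells[i] ?? "";
    });
    return row;
  });
}

interface AuditReport {
  missingCells: number;
  duplicateRows: number;
}

function auditRows(rows: Row[]): AuditReport {
  const seen = new Set<string>();
  let missingCells = 0;
  let duplicateRows = 0;
  for (const row of rows) {
    const values = Object.values(row);
    missingCells += values.filter((v) => v === "").length;
    // Compare case-insensitively so "Ada" and "ada" count as the same record.
    const key = JSON.stringify(values.map((v) => v.toLowerCase()));
    if (seen.has(key)) {
      duplicateRows += 1;
    } else {
      seen.add(key);
    }
  }
  return { missingCells, duplicateRows };
}

const sample = `name,email,signup_date
Ada,ada@example.com,2023-01-05
ada,ADA@example.com,2023-01-05
Bob,,05/01/2023`;

const report = auditRows(parseSimpleCsv(sample));
console.log(report); // { missingCells: 1, duplicateRows: 1 }
```

Note that the last row also uses a different date format from the others, which this sketch does not detect; that is the "inconsistent formatting" problem from the list.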

This release offers several significant features, including:

You can sign up using Google Auth, GitHub, or your email. There is also a new onboarding flow and free access to YoBulk's AI features. Everything else that is available in the Docker or Developer mode of YoBulk is included as well.

YoBulk has created a React Software Development Kit (SDK) and a Sample Import Button App that can be embedded in your React App. As a developer, you can generate an Import ID using YoBulk and then incorporate it into your React application. To access these resources, please visit the following links:

- https://github.com/yobulkdev/yoembed-react-sdk
- https://github.com/yobulkdev/yoembed-sample-react-app

Hosting and Deployment:

1. Cloud: https://cloud.yobulk.dev/

2. Self-hosting: YoBulk can be self-hosted and currently runs on MongoDB. GitHub: git clone git@github.com:yobulkdev/yobulkdev.git

Getting started is really simple:

Please refer to https://doc.yobulk.dev/GetStarted/Installation

Docker command:

  git clone https://github.com/yobulkdev/yobulkdev.git
  cd yobulkdev
  docker-compose up -d

Or

  docker run --rm -it -p 5050:5050/tcp yobulk/yobulk

Or

  git clone https://github.com/yobulkdev/yobulkdev
  cd yobulkdev
  yarn install
  yarn run dev

Also, please join our community at:

- GitHub: https://github.com/yobulkdev/yobulkdev
- Slack: https://join.slack.com/t/yobulkdev/signup
- Twitter: https://twitter.com/YoBulkDev
- Reddit: https://reddit.com/r/YoBulk

Would love to hear your feedback & how we can make this better.

Thank you, Team YoBulk

6 comments

This Show HN covers two major announcements and features from YoBulk.

1. YoBulk is cloud-ready now: https://cloud.yobulk.dev/ lets product managers and customer success teams check out YoBulk features without installing Docker or cloning the code base.

2. The React SDK is ready to use for embedding an import button in any React app: https://github.com/yobulkdev/yoembed-react-sdk and https://github.com/yobulkdev/yoembed-sample-react-app

And for the hosted solution, is there a provision to configure our own OpenAI key? Or does every OpenAI interaction go through YoBulk's OpenAI key?
Right now all OpenAI interactions go through YoBulk's OpenAI keys. In the future, we plan to expose this through an ENV variable to make it configurable.
Will it be possible to carry forward context? For example, if patterns in my CSV files are persistent, my corrections will be repetitive in nature and could be learnt along the way.

Looks to be a good tool overall! Where can I find the roadmap?

Yes, definitely. We are building a model based on Reinforcement Learning from Human Feedback to save the context based on the user's input, so the model will keep learning from the user's corrections.

The YoBulk roadmap is available here: https://github.com/orgs/yobulkdev/projects/2/views/1 Please have a look.
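As a rough illustration of what "carrying forward" corrections could look like in its simplest form, here is a TypeScript sketch that remembers a user's fixes per column and replays them on later rows. This is a plain lookup table, not reinforcement learning and not YoBulk's model; `CorrectionMemory` and its methods are hypothetical names.

```typescript
// Replays earlier user corrections on new rows.
// A plain lookup table for illustration; not RLHF and not YoBulk's implementation.
type Row = Record<string, string>;

class CorrectionMemory {
  // Maps "column::normalizedOriginal" to the value the user chose.
  private readonly fixes = new Map<string, string>();

  private key(column: string, value: string): string {
    return `${column}::${value.trim().toLowerCase()}`;
  }

  // Record that the user replaced `original` with `corrected` in `column`.
  remember(column: string, original: string, corrected: string): void {
    this.fixes.set(this.key(column, original), corrected);
  }

  // Return a copy of the row with every remembered correction applied.
  apply(row: Row): Row {
    const out: Row = {};
    for (const [column, value] of Object.entries(row)) {
      out[column] = this.fixes.get(this.key(column, value)) ?? value;
    }
    return out;
  }
}

const memory = new CorrectionMemory();
memory.remember("country", "U.S.", "United States");

// A later file contains the same value with different spacing and casing.
const cleaned = memory.apply({ country: " u.s. ", city: "Austin" });
console.log(cleaned); // { country: "United States", city: "Austin" }
```

A lookup like this only handles exact repeats after normalization; generalizing to unseen values is where a learned model would come in.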

Is GPT3 used to process the data itself? As opposed to something like automatically generating validation rules. If yes, how does this avoid creating plausible but incorrect data? e.g. the blurb mentions "It is important to identify and fill in missing data as accurately as possible", which sounds like a fantasy.
No, GPT3 is used only to derive the context of the data. Through YoBulk you can create validation schemas. The statement "It is important to identify and fill in missing data as accurately as possible" is valid. YoBulk creates a data lineage between cells. It learns from the user's previous actions and tries to fill in missing data based on that learning.
What are we using the OpenAI API for? Is it only for preset and validation rule creation? Or even to cleanse data? If it's being used for the latter, what about data-privacy concerns?
The OpenAI key is used to call OpenAI's LLMs and get the context of each column of a CSV file. Using that context, we can currently do smart column matching, e.g. matching Date of Birth with Age. We can even identify an age of more than 150 years without any validation rule, since the AI knows that a human cannot be older than 150. The cloud-hosted solution will be used by SaaS companies or enterprises to onboard customer data. YoBulk will use the SaaS DB for data processing only. For AI use cases, we plan to build our own LLMs, which can be deployed on the SaaS application's premises with minimal infra cost.
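For comparison, the two checks mentioned above (an age over 150, and Age agreeing with Date of Birth) can also be written as explicit rules. The TypeScript sketch below is hand-written and is not how YoBulk or OpenAI performs the check; `checkPerson`, `ageOn`, and `MAX_PLAUSIBLE_AGE` are hypothetical names, and the date of birth is assumed to be an ISO `YYYY-MM-DD` string.

```typescript
// Hand-written plausibility rules for illustration; not YoBulk's or OpenAI's logic.
const MAX_PLAUSIBLE_AGE = 150; // threshold taken from the comment above

// Age in whole years on a given day, computed in UTC.
function ageOn(dob: Date, today: Date): number {
  let age = today.getUTCFullYear() - dob.getUTCFullYear();
  const beforeBirthday =
    today.getUTCMonth() < dob.getUTCMonth() ||
    (today.getUTCMonth() === dob.getUTCMonth() &&
      today.getUTCDate() < dob.getUTCDate());
  if (beforeBirthday) age -= 1;
  return age;
}

// Returns a list of human-readable issues; an empty list means the row looks fine.
function checkPerson(dobIso: string, statedAge: number, today: Date): string[] {
  const issues: string[] = [];
  if (statedAge < 0 || statedAge > MAX_PLAUSIBLE_AGE) {
    issues.push("implausible age");
  }
  const dob = new Date(dobIso); // assumes ISO YYYY-MM-DD, parsed as UTC
  if (isNaN(dob.getTime())) {
    issues.push("unparseable date of birth");
  } else if (ageOn(dob, today) !== statedAge) {
    issues.push("age does not match date of birth");
  }
  return issues;
}

const referenceDay = new Date("2023-03-01T00:00:00Z");
console.log(checkPerson("1990-06-15", 32, referenceDay)); // []
console.log(checkPerson("1990-06-15", 200, referenceDay));
// ["implausible age", "age does not match date of birth"]
```

The point of the LLM-based approach described in the reply is that such rules would not need to be written by hand for each column pair.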
I would love to see this as a Docker Extension. If I find time, I will create an extension out of it so that it can be set up with a single click.
Yes, you can consider the cloud-hosted solution as an extension to Docker. Thanks for exploring.
Very cool