Hacker News
by chaps 1219 days ago
Have you been able to get something that might match a relational database? Auto-generation of a relational schema from a large dataset, or multiple datasets, is a deeply interesting idea.
2 comments

Would you really pay for this? I made one for a client, analyzing some X-million-line CSVs by sharding the records and then computing across 500 Lambda instances to arrive at a schema.
Sounds lovely, but not something I would be able to do for reasons that I won't go into here. I would love to hear how you do it, though I'd understand if you can't or won't share.
It was straightforward. Basically, in the CSVs I processed, there were a few sets of columns that would make natural tables, e.g. addresses that would create natural 'Street Name', 'City', 'Zip' tables.

The first step was just sharding the CSVs into chunks of 200 lines or so, named with a naming scheme, then storing those in S3.
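A minimal sketch of that sharding step, assuming each shard repeats the header row so it can be parsed on its own. The `shard_csv` name, the `shard-000000.csv` naming pattern, and the in-memory strings are illustrative assumptions; the upload to S3 is left out.

```python
import csv
import io
from typing import Iterator, List, Tuple


def _emit(prefix: str, index: int, header: List[str], rows: List[List[str]]) -> Tuple[str, str]:
    """Serialize one shard and build its (hypothetical) zero-padded name."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return f"{prefix}-{index:06d}.csv", buf.getvalue()


def shard_csv(text: str, rows_per_shard: int = 200, prefix: str = "shard") -> Iterator[Tuple[str, str]]:
    """Yield (name, csv_text) pairs of at most rows_per_shard data rows each."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return  # empty input, nothing to shard
    batch: List[List[str]] = []
    index = 0
    for row in reader:
        batch.append(row)
        if len(batch) == rows_per_shard:
            yield _emit(prefix, index, header, batch)
            batch = []
            index += 1
    if batch:  # final, possibly shorter shard
        yield _emit(prefix, index, header, batch)
```

Each yielded pair could then be written to object storage under its name.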

The next step was creating a manager Lambda that would send out the schema creation task to as many concurrent Lambda instances as I could run, and hold the remaining shards until instances freed up. That allowed me to do the inefficient work in the individual Lambda instances.
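A rough local stand-in for that manager, assuming a thread pool in place of real Lambda invocations. The `dispatch` function and its parameters are hypothetical; the pool's internal queue plays the role of holding shards until a worker frees up.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def dispatch(shards: Iterable[T], worker: Callable[[T], R], max_concurrent: int = 500) -> List[R]:
    """Run worker on every shard with at most max_concurrent running at once.

    Shards beyond the concurrency cap wait in the executor's queue.
    Results come back in the same order as the input shards.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(worker, shards))
```

In a real deployment the `worker` would invoke a Lambda function and wait for its response rather than compute locally.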

The inefficient work was counting the unique values in each column of the shard against a threshold. If a column had 50 or more uniques, it was considered 'dead' as a schema candidate. If it had fewer, its values were passed up to the main manager thread.
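The per-shard check might look something like this sketch. The function name `shard_uniques` and the convention of marking a dead column with `None` are assumptions made for illustration; only the threshold of 50 comes from the description above.

```python
import csv
import io
from typing import Dict, Optional, Set


def shard_uniques(csv_text: str, threshold: int = 50) -> Dict[str, Optional[Set[str]]]:
    """Map each column to its set of unique values, or None once it is 'dead'.

    A column goes dead as soon as it reaches `threshold` distinct values.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    result: Dict[str, Optional[Set[str]]] = {name: set() for name in (reader.fieldnames or [])}
    for row in reader:
        for name, value in row.items():
            seen = result.get(name)
            if seen is None:
                continue  # column already dead (or an overflow field with no header)
            seen.add(value)
            if len(seen) >= threshold:
                result[name] = None  # too many uniques to be a lookup-table candidate
    return result
```

Columns that survive, such as a city or zip column with few distinct values, are the ones reported back to the manager.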

Then the main manager thread would binary dispatch the results, combining them into a main schema with another unique value threshold.
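One possible reading of that combining step is a pairwise (binary) merge of the per-shard results, re-applying a threshold after each union. This is a guess at the approach, not the author's code; `merge_pair`, `merge_all`, and the `None`-means-dead convention are hypothetical.

```python
from typing import Dict, List, Optional, Set

ColumnSets = Dict[str, Optional[Set[str]]]


def merge_pair(left: ColumnSets, right: ColumnSets, threshold: int = 50) -> ColumnSets:
    """Union two per-shard results; a column dead on either side stays dead."""
    merged: ColumnSets = {}
    for name in left.keys() | right.keys():
        a = left.get(name, set())
        b = right.get(name, set())
        if a is None or b is None:
            merged[name] = None
            continue
        union = a | b
        merged[name] = None if len(union) >= threshold else union
    return merged


def merge_all(results: List[ColumnSets], threshold: int = 50) -> ColumnSets:
    """Merge results two at a time, level by level, until one remains."""
    if not results:
        return {}
    level = list(results)
    while len(level) > 1:
        nxt = [merge_pair(level[i], level[i + 1], threshold) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd one out carries over to the next level
        level = nxt
    return level[0]
```

Columns still holding a set at the end are the candidates for their own tables.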

Overall it was a fun learning experience. The hardest part was cutting the download time to get the full file into the Lambda; I never got that right. The processing was dang fast, but the latency of getting the rows into the individual threads never got 'fast fast'.

---

Good luck! Cheers!

We considered the same solution for YoBulk but intentionally did not use S3 and Lambda because of all the cloud infrastructure and maintenance costs associated with this solution. You can find the details in our blog: https://www.yobulk.dev/blog/Scaling%20a%20CSV%20importer
Ha! Sweet - well done to you and the team.

Hope I get back on CSV projects again soon to try your stuff out. Cheers

Oh yes, you are spot on. It's there in our upcoming release. Stay tuned, please.