| HN Mirror

	by chaps 1219 days ago
	Have you been able to get something that might match a relational database? Auto-generation of a relational schema from a large dataset, or multiple datasets, is a deeply interesting idea.

2 comments

anonymouse008 1219 days ago

Would you really pay for this? I made one for a client, analyzing some X-million-line CSVs by sharding the records and then computing across 500 Lambda instances to arrive at a schema.

chaps 1219 days ago

Sounds lovely, but not something I would be able to do for reasons that I won't go into here. I would love to hear how you do it, though I'd understand if you can't or won't share.

anonymouse008 1219 days ago

It was straightforward. Basically, with the CSVs I processed, there were a few sets of columns that would make natural tables, e.g. addresses that would create natural 'Street Name', 'City', 'Zip' tables.

The first step was just sharding the CSVs into files of 200 lines or so with a naming scheme, then storing those in S3.
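A minimal sketch of that sharding step, for illustration only. It assumes the CSV fits in memory as text and that every shard repeats the header row; the function name `shard_csv`, the `shard-00000.csv` naming scheme, and the choice to return the shards in a dict rather than upload them to S3 are all assumptions, not details from the comment.

```python
import csv
import io
from itertools import islice


def shard_csv(text, shard_size=200, prefix="shard"):
    """Split CSV text into shards of at most shard_size data rows.

    Returns {shard_name: csv_text}. Each shard repeats the header so it
    can be processed on its own. Names are zero-padded so they sort in order.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return {}  # empty input, nothing to shard

    shards = {}
    index = 0
    while True:
        rows = list(islice(reader, shard_size))
        if not rows:
            break
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(header)
        writer.writerows(rows)
        shards[f"{prefix}-{index:05d}.csv"] = buf.getvalue()
        index += 1
    return shards
```

In a setup like the one described, each value in the returned dict would be written to object storage under its key instead of being held in memory.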

The next step was creating a manager Lambda that would send out the schema creation task to however many concurrent Lambda instances I could run, and hold the remaining shards until instances freed up. That let me do the inefficient work inside the individual Lambda instances.
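The manager pattern can be sketched locally with a bounded worker pool. This is a stand-in, not the Lambda setup itself: a `ThreadPoolExecutor` plays the role of the concurrent instances, and `run_manager`, `worker`, and `max_concurrency` are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_manager(shards, worker, max_concurrency=500):
    """Run worker on every shard, with at most max_concurrency in flight.

    shards is {shard_name: shard_text}; worker takes the shard text.
    Pending shards wait in the executor's queue until a worker frees up.
    """
    keys = sorted(shards)
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # map() returns results in the same order as keys.
        results = pool.map(lambda key: worker(shards[key]), keys)
        return dict(zip(keys, results))
```

With real Lambda functions the manager would invoke a function per shard instead of calling `worker` directly, but the queue-until-free behavior is the same idea.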

The inefficient work was counting how many unique values each column had and comparing that against a threshold. If a column had 50 or more uniques, it was considered 'dead' as a schema candidate. If it had fewer, it was passed up to the main manager thread.
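One way the per-shard check might look, as a sketch under assumptions: the shard arrives as CSV text with a header row, and the worker returns the actual set of values for each surviving column so a later step can combine them. The function name `low_cardinality_columns` is hypothetical; the threshold of 50 comes from the comment.

```python
import csv
import io


def low_cardinality_columns(shard_text, threshold=50):
    """Return {column: set_of_values} for columns below the threshold.

    A column is marked 'dead' as soon as it reaches threshold distinct
    values in this shard, and is left out of the result.
    """
    reader = csv.DictReader(io.StringIO(shard_text))
    uniques = {name: set() for name in (reader.fieldnames or [])}
    dead = set()

    for row in reader:
        for name in uniques:
            if name in dead:
                continue  # stop tracking columns that already failed
            uniques[name].add(row[name])
            if len(uniques[name]) >= threshold:
                dead.add(name)
                uniques[name].clear()  # free memory for dead columns

    return {name: vals for name, vals in uniques.items() if name not in dead}
```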

Then the main manager thread would binary-dispatch the results, combining them into a main schema with another unique-value threshold.
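"Binary dispatch" is not spelled out in the comment; the sketch below reads it as a pairwise (tree) merge, which is an interpretation. It assumes each per-shard result is a dict of `{column: set_of_values}` and that a column missing from either side was already ruled out in that shard. `merge_pair` and `tree_merge` are hypothetical names.

```python
def merge_pair(left, right, threshold=50):
    """Union two per-shard results, keeping only columns still under threshold.

    A column absent from either side is treated as dead and dropped.
    """
    merged = {}
    for name in left.keys() & right.keys():
        values = left[name] | right[name]
        if len(values) < threshold:
            merged[name] = values
    return merged


def tree_merge(results, threshold=50):
    """Combine results two at a time until a single schema candidate remains."""
    if not results:
        return {}
    level = list(results)
    while len(level) > 1:
        # Merge neighbours pairwise; an odd one out is carried to the next round.
        merged = [
            merge_pair(level[i], level[i + 1], threshold)
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2:
            merged.append(level[-1])
        level = merged
    return level[0]
```

Each round halves the number of partial results, so the pairs within a round could themselves be merged in parallel.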

Overall it was a fun learning experience. The hardest part was cutting the download time to get the full file into the Lambda... I never got that right. The processing was dang fast, but the latency to get the rows into the individual threads never got 'fast fast'.

---

Good luck! Cheers!

yosai 1219 days ago

We considered the same solution for YoBulk but intentionally did not use S3 and Lambda because of the cloud infrastructure and maintenance costs associated with it. You can find the details in our blog: https://www.yobulk.dev/blog/Scaling%20a%20CSV%20importer

anonymouse008 1217 days ago

Ha! Sweet - well done to you and the team.

Hope I get back on CSV projects again soon to try your stuff out. Cheers

yosai 1219 days ago

Oh yes, you are spot on. It's in our upcoming release. Stay tuned, please.
