HN Mirror


by anonymouse008 1219 days ago

It was straightforward. Basically, the CSVs I processed had a few sets of columns that would make natural tables, e.g. address columns that would naturally become 'Street Name', 'City', and 'Zip' tables.

The first step was just sharding the CSVs into chunks of 200 lines or so with a naming scheme, then storing those in S3.
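A minimal sketch of that sharding step, assuming the whole CSV fits in memory as text and that each shard should repeat the header row so it can be processed on its own. The function name `shard_csv`, the `shard-00000.csv` naming pattern, and the choice to return a dict instead of uploading are all illustrative assumptions, not the original code; the S3 upload is left out.

```python
import csv
import io


def shard_csv(text, rows_per_shard=200, prefix="shard"):
    """Split CSV text into shards of at most rows_per_shard data rows.

    Returns a dict mapping a shard name to that shard's CSV text.
    The naming pattern here is hypothetical.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    shards = {}

    def write_shard(rows):
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(header)   # repeat the header in every shard
        writer.writerows(rows)
        shards[f"{prefix}-{len(shards):05d}.csv"] = buf.getvalue()

    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == rows_per_shard:
            write_shard(batch)
            batch = []
    if batch:                     # leftover rows form a final, shorter shard
        write_shard(batch)
    return shards
```

Zero-padded names keep the shards sorted in order when listed, which makes it easier for a manager to hand them out sequentially.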

The next step was creating a manager Lambda that would send out the schema creation task to as many concurrent Lambda instances as I could run, and hold the remaining shards until instances freed up. That let me do the inefficient work inside the individual Lambda instances.
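The hold-until-free behaviour can be imitated locally with a bounded worker pool. This is only a stand-in, assuming threads in place of Lambda invocations; `run_manager`, `worker`, and `max_concurrency` are hypothetical names, and the real manager would invoke Lambdas rather than call a Python function.

```python
from concurrent.futures import ThreadPoolExecutor


def run_manager(shard_keys, worker, max_concurrency=4):
    """Run worker(key) for every shard, at most max_concurrency at a time.

    Shards beyond the concurrency limit wait in the pool's queue until a
    worker frees up. Returns a dict of shard key -> worker result.
    """
    keys = list(shard_keys)
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map yields results in the same order as the input keys
        return dict(zip(keys, pool.map(worker, keys)))
```

For example, `run_manager(["shard-00000.csv", "shard-00001.csv"], len, max_concurrency=1)` processes the two keys one after the other and returns their lengths keyed by name.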

The inefficient work was counting the unique values in each column against a threshold. If a column had 50 or more uniques, consider it 'dead' as a candidate for becoming a schema. If it had fewer, pass it up to the main manager thread.
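Roughly what each worker could do for one shard, as a sketch. It assumes the shard has a header row and that a column is dropped as soon as it reaches the threshold; `candidate_columns` and `max_uniques` are made-up names.

```python
import csv
import io


def candidate_columns(shard_text, max_uniques=50):
    """Return {column: set of unique values} for columns under the threshold.

    A column that reaches max_uniques distinct values is treated as 'dead'
    and is left out of the result.
    """
    reader = csv.DictReader(io.StringIO(shard_text))
    uniques = {name: set() for name in (reader.fieldnames or [])}
    for row in reader:
        for name, value in row.items():
            if name not in uniques:      # already dead, or an overflow field
                continue
            uniques[name].add(value)
            if len(uniques[name]) >= max_uniques:
                del uniques[name]        # stop tracking a dead column
    return uniques
```

Dropping a dead column early keeps the per-shard memory bounded, since no tracked set ever holds more than `max_uniques` values.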

Then the main manager thread would binary dispatch the results, combining them into a main schema with another unique-value threshold.
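One way to read "binary dispatch" is a pairwise merge: combine results two at a time until one is left. That reading is an assumption, as is everything named below (`merge_pair`, `merge_tree`, `max_uniques`). The sketch also assumes each per-shard result is a dict of column name to set of unique values, with dead columns simply absent, and that every shard has the same header.

```python
def merge_pair(left, right, max_uniques=50):
    """Merge two per-shard results, re-applying the unique-value threshold."""
    merged = {}
    # A column missing from either side was already dead there, so skip it.
    for name in left.keys() & right.keys():
        values = left[name] | right[name]
        if len(values) < max_uniques:
            merged[name] = values
    return merged


def merge_tree(results, max_uniques=50):
    """Combine results pairwise, level by level, into one main schema."""
    level = list(results)
    if not level:
        return {}
    while len(level) > 1:
        nxt = [
            merge_pair(level[i], level[i + 1], max_uniques)
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2:           # odd one out moves up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

Each level halves the number of results, and the merges within a level are independent of each other, so they could be farmed out in parallel as well.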

Overall it was a fun learning experience. The hardest part was cutting the download time to get the full file into the Lambda... I never got that right. The processing was dang fast, but the latency to get the rows into the individual threads never got 'fast fast'.

---

Good luck! Cheers!

1 comment

yosai 1219 days ago

We considered the same solution for YoBulk but intentionally did not use S3 and Lambda because of all the cloud infra and maintenance costs associated with it. You can find the details in our blog: https://www.yobulk.dev/blog/Scaling%20a%20CSV%20importer


anonymouse008 1217 days ago

Ha! Sweet - well done to you and the team.

Hope I get back on CSV projects again soon to try your stuff out. Cheers
