Hacker News
by james_marks 171 days ago
Congrats on being launchable!

I've written seed data scripts a number of times, so I get the need. How do you think about creating larger amounts of data?

E.g., I'm building a statistical product where the seed data needs to be 1M rows; performance differences between implementations start to matter.

1 comments

Thanks! At 1M rows, I think a few things matter:

Streaming: Can't hold it all in memory. Generate in chunks, write, release, repeat.
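A minimal sketch of the generate/write/release loop, assuming a made-up two-column schema (`id`, `value`) and CSV output. The function names and the Gaussian values are placeholders for illustration, not anything from the tool under discussion.

```python
import csv
import io
import random
from typing import Iterator, List, TextIO, Tuple

Row = Tuple[int, float]


def generate_chunks(total_rows: int, chunk_size: int, seed: int = 0) -> Iterator[List[Row]]:
    """Yield rows one chunk at a time so only one chunk is alive in memory."""
    rng = random.Random(seed)
    for start in range(0, total_rows, chunk_size):
        stop = min(start + chunk_size, total_rows)
        yield [(row_id, rng.gauss(0.0, 1.0)) for row_id in range(start, stop)]


def write_csv(out: TextIO, total_rows: int, chunk_size: int) -> int:
    """Write a header plus all rows; each chunk is dropped after it is written."""
    writer = csv.writer(out)
    writer.writerow(["id", "value"])
    written = 0
    for chunk in generate_chunks(total_rows, chunk_size):
        writer.writerows(chunk)
        written += len(chunk)
    return written
```

Peak memory then scales with `chunk_size` rather than with the total row count.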

Format choice: Parquet with row groups is fast and compresses well. SQL needs batched inserts (~1000/statement). Direct DB writes via COPY skip the per-statement SQL overhead entirely and are usually fastest.
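For the batched-insert case, here is a rough sketch using the stdlib `sqlite3` module as a stand-in database. The `seed` table, its columns, and the helper names are assumptions. The default batch is 250 rows rather than ~1000 because some SQLite builds cap bound parameters per statement at 999.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, float]


def batched(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Split any iterable of rows into lists of at most `size` rows."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


def insert_batched(conn: sqlite3.Connection, rows: Iterable[Row], batch_size: int = 250) -> int:
    """Insert rows using one multi-row INSERT statement per batch."""
    inserted = 0
    for batch in batched(rows, batch_size):
        placeholders = ", ".join(["(?, ?)"] * len(batch))
        params = [value for row in batch for value in row]
        conn.execute(f"INSERT INTO seed (id, value) VALUES {placeholders}", params)
        inserted += len(batch)
    conn.commit()  # single commit at the end keeps everything in one transaction
    return inserted
```

On Postgres the same rows would more likely go through COPY, which is not shown here.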

FK relationships: The real bottleneck. Pre-generate parent PKs, hold in memory, reference for children. Gets tricky with complex graphs at scale.
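The "pre-generate parent PKs, reference for children" idea could look roughly like this for a single parent/child pair. The `customers`/`orders` naming and the uniform choice of parent are purely hypothetical; a realistic dataset would probably want a skewed distribution instead.

```python
import random
from typing import Dict, List


def generate_parents(count: int) -> List[Dict]:
    """Create parent rows with known, sequential primary keys."""
    return [{"id": pk, "name": f"customer_{pk}"} for pk in range(1, count + 1)]


def generate_children(count: int, parent_ids: List[int], rng: random.Random) -> List[Dict]:
    """Create child rows whose foreign key always points at an existing parent."""
    return [
        {
            "id": pk,
            "customer_id": rng.choice(parent_ids),  # FK drawn from the in-memory PK list
            "amount": round(rng.uniform(1.0, 500.0), 2),
        }
        for pk in range(1, count + 1)
    ]


rng = random.Random(42)
customers = generate_parents(100)
customer_ids = [c["id"] for c in customers]  # only the PKs need to stay in memory
orders = generate_children(1_000, customer_ids, rng)
```

Holding just the integer PKs is cheap even at 1M parents; it gets harder once children have several parents or the graph has cycles.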

Parallelization: Row generation is embarrassingly parallel, but writes are serial. Chunk-then-merge is on our radar but not shipped yet.
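One way to sketch "parallel generation, serial writes" with the stdlib: workers build chunks, and a single loop writes them in order. This is an illustration under assumed names, not the chunk-then-merge design mentioned above. Each chunk gets its own seed so output does not depend on which worker ran it.

```python
import csv
import io
import random
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import List, TextIO, Tuple

Task = Tuple[int, int, int]  # (chunk_index, start_id, stop_id)


def generate_chunk(task: Task) -> List[Tuple[int, float]]:
    """Build one chunk independently; the chunk index doubles as its RNG seed."""
    chunk_index, start, stop = task
    rng = random.Random(chunk_index)
    return [(row_id, rng.random()) for row_id in range(start, stop)]


def generate_parallel_write_serial(
    out: TextIO, total_rows: int, chunk_size: int, executor: Executor
) -> int:
    """Generate chunks on the executor, then write them from a single loop."""
    tasks = [
        (index, start, min(start + chunk_size, total_rows))
        for index, start in enumerate(range(0, total_rows, chunk_size))
    ]
    writer = csv.writer(out)
    written = 0
    # Executor.map returns results in task order, so rows come out sorted by id.
    for chunk in executor.map(generate_chunk, tasks):
        writer.writerows(chunk)
        written += len(chunk)
    return written
```

A `ThreadPoolExecutor` works for a quick try; for CPU-bound generation in CPython, a `ProcessPoolExecutor` created under an `if __name__ == "__main__":` guard is the more likely fit.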

What does your stat product need, realistic distributions or pure volume/stress testing?

Why does this read like AI slop?
because it is, but it's still true lol