| HN Mirror

by tbrannan 171 days ago

Thanks! At 1M rows, I think a few things matter:

Streaming: Can't hold it all in memory. Generate in chunks, write, release, repeat.

Format choice: Parquet with row groups is fast and compresses well. SQL needs batched inserts (~1000/statement). Direct DB writes via COPY skip SQL statement serialization entirely and are usually fastest.

FK relationships: The real bottleneck. Pre-generate parent PKs, hold in memory, reference for children. Gets tricky with complex graphs at scale.

Parallelization: Row generation is embarrassingly parallel, but writes are serial. Chunk-then-merge is on our radar but not shipped yet.
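
To make the streaming and FK points concrete, here is a minimal sketch using Python's built-in sqlite3. Everything named in it (the users/orders tables, child_rows, chunks, load, CHUNK_ROWS) is made up for illustration and is not how any particular tool does it. It also batches with executemany, one call per 1000 rows, rather than building a single multi-row INSERT statement or using COPY.

```python
import itertools
import random
import sqlite3

CHUNK_ROWS = 1000  # hypothetical batch size, echoing the ~1000 figure above


def child_rows(parent_ids, total, seed=0):
    """Lazily yield (order_id, user_id, amount) tuples, one at a time."""
    rng = random.Random(seed)
    for order_id in range(1, total + 1):
        # Pick from the pre-generated parent PKs so every FK is valid.
        yield (order_id, rng.choice(parent_ids), round(rng.uniform(1, 500), 2))


def chunks(rows, size):
    """Split any iterable into lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch


def load(conn, n_parents, n_children):
    """Illustrative loader: parents first, then children chunk by chunk."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
        "user_id INTEGER NOT NULL REFERENCES users(id), amount REAL)"
    )
    # Parent PKs are generated up front and held in memory.
    parent_ids = list(range(1, n_parents + 1))
    for batch in chunks(((pid,) for pid in parent_ids), CHUNK_ROWS):
        conn.executemany("INSERT INTO users (id) VALUES (?)", batch)
    conn.commit()
    # Children: generate a chunk, write it, let it go, repeat.
    for batch in chunks(child_rows(parent_ids, n_children), CHUNK_ROWS):
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", batch)
        conn.commit()
```

Calling load(sqlite3.connect(":memory:"), 1_000, 1_000_000) keeps only the parent ID list and one chunk of children in memory at a time. With a deeper FK graph you would need one such ID pool per parent table, which is where it gets tricky.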

What does your stat product need, realistic distributions or pure volume/stress testing?

1 comments

rrr_oh_man 170 days ago

Why does this read like AI slop?

tbrannan 169 days ago

because it is, but it's still true lol