by tbrannan
171 days ago
Thanks! At 1M rows, I think a few things matter:

Streaming: You can't hold it all in memory. Generate in chunks, write, release, repeat.

Format choice: Parquet with row groups is fast and compresses well. SQL needs batched inserts (~1000 rows/statement). Direct DB writes via COPY skip serialization entirely and are usually fastest.

FK relationships: The real bottleneck. Pre-generate parent PKs, hold them in memory, and reference them for children. Gets tricky with complex graphs at scale.

Parallelization: Row generation is embarrassingly parallel, but writes are serial. Chunk-then-merge is on our radar but not shipped yet.

What does your stat product need: realistic distributions or pure volume/stress testing?
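
To make the streaming and FK points concrete, here is a rough sketch, not our actual pipeline. It assumes SQLite from the Python standard library as a stand-in target, and the `customers`/`orders` tables, the `chunks` helper, and the `load` function are all made up for illustration. Rows come from generators, so only one batch of about 1000 rows plus the list of parent PKs is in memory at a time.

```python
import random
import sqlite3
from itertools import islice

CHUNK = 1000  # rows per batch, in line with the ~1000 figure above


def chunks(rows, size=CHUNK):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


def gen_customers(n):
    # Parent rows: PKs are assigned up front so children can reference them.
    for pk in range(1, n + 1):
        yield (pk, f"customer-{pk}")


def gen_orders(n, parent_pks, rng):
    # Child rows: each one picks its FK from the in-memory parent PK list.
    for pk in range(1, n + 1):
        yield (pk, rng.choice(parent_pks), round(rng.uniform(1, 500), 2))


def load(conn, n_customers, n_orders, seed=0):
    """Hypothetical loader: generate a chunk, write it, release it, repeat."""
    rng = random.Random(seed)
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
        "customer_id INTEGER REFERENCES customers(id), total REAL)"
    )
    parent_pks = list(range(1, n_customers + 1))  # only the PKs are retained
    for batch in chunks(gen_customers(n_customers)):
        conn.executemany("INSERT INTO customers VALUES (?, ?)", batch)
    for batch in chunks(gen_orders(n_orders, parent_pks, rng)):
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", batch)
    conn.commit()
```

Calling `load(sqlite3.connect(":memory:"), 10_000, 1_000_000)` would exercise the same shape at the scale discussed here. With a single parent table the PK list is cheap to hold; the "complex graphs" case is where this simple approach stops being enough.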