by Paul-E
168 days ago
That's pretty neat! I did something similar: I built a tool[1] to import the Project Arctic Shift dumps[2] of Reddit into SQLite. It was mostly an exercise to experiment with Rust and SQLite (HN's two favorite topics). If you don't build an FTS5 index and import without WAL (--unsafe-mode), importing every Reddit comment and submission takes a bit over 24 hours and produces a ~10TB DB.

SQLite offers a lot of cool JSON features that would let you store the raw JSON and operate on that, but I eschewed them in favor of parsing only once at load time. That also lets me normalize the data a bit.

I find that building the DB is pretty "fast", but queries run much faster if I vacuum the DB immediately after building it. The vacuum operation is actually slower than the original import, taking a few days to finish.

[1] https://github.com/Paul-E/Pushshift-Importer

[2] https://github.com/ArthurHeitmann/arctic_shift/blob/master/d...
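
For anyone curious what the "parse once at load time, then vacuum" approach looks like in miniature, here is a rough Python sketch. It is not the code from the importer linked above (which is written in Rust). The function name `import_comments`, the `comment` table layout, and the choice of fields (`id`, `author`, `subreddit`, `created_utc`, `body`) are assumptions made for illustration, and turning the journal off is only a loose stand-in for whatever the tool's --unsafe-mode actually does.

```python
import json
import sqlite3


def import_comments(db_path, json_lines):
    """Parse each JSON line once and store selected fields as typed columns.

    Hypothetical helper for illustration; the schema is an assumption.
    """
    # isolation_level=None puts the connection in autocommit mode, so the
    # transaction is controlled explicitly with BEGIN/COMMIT below.
    conn = sqlite3.connect(db_path, isolation_level=None)

    # Trade crash safety for import speed: no rollback journal, no fsync.
    # A crash mid-import can corrupt the file, so only do this for a DB
    # that can be rebuilt from the dumps.
    conn.execute("PRAGMA journal_mode = OFF")
    conn.execute("PRAGMA synchronous = OFF")

    conn.execute(
        """CREATE TABLE IF NOT EXISTS comment (
               id          TEXT PRIMARY KEY,
               author      TEXT,
               subreddit   TEXT,
               created_utc INTEGER,
               body        TEXT
           )"""
    )

    def rows():
        # The JSON is parsed exactly once here; the raw text is not stored.
        for line in json_lines:
            obj = json.loads(line)
            yield (
                obj["id"],
                obj.get("author"),
                obj.get("subreddit"),
                int(obj["created_utc"]),
                obj.get("body"),
            )

    # One large transaction avoids per-row commit overhead.
    conn.execute("BEGIN")
    conn.executemany(
        "INSERT OR IGNORE INTO comment VALUES (?, ?, ?, ?, ?)", rows()
    )
    conn.execute("COMMIT")

    # VACUUM rewrites the whole database file, so it must run outside a
    # transaction and can take a long time on a large DB.
    conn.execute("VACUUM")
    return conn


if __name__ == "__main__":
    sample = [
        '{"id": "c1", "author": "alice", "subreddit": "rust", '
        '"created_utc": 1700000000, "body": "hello"}',
        '{"id": "c2", "author": "bob", "subreddit": "sqlite", '
        '"created_utc": "1700000100", "body": "world"}',
    ]
    conn = import_comments(":memory:", sample)
    print(conn.execute("SELECT COUNT(*) FROM comment").fetchone()[0])
```

The alternative mentioned above would be to store each raw JSON line in a single column and query it with SQLite's JSON functions, which keeps every field available but re-parses the text on each query.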