by viraptor
171 days ago
It doesn't matter if A takes much more time than B, if B is large enough. You're still saving resources and time by optimising B. Also, you seem to assume that every chunk will get embedded - they may be revisiting some pages where the chunks are already present in the database.

And sure, you can reject chunks, but a) the rejection isn't free, and b) you're still bound by embedding speed.
As for resource savings... not in the Wikipedia data range. If you scale up massively and go to a PB of data, going from kiru to memchunk saves you ~25 CPU days. But you also suddenly need to move from bog-standard high-CPU machines to machines supporting 164GB/s memory throughput, likely bare metal with 8 memory channels. I'm too lazy to do the math, but it's going to be a mild difference at O($100).
Again, I'm not arguing this isn't a cool achievement. But it's very much engineering fun, not "crucial optimization".
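
For anyone who wants to sanity-check the petabyte estimate above, here is a rough back-of-envelope sketch in Python. The 1 PB, 164GB/s and ~25 CPU day figures come from the comment. The price per CPU-hour is a made-up placeholder, and treating both runs as single-core is an assumption, so read the output as an order-of-magnitude check only.

```python
# Back-of-envelope check of the petabyte estimate. Figures marked
# "from the comment" are quoted from the thread; everything else is assumed.

DATA_BYTES = 1e15          # 1 PB, decimal (from the comment)
FAST_THROUGHPUT = 164e9    # 164 GB/s in bytes per second (from the comment)
CPU_DAYS_SAVED = 25        # ~25 CPU days (from the comment)

# ASSUMPTION: hypothetical price per CPU-hour, not taken from the thread.
ASSUMED_USD_PER_CPU_HOUR = 0.04

SECONDS_PER_DAY = 86_400


def seconds_at(throughput_bytes_per_s, data_bytes=DATA_BYTES):
    """Time to stream data_bytes once at the given throughput."""
    return data_bytes / throughput_bytes_per_s


def implied_baseline_throughput(cpu_days_saved, fast_seconds, data_bytes=DATA_BYTES):
    """Throughput of the slower run that the quoted saving would imply.

    ASSUMPTION: both runs are single-core, so CPU time equals wall time.
    """
    baseline_seconds = cpu_days_saved * SECONDS_PER_DAY + fast_seconds
    return data_bytes / baseline_seconds


fast_s = seconds_at(FAST_THROUGHPUT)
baseline_bps = implied_baseline_throughput(CPU_DAYS_SAVED, fast_s)
cpu_hours_saved = CPU_DAYS_SAVED * 24
cost_saved = cpu_hours_saved * ASSUMED_USD_PER_CPU_HOUR

print(f"1 PB at 164 GB/s: {fast_s / 3600:.2f} hours")
print(f"implied baseline throughput: {baseline_bps / 1e9:.2f} GB/s")
print(f"CPU-hours saved: {cpu_hours_saved}")
print(f"saving at the assumed price: ${cost_saved:.2f}")
```

Swapping in a real per-CPU-hour price, plus the premium for the high-memory-bandwidth machines, would be needed before drawing any firmer conclusion than "small either way".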