by gojomo 5503 days ago
Thanks! So on each batched load, is the previous data rewritten with the new data interleaved? Or is the key ordering such that that's never necessary?

Each batched load has no ordering. But the data I'm loading is not the same as the data I'm reading.

The data I'm loading is stuff like tags - e.g., <itemid>\t<tagid>. In human terms, "Dress A has a ruched collar." MapReduce can handle data like this, even when it arrives unordered.

The data I'm reading is computational results based on the loaded data - e.g., an index: <tagid>\t[<itemid1>, <itemid2>, ...] (where each itemid has been tagged with tagid). E.g., "here are all the dresses with a ruched collar."

(Actually, we do considerably more than this, and we wouldn't need Hadoop just for an index. But an index is the simplest example I could give.)
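
To make the index example concrete, here's a rough sketch in plain Python of the map and reduce steps. This is not our actual Hadoop job; the function names (map_phase, reduce_phase, build_index) and the sample ids are made up for illustration. The point is just that the input can arrive in any order and the resulting index comes out the same.

```python
from collections import defaultdict


def map_phase(lines):
    # Each input record is "<itemid>\t<tagid>"; emit (tagid, itemid) pairs.
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        itemid, tagid = line.split("\t")
        yield tagid, itemid


def reduce_phase(pairs):
    # Group item ids under their tag id. Sorting the ids makes the
    # output independent of the order the records were loaded in.
    index = defaultdict(set)
    for tagid, itemid in pairs:
        index[tagid].add(itemid)
    return {tagid: sorted(items) for tagid, items in index.items()}


def build_index(lines):
    # Hypothetical helper: run both phases over one unordered batch.
    return reduce_phase(map_phase(lines))


if __name__ == "__main__":
    # Made-up sample batch, in no particular order.
    batch = [
        "dressC\truched_collar",
        "dressA\truched_collar",
        "dressB\tsleeveless",
    ]
    for tagid, items in sorted(build_index(batch).items()):
        print(tagid + "\t" + ", ".join(items))
```

A real job would shuffle the (tagid, itemid) pairs across machines between the two phases, but the shape of the computation is the same.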

The original data is very boring. It's only after aggregation and calculation that it becomes worth reading.