Each batched load has no ordering. But the data I'm loading is not the same as the data I'm reading.

The data I'm loading is stuff like tags, e.g., <itemid>\t<tagid>. In human terms, "Dress A has a ruched collar." MapReduce can handle data like this, even when it arrives unordered.

The data I'm reading is computational results based on the loaded data - e.g., an index: <tagid>\t[<itemid1>, <itemid2>, ...] (where each itemid has been tagged with tagid). E.g., "here are all the dresses with a ruched collar."

(Actually, we do considerably more than this, and we wouldn't need Hadoop just for an index. But an index is the simplest example I could give.)
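
To make the index example concrete, here's a rough sketch in plain Python rather than Hadoop. The function names and sample rows are made up for illustration, and this is not our actual job; it only shows why the order of the loaded rows doesn't matter once you group by tag.

```python
from collections import defaultdict

def map_phase(lines):
    """Turn each '<itemid>\\t<tagid>' row into a (tagid, itemid) pair."""
    for line in lines:
        item_id, tag_id = line.rstrip("\n").split("\t")
        yield tag_id, item_id

def reduce_phase(pairs):
    """Group item ids under their tag id, i.e. build the index."""
    index = defaultdict(list)
    for tag_id, item_id in pairs:
        index[tag_id].append(item_id)
    # Sort each list so the result is the same whatever order rows arrived in.
    return {tag_id: sorted(items) for tag_id, items in index.items()}

def build_index(lines):
    return reduce_phase(map_phase(lines))

# Hypothetical batch of loaded rows, in no particular order.
batch = [
    "dressA\truched_collar",
    "dressB\tsleeveless",
    "dressC\truched_collar",
]
index = build_index(batch)
```

Here index["ruched_collar"] is the list of dresses with a ruched collar, and shuffling batch gives the same index.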

The original data is very boring. It's only after aggregation and calculation that it becomes worth reading.