by sradman 2143 days ago

> It seems to me like a time series or sorting problem. The challenge was inserting meta data and then sorting.

The core problem, as I see it, is missing asset identifiers. The timestamp of each asset in the Kafka queue acts as the ID.

This is an ETL and data cleansing task. My first step would have been to use a Content Addressable Storage technique, like Git's object hashes, to assign a unique identifier to each asset. That also solves the de-duplication task, since identical content always hashes to the same ID. Extracting content metadata, such as the true publication date, and inserting it into a structured data store then follows.
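A rough sketch of what I mean by content-addressed IDs, assuming each asset is available as raw bytes. The `git_blob_id` and `dedupe` helpers and the sample payloads are made up for illustration; only the blob-hash scheme (SHA-1 over a `blob <size>\0` header plus the content) is Git's own.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 ID Git would assign to `content` stored as a blob."""
    # Git hashes a small header ("blob <size>" plus a NUL byte) followed by the content.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def dedupe(assets):
    """Map content ID -> first-seen bytes; repeated content collapses onto one key."""
    store = {}
    for payload in assets:
        store.setdefault(git_blob_id(payload), payload)
    return store


# Hypothetical payloads standing in for assets read off the queue.
incoming = [b"article A", b"article B", b"article A"]
store = dedupe(incoming)
print(len(store))
```

The ID depends only on the bytes, so it stays stable no matter when or how many times the asset shows up in the queue, which is exactly what a timestamp cannot give you.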

Kafka should have been one stage of the ETL pipeline, not the structured data store itself.