by pawnednow 2143 days ago
The author says

>>>As I reviewed the previous work and struggled to understand the queries, I felt like SQL wasn’t the right tool for the job — it was getting in the way of progress. So I paused, took a step back and looked for an alternative approach.

Not sure why the author thought SQL wasn't the right tool, as he glosses over this justification. It seems to me like a time series or sorting problem. The challenge was inserting metadata and then sorting. This relates to a problem with the query language, but the article seems to imply the problem was somehow solved with Go. Either way, my knowledge of this subject is limited and probably much shallower than the author's.

2 comments

I agree. I don't see how any of this could not be solved with SQL either. Struggling to understand the previous work and others' queries does not mean the tool is wrong for the job.
> It seems to me like a time series or sorting problem. The challenge was inserting metadata and then sorting.

The core problem, as I see it, is missing asset identifiers. The timestamp of each asset in the Kafka queue acts as the ID.

This is an ETL and data cleansing task. My first step would have been to use a Content Addressable Storage technique, like Git hashes, to assign a unique identifier to each asset, which also solves the de-duplication task. Extracting content metadata, like the true publication date, and inserting it into a structured data store then follows.
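
To make the content-addressing idea concrete, here is a minimal sketch in Python. It assumes each asset is available as raw bytes and uses SHA-256 over those bytes; this is in the spirit of Git's object hashes, not Git's exact format. The names `asset_id`, `deduplicate` and the sample payloads are hypothetical, not from the article.

```python
import hashlib


def asset_id(content: bytes) -> str:
    """Derive an identifier from the asset's bytes alone (assumed: SHA-256)."""
    return hashlib.sha256(content).hexdigest()


def deduplicate(assets):
    """Keep the first occurrence of each distinct content, keyed by its hash."""
    unique = {}
    for content in assets:
        unique.setdefault(asset_id(content), content)
    return unique


# Hypothetical payloads; the first and third are byte-for-byte identical.
stream = [b"article one", b"article two", b"article one"]
store = deduplicate(stream)
print(len(store))  # prints 2
```

Because the ID depends only on the content, the same asset arriving twice with different queue timestamps still maps to one key.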

Kafka should have been one part of the ETL pipeline, not act as the structured data store itself.
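
As a rough sketch of what the load step could look like, assuming an upstream stage has already attached a content hash and a true publication date to each record: the example below uses an in-memory SQLite table as a stand-in for the structured store. The table name, column names and sample rows are all made up for illustration.

```python
import sqlite3

# Hypothetical rows as an upstream stage might emit them:
# (content hash, true publication date, queue timestamp)
rows = [
    ("a1f3", "2019-03-02", "2019-06-10T08:00:00"),
    ("9bc4", "2019-01-15", "2019-06-10T08:00:01"),
    ("a1f3", "2019-03-02", "2019-06-11T09:30:00"),  # re-sent duplicate
]

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE assets (
           asset_id     TEXT PRIMARY KEY,
           published_at TEXT NOT NULL,
           queued_at    TEXT NOT NULL
       )"""
)
# The primary key rejects duplicates; OR IGNORE keeps the first copy.
conn.executemany("INSERT OR IGNORE INTO assets VALUES (?, ?, ?)", rows)

# Sorting by the extracted date is then an ordinary SQL query.
ordered = conn.execute(
    "SELECT asset_id FROM assets ORDER BY published_at"
).fetchall()
print(ordered)  # prints [('9bc4',), ('a1f3',)]
```

Once the rows sit in a table keyed by a real identifier, ordering and de-duplication stop depending on the queue timestamp.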