by debarshri 2756 days ago

> "Enough blaming the former engineer."

I was one of them. I don't work there anymore.

I believe this is not the actual situation at bol.com. If it is, I would be disappointed.

Last I remember, Bol.com had a really good set of ops and dev tooling on Hadoop, HBase, Spark, Flink, etc. for scheduling and running jobs.

I don't know why they replicated data on both HBase and Elasticsearch. Having read the blog post, I don't see how this fits the event sourcing pattern that bol.com was trying to implement, nor the idea of self-service BI that they envisioned.

2 comments

kn7 2756 days ago

Hey Debarsh! First, thanks for taking the time to read such a lengthy post.

If I am not mistaken the majority of the PL/SQL glue is owned by Gert, though you might recall better. Quite some VCS history was lost while migrating from SVN to Git. ;-)

The reason we are "replicating" the entire data is to 1) determine the affected products and 2) re-execute the relevant configurations (facets, synonyms, etc.) when making retroactive changes. (For instance, say someone has changed the PL/SQL of the "leeftijd" facet.) Here, the storage must allow querying on every field, for (1), and on id, for (2). While id-based bulk querying is supported by (almost) every ETL source, querying on every field is not. Hence, we "replicate" the sources on our side to meet these needs. Actually, the entire point of the post was to explain this problem, but apparently it was not clear enough.
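To make the two access patterns concrete, here is a minimal in-memory sketch in Python. It is only an illustration under stated assumptions, not the actual bol.com implementation: `ReplicaStore`, `ids_where`, `get_bulk`, and the example field names are all hypothetical, and a real system would use a proper datastore rather than a dict.

```python
from collections import defaultdict


class ReplicaStore:
    """Toy stand-in for a replicated ETL source (hypothetical, in-memory)."""

    def __init__(self):
        self._docs = {}                 # id -> document
        self._index = defaultdict(set)  # (field, value) -> set of ids

    def put(self, doc_id, doc):
        # Drop stale index entries before overwriting an existing document.
        old = self._docs.get(doc_id)
        if old is not None:
            for field, value in old.items():
                self._index[(field, value)].discard(doc_id)
        self._docs[doc_id] = dict(doc)
        # Index every field; values are assumed to be hashable.
        for field, value in doc.items():
            self._index[(field, value)].add(doc_id)

    def ids_where(self, field, value):
        """Need (1): find affected products by querying on any field."""
        return set(self._index.get((field, value), ()))

    def get_bulk(self, ids):
        """Need (2): id-based bulk fetch, e.g. to re-run a facet on them."""
        return {i: self._docs[i] for i in ids if i in self._docs}
```

A retroactive change would then first call `ids_where(...)` to find the affected products and pass the result to `get_bulk(...)` to re-execute the changed configuration on just those documents.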

For your remarks on event sourcing and BI, I am a little bit puzzled. I will need some elaboration on these remarks. We do have event sourcing on our side (that is how we can replay in case of need) and BI is not really interested in ETL data. Maybe I misunderstood you?

I am also confused by how you relate scheduling/running PL/SQL jobs via Hadoop, Spark, Flink, etc. Did you see the link to Redwood Explorer I shared in the post?

barbecue_sauce 2756 days ago

I am not Debarsh, and I am not a data engineer, but isn't the purpose of ETL to transform data into a more accessible/palatable form for BI?

kn7 2756 days ago

Bol has plenty of other ETL pipelines for BI. What I meant is that the data cooked for search is not of (much) interest to BI, yet. Though we do have other means to feed BI with search-relevant content.

debarshri 2756 days ago

In all fairness, you are right about the Oracle stuff being ingrained in bol.com. I am not sure if I should go into detail, but the whole thing used to work like this: maintain event states in a "versions" table, run Hadoop and Spark jobs on them, and snapshot the latest computed state to Oracle so that BI could run on it.
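As a rough illustration of that "versions table, then snapshot the latest state" flow, here is a small Python sketch. It is a guess at the general shape only: the row layout (`entity_id`, `version`, `state`) and the function name `snapshot_latest` are assumptions, and the real jobs ran on Hadoop/Spark against Oracle rather than on in-memory lists.

```python
def snapshot_latest(versions):
    """Return {entity_id: state} holding only the highest-version state per entity.

    `versions` is an iterable of dicts with the (assumed) keys
    "entity_id", "version", and "state".
    """
    latest = {}
    for row in versions:
        current = latest.get(row["entity_id"])
        # Keep the row with the highest version number seen so far.
        if current is None or row["version"] > current["version"]:
            latest[row["entity_id"]] = row
    return {entity_id: row["state"] for entity_id, row in latest.items()}
```

The full versions history stays available for replay, while the snapshot is the compact view a BI tool would query.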

But I understand now what you actually mean. I wouldn't call it ETL, as ETL is more about prepping the data for BI and not cooking data for search.

Yeah, I remember they used to have Redwood for scheduling PL/SQL queries, but I think the majority of ETL jobs for BI were in Hadoop/Spark/Flink.

Having said all this, I think it is neat and cool engineering work. I hope you guys are successful in implementing the solution.

debarshri 2756 days ago

> BI is not really interested in ETL data

Isn't ETL an intermediary in BI? I think I am a bit confused. To give some context, this is my understanding: you have all the services generating data, and you have ETL jobs that extract data from these services, transform it, and move it into a star or snowflake schema in an RDBMS, prepared so that BI tools can query it efficiently.
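A tiny Python sketch of that understanding, assuming flat order records with hypothetical `product` and `quantity` fields: the transform step splits them into a product dimension (with surrogate keys) and a sales fact table, which is the core idea of a star schema. The function name `to_star_schema` is made up for illustration.

```python
def to_star_schema(orders):
    """Split flat order records into a product dimension and a sales fact table."""
    product_dim = {}  # product name -> surrogate key
    fact_sales = []   # one row per order, referencing the dimension by key
    for order in orders:
        # Assign the next surrogate key the first time a product is seen.
        key = product_dim.setdefault(order["product"], len(product_dim) + 1)
        fact_sales.append({"product_key": key, "quantity": order["quantity"]})
    return product_dim, fact_sales
```

In a real pipeline the two results would be loaded into separate RDBMS tables, and BI queries would join the fact table to the dimension on `product_key`.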

philippeback 2756 days ago

Any feedback on how this went? The event sourcing pattern, I mean.
