
by remilouf 2615 days ago

Thank you for taking the time to answer so thoroughly!

> If it is only used by data science team then you can rip apart the solution and work towards a more logical infrastructure where all you are doing is cleansing, normalizing and deriving features and these become your central feature repository which your team can pull and build models.

It is only used by data scientists. What do you mean by a feature repository? How would you organize it so people can push new features? This sounds very interesting.

> you are talking about denormalization as something you are planning to do, that should have been the starting point of any HDFS/SPARK/Parquet based solution.

It is something that we have to do, but the tables have been dumped as-is in S3 and every project rebuilds the whole derived dataset regularly. Since these operations are very brittle (a lot of manual work, and even transformations performed in notebooks), this is something people dread doing. I am trying to secure this at the moment by writing Makefiles that remove human intervention, but at the end of the day I would like to avoid having people spend hours waiting for new data when they need it.
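To make the "rebuild only when needed" idea concrete, here is a minimal sketch of the check a Makefile-style step could perform: skip the transformation when the input has not changed since the last run. It is an illustration under assumptions, not our actual pipeline: `build_if_stale`, the `.stamp` sidecar file, and the use of local paths instead of S3 are all hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_if_stale(source: Path, target: Path, transform) -> bool:
    """Rebuild `target` from `source` only when `source` has changed.

    Returns True if the transform ran, False if the existing output was reused.
    """
    # Hypothetical sidecar file that remembers which input produced `target`.
    stamp = target.parent / (target.name + ".stamp")
    digest = file_digest(source)
    if target.exists() and stamp.exists() and stamp.read_text() == digest:
        return False  # input unchanged, nothing to recompute
    target.write_text(transform(source.read_text()))
    stamp.write_text(digest)
    return True
```

Unlike Make's timestamp comparison, this compares content hashes, so re-downloading an identical dump does not trigger a rebuild.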

> I can suggest tools for explorations, data quality check etc.

I would appreciate it. Put simply, we get data about the evolution of the stock of clients, transactions with their clients, product descriptions, etc. that is dumped into S3 (I scheduled a chat with people upstream to see what happens). We have 3 or 4 projects for each client. What currently happens is that every team writes the same code to build features in their separate repositories, and this code is re-executed every time new data arrives (weekly). These features are then used in prediction models.

Besides the brittleness of the process, I found that people are reluctant to analyse the data because it takes an unreasonable amount of time.

1 comment

sreekanthr 2611 days ago

For your problem, I would suggest taking a look at StreamSets. They have an ETL plus data drift system in place, which is really interesting.

Ref: https://streamsets.com/

> Besides the brittleness of the process, I found that people are reluctant to analyse the data because it takes an unreasonable amount of time.

Is this because of bad queries or because of the way the data is organized?

> It is only used by data scientists. What do you mean by a feature repository? How would you organize it so people can push new features? This sounds very interesting.

Take a look at Feast by Go-Jek (https://github.com/gojek/feast). There are similar projects from other big players in the market; this should get you started on the idea I was talking about.
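To give a rough feel for the feature repository idea, independent of any particular tool, here is a minimal sketch in plain Python. It is not Feast's API: the `feature` decorator, the `FEATURES` registry, and the column names (`revenue`, `n_transactions`) are hypothetical and only illustrate how feature definitions could live in one shared place that every project pulls from.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]
FeatureFn = Callable[[Row], float]

# Central registry: one shared definition per feature name (hypothetical).
FEATURES: Dict[str, FeatureFn] = {}


def feature(name: str):
    """Register a function as the single definition of feature `name`."""
    def decorator(fn: FeatureFn) -> FeatureFn:
        if name in FEATURES:
            raise ValueError(f"feature {name!r} is already registered")
        FEATURES[name] = fn
        return fn
    return decorator


@feature("avg_basket")
def avg_basket(row: Row) -> float:
    """Average revenue per transaction; 0.0 when there are no transactions."""
    n = row["n_transactions"]
    return row["revenue"] / n if n else 0.0


def build_features(rows: List[Row], names: List[str]) -> List[Row]:
    """Compute the requested registered features for every input row."""
    return [{name: FEATURES[name](row) for name in names} for row in rows]
```

"Pushing" a new feature then means adding one decorated function to the shared module, and each project asks for features by name instead of rewriting the derivation.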

PS: Sorry for the delay in answering your question; I was traveling.