

by sreekanthr 2615 days ago

I cannot answer your question without a full understanding of how your data infrastructure is currently used.

A few pointers:

- Who are the users of the platform? If it is only used by the data science team, you can rip the solution apart and work towards a more logical infrastructure where all you do is cleanse, normalize and derive features. These become your central feature repository, which your team can pull from to build models. You need governance so that the team is aligned on which features exist and how new features are added to the repository. At a scale of 10 people it is much easier to keep all of this centralized; if the team is scaling out, you will have to work out a decentralization strategy.

- If you have operational reports, such as business reporting and investor reporting, running on this infrastructure, I would recommend keeping the analytics workload separate from the operational workload. They have different needs and SLAs.
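To make the "central feature repository with governance" idea from the first pointer a bit more concrete, here is a minimal in-memory sketch. Everything in it (`FeatureDefinition`, `FeatureRegistry`, the `avg_basket_value` feature and its input fields) is a hypothetical name chosen for illustration, not the API of any existing tool. The only governance rules it enforces are that a feature name is registered once and comes with a description.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FeatureDefinition:
    name: str
    owner: str
    description: str
    compute: Callable[[dict], float]  # maps one cleaned record to a feature value


class FeatureRegistry:
    """In-memory stand-in for a shared feature repository (hypothetical)."""

    def __init__(self) -> None:
        self._features: Dict[str, FeatureDefinition] = {}

    def register(self, feature: FeatureDefinition) -> None:
        # Governance rule: a name is registered only once and must be documented.
        if feature.name in self._features:
            raise ValueError(f"feature {feature.name!r} already exists")
        if not feature.description.strip():
            raise ValueError("a feature needs a description")
        self._features[feature.name] = feature

    def names(self) -> List[str]:
        return sorted(self._features)

    def build_row(self, record: dict, names: List[str]) -> Dict[str, float]:
        # Every project pulls the same definitions instead of re-implementing them.
        return {n: self._features[n].compute(record) for n in names}


registry = FeatureRegistry()
registry.register(FeatureDefinition(
    name="avg_basket_value",
    owner="team-a",
    description="Total spend divided by number of transactions.",
    compute=lambda r: r["total_spend"] / r["n_transactions"],
))
row = registry.build_row(
    {"total_spend": 120.0, "n_transactions": 4}, ["avg_basket_value"]
)
```

In a real setup the definitions would live in one shared, reviewed repository and the computed values in shared storage, so that adding a feature goes through a pull request rather than a private notebook.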

One thing that worries me is that you talk about denormalization as something you are planning to do. It should have been the starting point of any HDFS/Spark/Parquet-based solution.
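As a rough illustration of what denormalizing up front means, the sketch below joins each transaction with its client and product once, producing one wide record per transaction. It uses plain Python dictionaries so that it runs anywhere; on a Spark/Parquet stack the same join would be a batch job writing a wide table. The table and column names (`client_id`, `product_id`, `segment`, `category`) are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator


def denormalize(
    transactions: Iterable[dict],
    clients: Dict[int, dict],
    products: Dict[int, dict],
) -> Iterator[dict]:
    """Join each transaction with its client and product into one flat record."""
    for tx in transactions:
        client = clients.get(tx["client_id"], {})
        product = products.get(tx["product_id"], {})
        flat = dict(tx)
        # Prefix joined columns so their origin stays obvious in the wide table.
        flat.update({f"client_{k}": v for k, v in client.items()})
        flat.update({f"product_{k}": v for k, v in product.items()})
        yield flat


clients = {1: {"segment": "retail"}}
products = {10: {"category": "shoes"}}
transactions = [{"tx_id": 100, "client_id": 1, "product_id": 10, "amount": 59.9}]
wide = list(denormalize(transactions, clients, products))
```

Downstream feature code then reads the wide records directly and never has to repeat the joins.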

I can suggest tools for exploration, data quality checks, etc., but that requires a better understanding of what your current infrastructure actually solves versus what it was intended to solve.

1 comments

remilouf 2615 days ago

Thank you for taking the time to answer thoroughly!

> If it is only used by data science team then you can rip apart the solution and work towards a more logical infrastructure where all you are doing is cleansing, normalizing and deriving features and these become your central feature repository which your team can pull and build models.

It is only used by data scientists. What do you mean by a feature repository? How would you organize it so people can push new features? This sounds very interesting.

> you are talking about denormalization as something you are planning to do, that should have been the starting point of any HDFS/SPARK/Parquet based solution.

It is something that we have to do, but the tables have been dumped as-is in S3 and every project regularly rebuilds the whole derived dataset. Since these operations are very brittle (a lot of manual work, and even transformations performed in notebooks), this is something people dread doing. I am trying to make this more robust at the moment by writing Makefiles that remove human intervention, but at the end of the day I would like to avoid having people spend hours waiting for new data when they need it.
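The Makefile approach boils down to one rule: rebuild a derived dataset only when it is missing or older than its inputs. A minimal Python sketch of that rule is below. It assumes the inputs and outputs are local files compared by modification time; for data sitting in S3 one would compare object timestamps instead. The function names `is_stale` and `build_if_stale` are made up for this example.

```python
import os
from typing import Callable, Sequence


def is_stale(target: str, sources: Sequence[str]) -> bool:
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def build_if_stale(
    target: str, sources: Sequence[str], build: Callable[[], None]
) -> bool:
    """Run build() only when the target is stale; return whether it ran."""
    if is_stale(target, sources):
        build()
        return True
    return False
```

With a rule like this, a weekly data drop triggers only the steps whose inputs actually changed, instead of every project rebuilding everything.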

> I can suggest tools for explorations, data quality check etc.

I would appreciate it. Put simply, we get data about the evolution of the stock of clients, transactions with their clients, product descriptions, etc., and it is dumped into S3 (I scheduled a chat with people upstream to see what happens). We have 3 or 4 projects for each client. What currently happens is that every team writes the same code to build features in its own repository, and this code is re-executed every time new data arrives (weekly). These features are then used in prediction models.

Besides the brittleness of the process, I found that people are reluctant to analyse the data because it takes an unreasonable amount of time.

sreekanthr 2611 days ago

For your problem, I would suggest taking a look at StreamSets. They have an ETL plus data drift system in place, which is really interesting.

Ref: https://streamsets.com/
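To give an idea of what a data drift check looks like in principle, here is a small standalone sketch that compares an incoming weekly dump against an expected schema and reports missing columns, unexpected columns, and type mismatches. It is not StreamSets code and does not use its API. The `schema_drift` function and the column names are assumptions made for illustration.

```python
from typing import Dict, List


def schema_drift(expected: Dict[str, type], rows: List[dict]) -> Dict[str, List[str]]:
    """Compare incoming rows against an expected column -> type mapping."""
    seen = set()
    wrong_type = set()
    for row in rows:
        for column, value in row.items():
            seen.add(column)
            expected_type = expected.get(column)
            # Only check types for known columns; None is treated as a missing value.
            if (
                expected_type is not None
                and value is not None
                and not isinstance(value, expected_type)
            ):
                wrong_type.add(column)
    return {
        "missing": sorted(set(expected) - seen),
        "unexpected": sorted(seen - set(expected)),
        "wrong_type": sorted(wrong_type),
    }


expected = {"client_id": int, "amount": float, "product_id": int}
incoming = [{"client_id": 1, "amount": "12.5", "sku": "A-1"}]
report = schema_drift(expected, incoming)
```

Running a check like this before the feature build means an upstream change fails loudly at ingestion rather than silently corrupting the derived datasets.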

>Besides the brittleness of the process, I found that people are reluctant to analyse the data because it takes an unreasonable amount of time.

Is this because of bad queries, or because of the way the data is organized?

>It is only used by data scientists. What do you mean by a feature repository? How would you organize it so people can push new features? This sounds very interesting.

Can you take a look at Feast by Go-Jek (https://github.com/gojek/feast)? There are similar projects from other big players in the market; this should get you started on the idea I was talking about.

PS: Sorry for the delay in answering your question; I was traveling.