by bigger_cheese 732 days ago
I work in manufacturing (large industrial plant) and the data processes we have are honestly not great. Mostly that is because there are a heap of legacy systems and not a lot of commonality between our data sources: we have a hideous mashup of Oracle, DB2, Microsoft SQL Server etc., and different versions of each database. There's also more bespoke industry stuff like time series historians and SCADA systems/PLCs (ABB, Citect etc.) to complicate the process.

From my experience SQL is basically the lowest common denominator that everything speaks, and even then the dialects differ: Oracle SQL is subtly different to Microsoft SQL, for example, and the differences are just big enough to introduce frustrations.
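
To make the dialect point concrete, here is a minimal sketch of one well-known difference: how each engine limits a result set. The function name, table name, and column name are hypothetical and chosen only for illustration. The Oracle form assumes version 12c or later.

```python
# Illustrative only: the same "latest N rows" query rendered per dialect.
# The helper name and the table/column names used with it are hypothetical.

def latest_rows_sql(dialect: str, table: str, ts_col: str, n: int) -> str:
    """Return a 'latest n rows' query string for the given dialect."""
    if dialect == "mssql":
        # SQL Server limits rows with TOP, placed right after SELECT.
        return f"SELECT TOP {n} * FROM {table} ORDER BY {ts_col} DESC"
    if dialect in ("oracle", "db2"):
        # Oracle (12c and later) and DB2 use a trailing FETCH FIRST clause.
        return (
            f"SELECT * FROM {table} ORDER BY {ts_col} DESC "
            f"FETCH FIRST {n} ROWS ONLY"
        )
    raise ValueError(f"unsupported dialect: {dialect}")
```

Calling `latest_rows_sql("mssql", "sensor_readings", "read_at", 10)` and the same with `"oracle"` gives two queries that ask for the same thing but are not interchangeable, which is the kind of friction described above.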

There has been a movement in the last couple of years to hoist everything into a common "datalake", but my understanding is that ingestion into this lake is not a simple process by any means: it requires batch processes that need demanding compute resources and are slow (i.e. they take many hours and run overnight).

5 comments

> [some process] is not a simple process by any means and requires batch processes that need demanding compute resources and is slow (i.e. takes many hours and runs over night).

Sounds like an ideal fit for on-prem/co-located systems. The big problem with on-prem is the egress costs from wherever all your data resides.

With on-prem, doubling your hardware doesn't double your ops expenses, so if you already have a server room it makes sense to fill it to capacity.

I have no experience in the manufacturing domain but it fascinates me as a data engineer. I do have experience building data lakes at scale with sub-day (microbatch/“realtime”) latency and with disparate sources. I don’t think this needs to be as complicated or painful as you expect but I don’t know enough about your data or needs to be sure. If you want to discuss specifics send me an email at the domain in my profile, I’d love to know more.

I just started using sqlglot to convert Microsoft SQL Server code to Databricks SQL, and it has been able to automate 80% of the translation (assuming it's just a select statement). You might take a look.

https://github.com/tobymao/sqlglot
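
For anyone curious what that looks like, a minimal sketch using sqlglot's `transpile` function with the `tsql` and `databricks` dialect names. The helper name and the example query (table and column names) are made up for illustration, and this is not necessarily how the translation above was set up.

```python
# Hypothetical example query; the table and column names are invented.
EXAMPLE_TSQL = (
    "SELECT TOP 10 tag_name, GETDATE() AS pulled_at "
    "FROM dbo.sensor_readings"
)


def tsql_to_databricks(sql: str) -> list[str]:
    """Transpile T-SQL text to Databricks SQL, one string per statement."""
    # Imported here so the module loads even where sqlglot is not installed
    # (third-party package: pip install sqlglot).
    import sqlglot

    # read= is the source dialect, write= is the target dialect.
    return sqlglot.transpile(sql, read="tsql", write="databricks")
```

Running `tsql_to_databricks(EXAMPLE_TSQL)` returns a list with one translated statement. It is worth diffing the output against the original by hand, since anything beyond plain selects may not translate cleanly.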

Are you trying to consume historical or real-time data? In my experience this greatly influences the approach.

Node-RED is a common ETL approach in the scenario you described, but I find it too limiting beyond basic examples.

You may be interested in semantic web technologies as a means of modelling your different data sources and how they relate.