by evilturnip
1461 days ago
We're currently looking into data lake implementations. Right now, we only have 1 or 2 data sources. Our current thinking is to read them on the fly, combine them into a pandas DataFrame, and query that. Does anyone have experience doing something similar?
At a minimum, I'd suggest planning to load the data from the data lake into an RDBMS (preferably an OLAP/columnar one). Then it's accessible to more than just Python scripts: BI tools, users of other languages, etc.
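To make the "load it into an RDBMS" step concrete, here is a minimal sketch. It uses SQLite from the Python standard library purely as a stand-in so the example runs anywhere; SQLite is row-oriented, not the OLAP/columnar engine suggested above. The two CSV extracts, the table names, and the helper functions are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw extracts standing in for two data lake sources.
ORDERS_CSV = "order_id,customer_id,amount\n1,10,25.50\n2,11,40.00\n3,10,12.25\n"
CUSTOMERS_CSV = "customer_id,region\n10,EU\n11,US\n"


def load_csv(conn, table, text, columns):
    """Create `table` and bulk-insert the rows parsed from the CSV `text`.

    `columns` is a list of (name, sql_type) pairs; SQLite's column affinity
    converts the CSV strings to INTEGER/REAL on insert.
    """
    col_defs = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    conn.execute(f"CREATE TABLE {table} ({col_defs})")
    reader = csv.DictReader(io.StringIO(text))
    rows = [tuple(row[name] for name, _ in columns) for row in reader]
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)


def build_warehouse():
    """Load both sources into one in-memory database and return the connection."""
    conn = sqlite3.connect(":memory:")
    load_csv(conn, "orders", ORDERS_CSV,
             [("order_id", "INTEGER"), ("customer_id", "INTEGER"), ("amount", "REAL")])
    load_csv(conn, "customers", CUSTOMERS_CSV,
             [("customer_id", "INTEGER"), ("region", "TEXT")])
    conn.commit()
    return conn


def revenue_by_region(conn):
    """Plain SQL that any client (BI tool, other language) could run as well."""
    return conn.execute(
        "SELECT c.region, SUM(o.amount) "
        "FROM orders o JOIN customers c USING (customer_id) "
        "GROUP BY c.region ORDER BY c.region"
    ).fetchall()
```

The point of the sketch is the last function: once the data sits behind SQL, the join no longer has to live inside a Python script.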
Depending on how much data there is, you should also plan a data summarization strategy. You have three options: build some common rollups to ensure that consumers are all looking at the same summaries; let consumers build their own transform/load pipelines from the raw data lake; or let consumers build their own transform pipelines from the data in the data warehouse (using something like dbt).
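As a rough illustration of the "common rollups" option, the sketch below defines one shared summary as a SQL view so every consumer reads the same numbers. It again uses SQLite only as a convenient stand-in, and the `events` table, its columns, and the `daily_totals` view are made-up names. A tool like dbt manages this kind of SQL model for you; this is not dbt's API.

```python
import sqlite3


def build_with_rollup():
    """Create a hypothetical raw table plus one shared rollup view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE events (day TEXT, source TEXT, value REAL);
        INSERT INTO events VALUES
            ('2024-01-01', 'a', 1.0),
            ('2024-01-01', 'b', 2.0),
            ('2024-01-02', 'a', 4.0);
        -- Shared rollup: consumers query this instead of re-deriving totals.
        CREATE VIEW daily_totals AS
            SELECT day, COUNT(*) AS n_events, SUM(value) AS total
            FROM events
            GROUP BY day;
    """)
    return conn


def daily_totals(conn):
    """Read the shared summary, ordered by day."""
    return conn.execute(
        "SELECT day, n_events, total FROM daily_totals ORDER BY day"
    ).fetchall()
```

With a lot of data you would more likely materialize the rollup as a table on a schedule rather than use a view, but the idea of one agreed-upon definition is the same.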
The benefits of a data lake architecture really appear when you have lots of sources, lots of disparate consumers, and lots of data, with some schema evolution and unstructured parts thrown in. If you only have 1 or 2 sources, data small enough to query raw in pandas, and consumers restricted to Python scripts, then you can skip a lot of the architectural headache of building a data lake for now (just make sure to archive your raw data somewhere if you want to be able to pull it into a data lake in the future).
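For the "archive your raw data somewhere" part, one simple approach is to write each extract unchanged to a dated path before doing anything else with it. The sketch below is only one way to do that: the `source/YYYY/MM/DD` layout, the file name, and both function names are assumptions, and a local directory stands in for whatever object store you would really use.

```python
import gzip
from datetime import date
from pathlib import Path


def archive_raw(payload: bytes, source: str, day: date, root: Path,
                name: str = "extract.gz") -> Path:
    """Store `payload` byte-for-byte (gzip-compressed) under root/source/YYYY/MM/DD/."""
    target_dir = root / source / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    if target.exists():
        # Treat the archive as append-only: never overwrite an earlier extract.
        raise FileExistsError(target)
    with gzip.open(target, "wb") as fh:
        fh.write(payload)
    return target


def read_raw(path: Path) -> bytes:
    """Read an archived extract back exactly as it was received."""
    with gzip.open(path, "rb") as fh:
        return fh.read()
```

Keeping the bytes untouched means a future data lake can re-ingest them with whatever schema it settles on.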