| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by evilturnip 1461 days ago
	We're currently looking into datalake implementations. Right now, we only have 1 or 2 data sources. Current thinking is reading them on the fly, combine them using pandas dataframe and query that. Anyone have experience with doing something similar?

4 comments

mason55 1461 days ago

Depends on the type of data you're processing, your business goals and the existing consumers that you need to support.

At a minimum I'd suggest planning to load the data from the data lake into an RDBMS (OLAP/columnar preferably). Then it's accessible to more than just Python scripts (BI tools, users of other languages, etc).

Depending on how much data there is, should also plan on data summarization strategies. You can either build some common rollups to ensure that consumers are all looking at the same summaries or you can let consumers build their own transform/load pipelines from the raw data lake or you can let consumers build their own transform pipelines from the data in the data warehouse (using something like dbt).

The benefits of a data lake architecture really appear when you have lots of sources, lots of disparate consumers, and lots of data, with some schema evolution & unstructured parts thrown in. If you only have 1 or 2 sources, small enough data to query raw data in Pandas, and consumers are restricted to Python scripts, then you can skip a lot of the architectural headache of building a data lake for now (just make sure to archive your raw data somewhere if you want to be able to pull it into a data lake in the future).

link

alexmerced 1461 days ago

This presentation may help: https://www.youtube.com/watch?v=LiC9vZATv0o&t=8s

link

eatonphil 1461 days ago

You can do this for small datasets and I built an app around exactly this use case [0].

As your data gets bigger (just over time even if individual days don't emit more data) you'll end up building out partitioning schemes yourself or you'll move to a system that does that for you like bigquery/snowflake/etc.

[0] https://github.com/multiprocessio/datastation

link

tomnipotent 1461 days ago

It can seem like overkill, but I can't recommend Snowflake enough. It's so simple to setup and manage, and JSON support makes it easy to just drop JSON line files into a table and query on the fly.

Depending on your data/query volume, it can also be very cheap.

link

chrisjc 1461 days ago

Also recommend Snowflake!

However, this article was about table formats so maybe OP wasn't thinking about JSON. If so, looks like Snowflake only supports read operations atm. Snow Summit this week, so maybe that will change?

link

I_Love_Data_22 1461 days ago

When I think of SNOW, cheap doesn't come to mind.

link

tomnipotent 1458 days ago

At $2/credit, I can't imagine the OP's workload will cost more than a few coffees a month.

link