by towelpluswater 2302 days ago
Completely agree with you. Data lakes were marketed well because, well... data warehousing is hard, and a lot of work. Data lakes don't make that hard work disappear; they just change how and where it happens.

I've found data lakes complement DWs (in databases) well. Keep the raw data in the lake and query it as needed for discovery, then load it into structured tables as business needs arise.
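
To make that split concrete, here is a minimal sketch using only the Python standard library. The raw events, the `sales` table, and all function names are made up for illustration; an in-memory SQLite database stands in for the warehouse and a string of JSON lines stands in for files in the lake.

```python
import json
import sqlite3

# Hypothetical raw events as they might sit in the lake: one JSON document per line.
RAW_EVENTS = """\
{"customer": "a", "amount": 10.0, "ts": "2019-01-01"}
{"customer": "b", "amount": 5.5, "ts": "2019-01-02", "coupon": "X1"}
{"customer": "a", "amount": 2.5, "ts": "2019-01-03"}
"""


def discover(raw):
    """Ad hoc discovery: parse the raw lines on read, with no schema defined up front."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def load_to_warehouse(records, conn):
    """Once the business need is clear, load only the agreed columns into a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, ts TEXT)"
    )
    conn.executemany(
        "INSERT INTO sales (customer, amount, ts) VALUES (?, ?, ?)",
        [(r["customer"], r["amount"], r["ts"]) for r in records],
    )
    conn.commit()


def total_by_customer(conn):
    """A structured query that is only possible after the load step."""
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
    )
    return dict(rows.fetchall())
```

The extra `coupon` field stays in the raw data and is simply not loaded until someone needs it, which is the point of keeping the lake around.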

Data lakes alone are doomed to be failures.


I don't think anyone ever suggested that. The use case for a data lake is precisely the one you describe: it lets you start collecting data without doing a lot of work ahead of time, before you know how you actually want to structure things. It allows for schema evolution too. It's not a panacea; it's just a way to avoid the inertia most large data projects have.
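
A rough sketch of what "schema evolution" can look like when the schema is applied on read, assuming the raw data is JSON lines. The records and the function name are hypothetical; real lake formats handle this differently, and this only shows the idea.

```python
import json

# Hypothetical raw records; the newer one carries a field the older one lacks.
OLD = '{"id": 1, "name": "widget"}'
NEW = '{"id": 2, "name": "gadget", "color": "red"}'


def read_with_evolving_schema(lines):
    """Parse JSON lines and align every record to the union of all fields seen."""
    records = [json.loads(line) for line in lines]
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)  # keep first-seen order
    # Records written before a field existed get None for that field.
    rows = [{col: record.get(col) for col in columns} for record in records]
    return columns, rows
```

Nothing had to be migrated when `color` showed up; older records just read back with a missing value.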
Nobody here suggested it; it's just something I see organizations doing quite often.

(edit: the rationale behind this tends to be that you can avoid the heavy lifting of ETL/transformation logic by just using a data lake - obviously not the case, as most of us know)

I've worked on nearly a dozen Data Lakes. I have never seen nor heard of anyone who said that Data Lakes meant you could avoid ETL. If anything it has necessitated more of it as users expect to join these disparate data sets.

There is, after all, a reason that the role of Data Engineer became popular just as Data Lakes became popular.

Just means we have different anecdotal experience, then. Very little of mine has been in the tech industry.
No. Data lakes were marketed well because they are significantly cheaper and solve long-standing problems.

S3 is basically free and has unlimited scalability. Oracle, DB2, HANA, SQL Server etc are ridiculously expensive and struggle under high concurrent load even with QoS in place.

S3 != a data lake.

If you're able to solve the problems you were previously using Oracle or SQL Server for with S3, more power to you. The truth, though, is that to replicate the functionality of that old Oracle server you'll start with S3 and then also want some querying (Aurora? RDS? HBase?), probably some analytics and ingestion (Redshift? Kinesis? Elastic? Hive? Oozie? Airflow?), along with some security now that you've got multiple tools interacting (Ranger? Knox?), probably some load balancing (Zookeeper?), maybe some lineage and data cataloging (Atlas?), etc.

In my experience, what starts with "Just throw some data in S3, forget that old crusty expensive server!" ends with 22 technologies trying to coexist because each one provides a small but necessary slice of your platform. Your organization will never find one person who is an expert in all of them (by contrast, you can find an Oracle, DB2, or SQL Server expert for half the money), so you end up with seven folks who are each an expert in three of the 22 pieces you've cobbled together. They all have slightly different ideas about how things should work together, so after a year's worth of work you have a barely functioning platform, all because you didn't want to just start with a $400k license from Oracle.

Not sure what you are talking about.

If you have S3 you can use Athena, Redshift Spectrum or Spark as the query layer. It's not 22 technologies.

You don't need Elasticsearch, Ranger, Knox, Zookeeper etc., as they have nothing to do with querying.

But then it's far from basically free. Even overpriced Oracle databases can end up cheaper than locking into AWS in these cases (my experience).
I think the presumption that's differing here is query workload.

An OLAP database is, in the default case, an always-online instance or cluster, costing fixed monthly OpEx.

Whereas, if your goal in having that database is to do one query once a month based on a huge amount of data, then it will certainly be cheaper to have an analytical pipeline that is "offline" except when that query is running, with only the OLTP stage (something ingesting into S3; maybe even customers writing directly to your S3 bucket at their own Requester-Pays expense) online.
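
Back-of-the-envelope version of that trade-off. Every number used with these functions is made up for illustration and is not a real AWS or Oracle price; the point is only the shape of the comparison, fixed monthly cost versus cost that scales with how much you query.

```python
def monthly_cost_always_on(hourly_rate, hours_in_month=730.0):
    """Fixed OpEx: the cluster is billed whether or not anyone queries it."""
    return hourly_rate * hours_in_month


def monthly_cost_on_demand(queries, tb_scanned_per_query, price_per_tb, storage_cost):
    """Pay for storage all month, and for compute only when a query runs."""
    return storage_cost + queries * tb_scanned_per_query * price_per_tb


def break_even_queries(hourly_rate, tb_scanned_per_query, price_per_tb, storage_cost):
    """Queries per month at which the two models cost the same."""
    fixed = monthly_cost_always_on(hourly_rate)
    per_query = tb_scanned_per_query * price_per_tb
    return (fixed - storage_cost) / per_query
```

With hypothetical inputs of 2.0 per hour for the always-on cluster, 10 TB scanned per query at 5.0 per TB, and 100 per month for storage, one query a month costs 150 on demand versus 1460 always-on, and the two only meet at around 27 such queries a month. Different workloads will move that crossover a lot, which is probably why people's experiences differ.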

My biggest problem with Oracle is not the database itself. There is no doubt that Oracle is a fine piece of software, is bulletproof, and has decades of experience built into it.

My problem is the scalability and elasticity of its licensing model. It doesn't meet the needs of today's analytics without spending enormous amounts of money up front.

Nope. One can easily start with Airflow + Spark (EMR) + Presto + S3 and get about 80% of what you'd get from your run-of-the-mill Oracle database, at a fraction of the price and without half the headache in procurement, licensing or performance tweaking. And with better scalability.

You'd be looking at $M in licenses for anything half-serious based on Oracle tech. Becoming good at replacing Oracle stuff has probably been one of the best-paying jobs for a while.

They _appear_ to solve a bunch of problems by simply punting them down the road into downstream applications.

None of the databases you listed there are OLAP databases.

ClickHouse, TiDB, Redshift, Snowflake, etc. are significantly more suitable and should be the target of comparison here.

S3 is just storage. It doesn't provide any querying, crawling, metadata, provenance, or other details required for data at scale.

That's why AWS has a whole suite of products (Athena, Redshift Spectrum, Lake Formation, Glue, etc.) to help companies actually do something with the files stored in S3. And it's often a mess compared to just fixing their processes and ingesting the data properly into a SQL data warehouse first.