

by MSM 2350 days ago

S3 != a data lake.

If you're able to solve the problems that you were previously using Oracle or SQL Server for with S3, more power to you. But the truth is that to replicate the functionality of that old Oracle server you'll start with S3, and then you'll also want some querying (Aurora? RDS? HBase?), probably some analytics and ingestion (Redshift? Kinesis? Elastic? Hive? Oozie? Airflow?), along with some security now that you've got multiple tools interacting (Ranger? Knox?), probably some load balancing (ZooKeeper?), maybe some lineage and data cataloging (Atlas?), etc.

In my experience, what starts with "Just throw some data in S3, forget that old crusty expensive server!" ends with 22 technologies trying to coexist, because each one provides a small but necessary slice of your platform. Your organization will never be able to find one person who is an expert in all of these (by contrast, you can find an Oracle, DB2, or SQL Server expert for half the money). So you end up with seven folks who are each an expert in three of the 22 pieces you've cobbled together, but who all have slightly different ideas about how things should work together. After a year's worth of work you have a barely functioning platform, all because you didn't want to just start with a $400k license from Oracle.


threeseed 2349 days ago

Not sure what you are talking about.

If you have S3, you can use Athena, Redshift Spectrum or Spark as the query layer. It's not 22 technologies.

You don't need Elasticsearch, Ranger, Knox, ZooKeeper, etc., as they have nothing to do with querying.


dx034 2349 days ago

But then it's far from basically free. Even overpriced Oracle databases can end up cheaper than locking into AWS in these cases (in my experience).


derefr 2349 days ago

I think the presumption that's differing here is query workload.

An OLAP database is, in the default case, an always-online instance or cluster, costing fixed monthly OpEx.

Whereas, if your goal in having that database is to do one query once a month based on a huge amount of data, then it will certainly be cheaper to have an analytical pipeline that is "offline" except when that query is running, with only the OLTP stage (something ingesting into S3; maybe even customers writing directly to your S3 bucket at their own Requester-Pays expense) online.
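To make the trade-off concrete, here is a rough sketch of the break-even arithmetic. It assumes a simplified model where the always-on cluster bills a flat hourly rate and the offline pipeline bills only per terabyte scanned; the function names and the numbers in the usage example are hypothetical placeholders, not real AWS or Oracle prices, and storage and ingestion costs are ignored.

```python
def monthly_cost_always_on(hourly_rate, hours_per_month=730):
    """Fixed OpEx: the cluster bills for every hour, whether or not it is queried."""
    return hourly_rate * hours_per_month


def monthly_cost_on_demand(queries_per_month, tb_scanned_per_query, price_per_tb):
    """Pay-per-query: cost grows with the number of queries and the data each scans."""
    return queries_per_month * tb_scanned_per_query * price_per_tb


def break_even_queries(hourly_rate, tb_scanned_per_query, price_per_tb,
                       hours_per_month=730):
    """Queries per month at which the two models cost the same."""
    fixed = monthly_cost_always_on(hourly_rate, hours_per_month)
    return fixed / (tb_scanned_per_query * price_per_tb)


if __name__ == "__main__":
    # Illustrative numbers only (assumptions, not quoted prices).
    fixed = monthly_cost_always_on(hourly_rate=1.0)
    on_demand = monthly_cost_on_demand(queries_per_month=1,
                                       tb_scanned_per_query=10.0,
                                       price_per_tb=5.0)
    threshold = break_even_queries(hourly_rate=1.0,
                                   tb_scanned_per_query=10.0,
                                   price_per_tb=5.0)
    print(f"always-on: {fixed:.2f}, on-demand: {on_demand:.2f}, "
          f"break-even at {threshold:.1f} queries/month")
```

Below the break-even query count the offline pipeline wins; above it, the always-on instance does. Which side you land on depends entirely on the workload you plug in.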


billman 2349 days ago

My biggest problem with Oracle is not the database itself. There is no doubt that Oracle is a fine piece of software: it is bulletproof and has decades of experience built into it.

My problem is the scalability and elasticity of its licensing model. It doesn't meet the needs of today's analytics without spending enormous amounts of money up front.


cjalmeida 2349 days ago

Nope. One can easily start with Airflow + Spark (EMR) + Presto + S3 and get about 80% of what you'd get from your run-of-the-mill Oracle database, at a fraction of the price and without half the headache in procurement, licensing or performance tweaking. And with better scalability.

You'd be looking at $M in licenses for anything half-serious based on Oracle tech. Becoming good at replacing Oracle stuff has probably been one of the best-paying jobs for a while.
