|
|
|
|
|
by MSM
2302 days ago
|
|
S3 != a data lake. If you're able to solve the problems that you were previously using oracle or SQL Server for with S3, more power to you, but the truth is that to replicate the functionality of that old Oracle server you'll start with S3, but you'll also want some querying (Aurora? RDS? Hbase?), probably some analytics and ingestion (Redshift? Kinesis? Elastic? Hive? Oozie? Airflow?), along with some security now that you've got multiple tools interacting (Ranger? Knox?), probably some load balancing (Zookeeper?), maybe some lineage and data cataloging (Atlas?), etc. In my experience what starts with "Just throw some data in S3, forget that old crusty expensive server!" ends with 22 technologies trying to cohesively exist because each one provides a small but necessary slice of your platform. Your organization will never be able to find one person who is an expert in all of these (on the contrary, you can find an Oracle, or DB2, or SQL Server expert for half the money) so you end up with seven folks who are each an expert in three of the 22 pieces you've cobbled together, but they all have slightly different ideas on how things should work together, so you end up with a barely functioning platform after a year's worth of work because you didn't want to just start with a $400k license from Oracle. |
|
If you have S3 you can use Athena, Redshift Spectrum or Spark as query layer. It's not 22 technologies.
You don't need ElasticSearch, Ranger, Knox, Zookeeper etc as they have nothing to do with querying.