by codeflo 1115 days ago
I haven't heard of either of those companies. I don't even fully understand what Databricks does. But it's clear that they have no problem shutting down a production database offering with 30 days notice, and have the gall to title this action "Investing in the Developer Experience". If this doesn't send a message that you shouldn't trust them with anything important, I don't know what would.
7 comments

> what Databricks does

It's an ancient African word that means "I am because I can't install Apache Spark".

Just install Apache Spark they said. It will be fun they said.

If you have the money, having a managed Spark instance with a bunch of added features can be a big win for some. There is a lot that goes into Spark maintenance.

It also apparently includes some performance optimizations because they control both the hardware and software. Delta Lake is pretty cool, and so is the hosted MLflow integration.
Databricks built a proprietary vectorized accelerator for Spark they call Photon. It's not just that they've tuned OSS Spark especially well.
Back when I was a customer (before Photon was released, and also after) they had very good tuning, on the order of 2x faster for the workloads we had at the time (very large graph computation and a “simple” filtering).
Databricks is a company founded by the people who built Spark.

They've extended it, and their platform does a lot now.

What is Spark?

I assume that’s Apache Spark, which is described as a “unified analytics engine for large-scale data processing”

Still not clear to me what to use it for :-/

It is Apache Spark. It's a framework that allows processing large amounts of data in parallel on a cluster of computers.
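To make "processing data in parallel" a bit more concrete, here is a minimal sketch in plain Python. It is not Spark and uses none of its API: threads on one machine stand in for cluster nodes, and the function names (`split_into_partitions`, `count_words`, `parallel_word_count`) are made up for illustration. The idea it shows is the general one: split the data into partitions, do the same work on each partition independently, then combine the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Work done independently on one partition (one pretend 'node')."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines, num_partitions=4):
    """Count words per partition in parallel, then merge the partial counts."""
    partitions = split_into_partitions(lines, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


lines = ["spark processes data", "data in parallel", "parallel data"]
word_counts = parallel_word_count(lines, num_partitions=2)
print(word_counts.most_common(2))
```

A real cluster framework also has to handle moving data between machines, retries, and scheduling, which this sketch ignores entirely.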

You can do batch processing, streaming, machine learning and graph jobs. You usually write your code in Scala, Java, Python or R. The engine itself is written in Scala and runs on the JVM, so the work ends up executing there. For example, in Python you'd use PySpark, whose DataFrame calls are handed over to the JVM engine, which then executes them.

I mainly work in Python, so I'm going to talk about some features there. It supports dataframes and exposes the data as Spark DataFrames. You build up operations, and those lazily build a DAG. It's not until you execute, save, or ask to see the data that it actually starts executing the DAG, after optimizing what it needs.
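The "nothing runs until you ask for the data" part can be sketched with a toy class in plain Python. This is not PySpark: `ToyLazyFrame` and its methods are hypothetical names invented for this example, and unlike Spark it does no plan optimization at all. It only shows the general pattern of recording operations first and running them later.

```python
class ToyLazyFrame:
    """Records operations as a plan; nothing runs until collect()."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (kind, function) steps, in order

    def filter(self, predicate):
        # Only append a step to the plan; the predicate is not called here.
        return ToyLazyFrame(self._rows, self._plan + (("filter", predicate),))

    def with_column(self, name, fn):
        def add_column(row):
            return {**row, name: fn(row)}
        return ToyLazyFrame(self._rows, self._plan + (("map", add_column),))

    def collect(self):
        """Run every recorded step over every row and return the result."""
        result = []
        for row in self._rows:
            keep = True
            for kind, fn in self._plan:
                if kind == "filter":
                    if not fn(row):
                        keep = False
                        break
                else:
                    row = fn(row)
            if keep:
                result.append(row)
        return result


evaluated = []  # records which rows the predicate has actually looked at


def is_large(row):
    evaluated.append(row["id"])
    return row["amount"] > 10


rows = [{"id": 1, "amount": 5}, {"id": 2, "amount": 20}, {"id": 3, "amount": 50}]
frame = ToyLazyFrame(rows).filter(is_large).with_column(
    "double", lambda r: r["amount"] * 2
)
calls_before_collect = len(evaluated)  # still 0: only the plan exists so far
result = frame.collect()               # now the plan actually runs
print(calls_before_collect, result)
```

Because the whole plan is known before anything executes, a real engine gets the chance to reorder or merge steps first; the toy above just runs them in the order they were written.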

If you need something that Spark doesn't support, you can use regular Python, but code that isn't expressed as Spark operations (for example, anything you run after pulling the data back to the driver) runs on only one node and is limited. So you have to rewrite your code with that in mind.

You can process some data in memory, you can use disk, you can use databases. Either as source or targets.

A use case can be: load the raw data as it comes in, transform it into your intermediate states, then write out different tables depending on what each one is needed for.
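As a rough illustration of that raw → intermediate → several output tables flow, here is a tiny single-machine version in plain Python. The sample data, column names, and function names are all made up for the example, and there is nothing Spark-specific in it; with Spark the same shape of pipeline would run over far more data on a cluster.

```python
import csv
import io
from collections import defaultdict

# Made-up raw input, including one bad row that the cleaning step drops.
RAW_CSV = """user,country,amount
alice,DE,10.5
bob,US,3
alice,DE,4.5
carol,US,not_a_number
"""


def load_raw(text):
    """Load the raw data as-is: every value is still a string."""
    return list(csv.DictReader(io.StringIO(text)))


def to_intermediate(raw_rows):
    """Clean and type the raw rows, skipping rows with an unparsable amount."""
    clean = []
    for row in raw_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        clean.append(
            {"user": row["user"], "country": row["country"], "amount": amount}
        )
    return clean


def total_by(rows, key):
    """Build one output 'table': total amount grouped by the given column."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)


raw = load_raw(RAW_CSV)
clean = to_intermediate(raw)
spend_by_user = total_by(clean, "user")        # one table for one consumer
spend_by_country = total_by(clean, "country")  # another table for another
print(spend_by_user, spend_by_country)
```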

---

It's a framework that has an engine to manage code running on clusters, a language to interact with the data, abstractions and optimizations of the code, ways to store the data, checkpoints for optimizations, and other things.

Wow you are right. The blog post doesn't even mention it but the home page https://bit.io/ does.
Slight oversimplification but Apache Spark is basically the "open core" to Databricks' commercial platform.
It was probably an acqui-hire. If the product had been growing at a VC-investable rate, they wouldn't have sunset it. Alternatively, maybe they are going to rebrand it into something that aligns with Databricks.
> But it's clear that they have no problem shutting down a production database offering with 30 days notice

Maybe there is no production db left from paying customers?

The homepage suggests otherwise, but who knows: https://bit.io/
> I don't even fully understand what Databricks does

The naming is really confusing. When I brick my console it's broken. I'm not sure I want to brick my data :(