by shoyer 1916 days ago
It’s worth noting that the author of this report is Peter Baumann, the author of Rasdaman. So it should be no surprise that Rasdaman comes out on top in the various benchmarks and is presented as the leading “array database.”

My two cents (as the author of Xarray, one of the Python libraries mentioned in this report) is that it’s questionable whether we need “array databases” at all. Certainly we need to be able to store arrays and compute with them, but do we need an integrated solution that does both at the same time with a query language that looks like SQL? Maybe not, in an era of cloud computing, prolific open source software and when everyone who works with big array datasets already knows Python.

5 comments

You cannot automatically conclude that rasdaman comes out on top because the author is also involved in its development, although it may look suspicious. I am also one of the authors and contributed to the benchmarks: our goal was to configure each system and implement the queries in the best way for it, and to achieve comparable results. Note that this report was done almost three years ago, so the results may be getting out of date.

This query language that looks like SQL is an official part of SQL now [1]. Surely there is a place for integrated DB solutions that let you work with both relational and array data in one place? There are more benefits in this than just performance/scalability. Think of building services on top of big array datasets, beyond one-off data science experiments.

1. https://www.iso.org/standard/67382.html

I agree that array processing probably doesn't need to live in a database, but I think databases should base their foundations on arrays.

In kdb you go from primitives to arrays to tables. In SQL you go from primitives straight to tables, which makes it cumbersome to do any simple one-column or array ops, such as excluding a column from a select expression.

Compare first-class array support in SQL vs. hypothetical programmable SQL:

    select table.* - {name, age}
    from table
vs

    select (
        select column
        from table
        where column
          not in ('name', 'age')
    )
    from table
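Neither form above is standard SQL, so in practice the column list usually gets built on the client. A minimal sketch of that workaround in Python with SQLite follows; the `select_except` helper and the `people` table are made up for illustration, and the table name is interpolated into the query, so it has to come from trusted code.

```python
import sqlite3


def select_except(conn, table, excluded):
    """Return (column_names, rows) for every column of `table` not in `excluded`.

    Hypothetical helper: `table` is interpolated into the SQL text, so only
    pass names that come from trusted code, never from user input.
    """
    excluded = set(excluded)
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    all_columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    kept = [name for name in all_columns if name not in excluded]
    if not kept:
        raise ValueError("every column was excluded")
    column_sql = ", ".join(f'"{name}"' for name in kept)
    rows = conn.execute(f"SELECT {column_sql} FROM {table}").fetchall()
    return kept, rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT, age INTEGER, city TEXT)")
conn.execute("INSERT INTO people VALUES (1, 'Ada', 36, 'London')")

columns, rows = select_except(conn, "people", {"name", "age"})
print(columns, rows)
```

This takes two round trips (one for the schema, one for the data), which is roughly the cumbersomeness being described.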

> In SQL you go from primitives straight to tables

This makes a lot of sense, because primitives ARE "tables" and columns ARE "tables".

A primitive is a relation of one column and one row. This is what allows you to do:

    SELECT * FROM (SELECT 1)a

What SQL/RDBMSs do not do well is exploit this.
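
A small sketch of the "a primitive is a one-row, one-column relation" point, using Python's built-in `sqlite3` module. The alias `value` and the table `t` are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The scalar 1 comes back as a relation with one column and one row.
cursor = conn.execute("SELECT * FROM (SELECT 1 AS value) AS a")
scalar_columns = [description[0] for description in cursor.description]
scalar_rows = cursor.fetchall()

# Because it is a relation, it can be joined like any other table.
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
joined = conn.execute(
    "SELECT t.x + a.value FROM t, (SELECT 1 AS value) AS a ORDER BY t.x"
).fetchall()

print(scalar_columns, scalar_rows, joined)
```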

Xarray is absolutely sublime by the way, thank you for your work there. I stuffed around with multi-indices in pandas for a good while before finding Xarray and instantly having all my problems solved :)

If you can't compute where the data is, you will end up having to pull back all the data to calculate against it. Assuming the cost of transfer for the full data set is >50% compared to performing at least some calculations that reduce the size, it's worth it.
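
To make the trade-off concrete, here is a toy sketch with an in-memory SQLite database standing in for the remote store. The `readings` table and its contents are invented, and nothing crosses a network here, so row counts are only a stand-in for transfer cost.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(i % 4, float(i)) for i in range(1000)],
)

# Option A: pull every row back, then reduce on the client.
all_rows = conn.execute("SELECT sensor, value FROM readings").fetchall()
client_totals = {}
for sensor, value in all_rows:
    client_totals[sensor] = client_totals.get(sensor, 0.0) + value

# Option B: reduce where the data lives and pull back only the result.
reduced_rows = conn.execute(
    "SELECT sensor, SUM(value) FROM readings GROUP BY sensor"
).fetchall()
server_totals = dict(reduced_rows)

print(len(all_rows), "rows pulled vs", len(reduced_rows), "rows pulled")
```

Both options give the same totals; the difference is how many rows have to leave the store to get them.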

This is Stonebraker's argument for shared-nothing architecture, and it applies well to interactive ad-hoc analytics on well-structured data.

Many orgs these days store all their data in data-lake, shared-disk architectures and pull down the subsets they need. To these companies, the performance hit of pulling data over a high-bandwidth channel such as S3 to EC2 is much more acceptable than storing everything on expensive compute instances just so that the "data would be there", ready for querying if somebody ever needs it.

I'm curious how Google is making money with array technologies. Is it mostly marketing to get geodata people to use GCP, or is there some other product?

The short answer is that I wrote Xarray before I showed up at Google :)