| HN Mirror

	by shoyer 1963 days ago
	It’s worth noting that the author of this report is Peter Baumann, the author of Rasdaman. So it should be no surprise that Rasdaman comes out on top in the various benchmarks and is presented as the leading “array database.” My two cents (as the author of Xarray, one of the Python libraries mentioned in this report) is that it’s questionable whether we need “array databases” at all. Certainly we need to be able to store arrays and compute with them, but do we need an integrated solution that does both at the same time with a query language that looks like SQL? Maybe not, in an era of cloud computing and prolific open source software, when everyone who works with big array datasets already knows Python.

5 comments

misev 1963 days ago

You cannot automatically conclude that rasdaman comes out on top because the author is also involved in its development, although it may look suspicious. I am also one of the authors and contributed to the benchmarks: our goal was to configure and implement the queries in the best way for each system and achieve comparable results. Note that this report was done almost three years ago, so the results may be getting out of date.

This query language that looks like SQL is an official part of SQL now [1]. Surely there is a place for integrated DB solutions that let you work with both relational and array data in one place? There are more benefits to this than just performance/scalability. Think of building services on top of big array datasets, beyond one-off data science experiments.

1. https://www.iso.org/standard/67382.html

snidane 1963 days ago

I agree that array processing probably doesn't need to live in a database, but I think databases should base their foundations on arrays.

In kdb you go from primitives to arrays to tables. In SQL you go from primitives straight to tables, which makes it cumbersome to do any simple one-column or array ops, such as excluding a column from a select expression.

Compare first-class array support in SQL vs. hypothetical programmable SQL:

    select table.* - {name, age}
    from table

    select (
        select column
        from table
        where column
          not in ('name', 'age')
    )
    from table

mamcx 1963 days ago

> In SQL you go from primitives straight to tables

This makes a lot of sense, because primitives ARE "tables" and columns ARE "tables".

A primitive is a relation of one column and one row. This is what allows you to do:

    SELECT * FROM (SELECT 1)a

What SQL/RDBMSs don't do well is exploit this.
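A small sketch of the point, run against an in-memory SQLite database through Python's `sqlite3` module (the table `t` and the alias `k` are made up for illustration): the scalar comes back as a one-row, one-column relation, and it can be joined like any other table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A bare scalar is a relation with one row and one column.
scalar_as_relation = conn.execute("SELECT * FROM (SELECT 1) AS a").fetchall()
print(scalar_as_relation)  # [(1,)]

# Because it is a relation, it can take part in a join like any table.
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
scaled = conn.execute(
    "SELECT t.x * k.factor "
    "FROM t CROSS JOIN (SELECT 10 AS factor) AS k "
    "ORDER BY t.x"
).fetchall()
print(scaled)  # [(10,), (20,), (30,)]
```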

joppy 1963 days ago

Xarray is absolutely sublime by the way, thank you for your work there. I stuffed around with multi-indices in pandas for a good while before finding Xarray and instantly having all my problems solved :)

RyanHamilton 1963 days ago

If you can't compute where the data is, you will end up having to pull back all the data to calculate against it. Assuming the cost of transfer for the full data set is >50% compared to performing at least some calculations that reduce the size, it's worth it.
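A back-of-the-envelope sketch of that trade-off, counting only bytes that would cross the network. The two function names and the one-million-element array are hypothetical, and the sketch deliberately ignores compute cost, latency, and compression.

```python
from array import array


def bytes_moved_pull_everything(values):
    # The client fetches the raw array and reduces it locally,
    # so every element crosses the wire.
    return len(values) * values.itemsize


def bytes_moved_compute_at_source(values):
    # The store reduces the array to a single mean (one C double)
    # and only that result crosses the wire.
    result = array("d", [sum(values) / len(values)])
    return len(result) * result.itemsize


data = array("d", range(1_000_000))  # one million C doubles
pull = bytes_moved_pull_everything(data)
push = bytes_moved_compute_at_source(data)
print(pull, push)
```

For a full reduction like a mean the difference is extreme; for queries that return most of the data, the two numbers converge and the argument weakens.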

snidane 1963 days ago

This is Stonebraker's argument for shared-nothing architecture, and it applies well to interactive ad-hoc analytics on well-structured data.

Many orgs these days store all their data in data-lake, shared-disk architectures and pull down the subsets they need. The performance hit of pulling data over a high-bandwidth channel such as S3 to EC2 is much more acceptable to companies than storing everything on expensive compute instances just so that the "data would be there", ready for querying if somebody ever needs it.

neolog 1963 days ago

I'm curious how Google is making money with array technologies. Is it mostly marketing to get geodata people to use GCP, or is there some other product?

shoyer 1963 days ago

The short answer is that I wrote Xarray before I showed up at Google :)