HN Mirror

by felipe_aramburu 2507 days ago

This is a distributed SQL engine, not a database. We store no data. You store your data in HDFS, S3, POSIX, NFS, etc. We let you query the file formats you already have, directly from these filesystems. You can see the file formats cuDF supports here: https://github.com/rapidsai/cudf/tree/branch-0.9/cpp/src/io

You can try it out yourself here https://colab.research.google.com/drive/1r7S15Ie33yRw8cmET7_...

Or use dockerhub https://hub.docker.com/r/blazingdb/blazingsql/

The benefits are:

Greatly increased processing capacity. With the GPUs we are using, we can perform orders of magnitude more instructions per second than a CPU.

Decompression and parsing of formats like CSV and Parquet happen on the GPU, orders of magnitude faster than the best CPU alternatives.

You can take the output of your queries and provide it to machine learning jobs with zero-copy IPC and get the results back the same way. We are all about interoperability with the rapidsai ecosystem.
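To make the zero-copy idea concrete, here is a minimal CPU-side analogy using only the Python standard library. It is not BlazingSQL or cuDF code, and GPU IPC works on device memory rather than host shared memory; the function names `producer_write` and `consumer_sum` are hypothetical. The point it sketches is that the consumer attaches to the producer's buffer by name and reads it in place, rather than receiving a serialized copy.

```python
import struct
from multiprocessing import shared_memory


def producer_write(values):
    """Write a list of doubles into a new shared-memory block (hypothetical 'query output')."""
    shm = shared_memory.SharedMemory(create=True, size=8 * len(values))
    struct.pack_into(f"{len(values)}d", shm.buf, 0, *values)
    return shm


def consumer_sum(name, count):
    """Attach to the existing block by name and read the doubles in place."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        # The block itself is not copied; only the decoded Python floats are new objects.
        return sum(struct.unpack_from(f"{count}d", shm.buf, 0))
    finally:
        shm.close()


if __name__ == "__main__":
    block = producer_write([1.0, 2.5, 3.5])
    try:
        print(consumer_sum(block.name, 3))
    finally:
        block.close()
        block.unlink()
```

In a real pipeline the consumer would be a separate process (here it runs in the same one for brevity), and only the block's name and length would need to be passed between them.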

2 comments

chrisjc 2507 days ago

Is there any reason why a SQL format isn't in that list? Wondering if there's a way to join SQL sources with file storage sources. An example of this would be filtering or enrichment operations.

// sorry if this is a stupid question.

felipe_aramburu 2506 days ago

When you say SQL format, do you mean being able to read the output of a JDBC or ODBC driver? If so, the reason is mostly just time. You are not the first person to ask about this, and now that there are Java bindings in cuDF, this might become easier to make a reality in the next few months.

Or do you mean being able to read a database's file format natively? If so, there are many reasons: 1. Many formats are poorly documented or undocumented. 2. Even if you decide to read some other DB's format natively, those formats change over time. 3. You have little control over how and where the data is laid out.

roaramburu 2507 days ago

Not a stupid question. The reason is priorities, but it is definitely our ideal to do predicate pushdown and join databases to files, streams, etc.
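For readers unfamiliar with the idea, here is a minimal sketch of joining a SQL source to a file source with predicate pushdown. It uses only Python's standard `sqlite3` and `csv` modules and does not reflect BlazingSQL's design; the table, the CSV contents, and the function names are all hypothetical. The filter runs inside the database, so only matching rows leave it before the join against the file.

```python
import csv
import io
import sqlite3

# Hypothetical file source: orders stored as CSV.
ORDERS_CSV = """order_id,customer_id,amount
1,10,25.0
2,11,40.0
3,10,15.5
4,12,99.0
"""


def make_db():
    """Build a hypothetical SQL source: an in-memory customers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(10, "Ana", "PE"), (11, "Bob", "US"), (12, "Chen", "PE")],
    )
    return conn


def join_orders_with_customers(conn, csv_text, country):
    """Join CSV orders to customers from one country."""
    # Predicate pushdown: the WHERE clause is evaluated by the database,
    # so non-matching customers are never transferred.
    rows = conn.execute(
        "SELECT id, name FROM customers WHERE country = ?", (country,)
    ).fetchall()
    names = dict(rows)

    # Hash join: probe the filtered customers while streaming the file.
    result = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        cid = int(rec["customer_id"])
        if cid in names:
            result.append((int(rec["order_id"]), names[cid], float(rec["amount"])))
    return result


if __name__ == "__main__":
    for row in join_orders_with_customers(make_db(), ORDERS_CSV, "PE"):
        print(row)
```

The alternative, pulling the whole table out and filtering afterwards, gives the same answer but moves more data, which is the cost pushdown is meant to avoid.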

ohnoesjmr 2506 days ago

I've read the website, but I couldn't find a hint that the engine is distributed. Even the Spark benchmarks compare a single instance with multiple nodes.

Is it distributed? How do I set it up in distributed mode? Does it support nested Parquet (something that even Spark itself struggles to support inside SQL)?

roaramburu 2506 days ago

Distributed mode is getting released in the next few days; I've been playing with it over the past week.

Right now we use k8s on Google Kubernetes Engine (GKE) to deploy in distributed mode.

We don't support nested data at present; there are RAPIDS teams looking into this.
