	by bsamuels 2513 days ago
	What kind of benefits does CUDA bring to databases? I've never heard of running a database on a GPU before. Couldn't find anything on their homepage other than comparison with a few other db options

5 comments

felipe_aramburu 2513 days ago

This is a distributed SQL engine, not a database. We store no data. You store your data in HDFS, S3, POSIX, NFS, etc. We let you query the file formats you already have directly from these filesystems. You can look here to see the file formats cudf supports: https://github.com/rapidsai/cudf/tree/branch-0.9/cpp/src/io

You can try it out yourself here https://colab.research.google.com/drive/1r7S15Ie33yRw8cmET7_...

Or use dockerhub https://hub.docker.com/r/blazingdb/blazingsql/

The benefits are:

Greatly increased processing capacity. With the GPUs we are using, we can perform orders of magnitude more instructions per second than a CPU.

Decompression and parsing of formats like CSV and Parquet happen on the GPU, orders of magnitude faster than the best CPU alternatives.

You can take the output of your queries and provide it to machine learning jobs with zero-copy IPC and get the results back the same way. We are all about interoperability with the rapidsai ecosystem.
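To make the "zero copy" idea concrete, here is a minimal CPU-side, in-process analogy in plain Python. It is not BlazingSQL or cuDF code and does not use CUDA IPC; the names `result_column`, `copied`, and `shared` are made up for illustration. It only shows the principle: the consumer gets a view onto the producer's buffer instead of a duplicate of the bytes.

```python
from array import array

# Hypothetical "query result": one column of 64-bit floats in a contiguous buffer.
result_column = array("d", [1.0, 2.0, 3.0, 4.0])

# A copy duplicates the bytes; a memoryview only references the producer's buffer.
copied = array("d", result_column)
shared = memoryview(result_column)

# The stand-in "ML job" scales the values in place through the shared view.
for i in range(len(shared)):
    shared[i] = shared[i] * 10.0

print(result_column.tolist())  # the producer sees the update, so nothing was copied
print(copied.tolist())         # the explicit copy is unaffected
```

The same trade-off applies between processes or between GPU libraries: handing over a reference avoids moving the data, at the cost of both sides sharing one mutable buffer.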

chrisjc 2513 days ago

Is there any reason why a SQL format isn't in that list? Wondering if there's a way to join SQL sources with file storage sources. An example of this would be filtering or enrichment operations.

// sorry if this is a stupid question.

felipe_aramburu 2513 days ago

When you say SQL format, do you mean being able to read the output of a JDBC or ODBC driver? If this is the case, then it is mostly just a matter of time. You are not the first person to ask about this, and now that there are Java bindings in cudf this might become easier to make a reality in the next few months.

Or do you mean being able to read a database's file format natively? If this is the case there are many reasons. 1. Many formats are poorly documented or undocumented 2. Even if you decide to read some other DB's format natively, those formats change over time 3. Little control over how and where the data is laid out

roaramburu 2513 days ago

Not a stupid question. The reason is priorities, but it is definitely our ideal to do predicate pushdown and join databases to files, streams, etc.
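For anyone unfamiliar with predicate pushdown, here is a rough sketch in plain Python. It is not how BlazingSQL implements anything; `DB_ORDERS`, `FILE_CUSTOMERS`, `scan`, `join`, and `is_big_order` are all hypothetical names standing in for a database table, a file, and the engine's operators. The point is that filtering at the source means fewer rows have to leave it before the join, while the final result stays the same.

```python
# Hypothetical sources: a database table and a file, both modeled as lists of dicts.
DB_ORDERS = [
    {"order_id": 1, "customer_id": 10, "amount": 250},
    {"order_id": 2, "customer_id": 11, "amount": 40},
    {"order_id": 3, "customer_id": 10, "amount": 900},
    {"order_id": 4, "customer_id": 12, "amount": 15},
]
FILE_CUSTOMERS = [
    {"customer_id": 10, "name": "Ada"},
    {"customer_id": 11, "name": "Grace"},
    {"customer_id": 12, "name": "Edsger"},
]


def scan(rows, predicate=None):
    """Simulate a source scan; a pushed-down predicate filters at the source."""
    return [r for r in rows if predicate is None or predicate(r)]


def join(left, right, key):
    """Simple hash join: index the right side, then probe with the left side."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def is_big_order(row):
    return row["amount"] > 100


# Without pushdown: every order leaves the source, then join, then filter.
all_orders = scan(DB_ORDERS)
late = [r for r in join(all_orders, FILE_CUSTOMERS, "customer_id") if is_big_order(r)]

# With pushdown: the source filters first, so fewer rows are transferred.
pushed_orders = scan(DB_ORDERS, predicate=is_big_order)
early = join(pushed_orders, FILE_CUSTOMERS, "customer_id")

print(len(all_orders), len(pushed_orders))  # rows leaving the source in each plan
print(early == late)                        # both plans give the same result
```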

ohnoesjmr 2513 days ago

I've read the website, but I couldn't find a hint that the engine is distributed. Even the Spark benchmarks compare a single instance with multiple nodes.

Is it distributed? How do I set it up in a distributed mode? Does it support nested Parquet (something that even Spark itself struggles to support inside SQL)?

roaramburu 2513 days ago

Distributed is getting released in the next few days, I've been playing with it over the past week.

Right now we use k8s on Google Kubernetes Engine (GKE) to deploy in distributed mode.

We don't support nested Parquet at present; there are Rapids teams looking into this.

reilly3000 2513 days ago

Check out https://www.omnisci.com/learn/resources/gpu-database

In summary, you get snappy, interactive query speeds on large data sets. I've run that locally and the results are pretty amazing compared to Postgres or even Tableau in-memory.

I'm personally more excited about GPUs in stream processing; it's just quite a natural fit: https://github.com/rapidsai/cudf

kichik 2513 days ago

If you're interested in stream processing, check out FASTDATA.io PlasmaENGINE. We do both stream and batch processing with Apache Spark on the GPU.

https://fastdata.io/plasma-engine/

* It's not open-source and I work there.

arnon 2513 days ago

Hi Kichik :)

felipe_aramburu 2513 days ago

BlazingSQL is built on top of cuDF. We are contributors to rapidsai.

throwaway082729 2513 days ago

Isn't the speed bottlenecked by the storage speed? Is the data fully loaded into memory first?

tmostak 2513 days ago

OmniSci transparently caches data across the memory of the CPUs and GPUs on a server, so after the initial read, it is likely that the data for subsequent queries will be in memory.

We've also optimized our storage formats and multithreaded our disk reads, such that we can easily hit many gigabytes per second on flash storage. Plus, new persistent memory technologies like Intel Optane will enable even more instant reads from "cold" storage.
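As a generic illustration of multithreaded disk reads (not OmniSci's actual implementation), the sketch below splits a file into byte ranges and reads them concurrently with a thread pool. The helper names `read_chunk` and `parallel_read`, the chunk size, and the worker count are all assumptions chosen for the example.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_chunk(path, offset, length):
    """Read one byte range; each worker opens its own file handle."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def parallel_read(path, chunk_size, workers=4):
    """Read a file as fixed-size chunks in parallel and reassemble them in order."""
    size = os.path.getsize(path)
    offsets = range(0, size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the order of the offsets, not completion order.
        chunks = pool.map(lambda off: read_chunk(path, off, chunk_size), offsets)
        return b"".join(chunks)


# Demo on a throwaway file standing in for a column file on flash storage.
payload = bytes(range(256)) * 64
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

data = parallel_read(demo_path, chunk_size=4096)
os.remove(demo_path)
print(data == payload)  # the reassembled bytes match what was written
```

Whether this actually helps depends on the device: flash storage can serve many outstanding requests at once, which is what makes issuing reads from several threads worthwhile.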

arnon 2513 days ago

CUDA by itself brings easy-to-run parallel algorithms. It's not of much value for databases unless you have a proper infrastructure set up to use it correctly. Same is true for columnar aspects, for example.

People have been building columnar databases to do analytics quickly. GPUs (with CUDA) can run analytics operations (think join, group by, math, sorting) on columnar data in a much more efficient manner. They're designed for operations on vectors, which columns are.

We've been doing this ourselves too with SQream DB: https://sqream.com. It's an enterprise data warehouse with GPU acceleration. We use CUDA exclusively too.
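To illustrate the point about columns being vectors, here is a small sketch in plain Python with made-up data (it is not SQream DB code and runs on the CPU). In a columnar layout, a group-by sum only touches the two columns it needs and walks each one as a contiguous sequence, which is the access pattern that maps well onto wide parallel hardware such as GPUs.

```python
from collections import defaultdict

# Row layout: one record per row, all fields interleaved.
rows = [
    {"region": "east", "sales": 10.0, "note": "a"},
    {"region": "west", "sales": 5.0, "note": "b"},
    {"region": "east", "sales": 2.5, "note": "c"},
    {"region": "west", "sales": 1.5, "note": "d"},
]

# Columnar layout: one contiguous sequence per column.
columns = {
    "region": [r["region"] for r in rows],
    "sales": [r["sales"] for r in rows],
    "note": [r["note"] for r in rows],
}


def group_sum(keys, values):
    """Sum `values` per distinct key, reading only the two columns involved."""
    totals = defaultdict(float)
    for k, v in zip(keys, values):
        totals[k] += v
    return dict(totals)


# SELECT region, SUM(sales) ... GROUP BY region, without ever touching "note".
totals = group_sum(columns["region"], columns["sales"])
print(totals)
```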

taf2 2513 days ago

See https://wiki.postgresql.org/images/6/65/Pgopencl.pdf

Also from 4 years ago

https://news.ycombinator.com/item?id=10151632

fanf2 2513 days ago

PG-Strom is a GPU accelerator extension for PostgreSQL which has been around for a few years now. I have not tried it myself... http://heterodb.github.io/pg-strom/