What kind of benefits does CUDA bring to databases? I've never heard of running a database on a GPU before. I couldn't find anything on their homepage other than comparisons with a few other db options.
This is a distributed SQL engine, not a database. We store no data. You store your data in HDFS, S3, POSIX, NFS, etc. We let you query the file formats you already have, directly from these filesystems. You can see the file formats cudf supports here: https://github.com/rapidsai/cudf/tree/branch-0.9/cpp/src/io
Greatly increased processing capacity. With the GPUs we are using, we can perform orders of magnitude more instructions per second than a CPU.
Decompression and parsing of formats like CSV and Parquet happen on the GPU, orders of magnitude faster than the best CPU alternatives.
You can take the output of your queries and provide it to machine learning jobs with zero-copy IPC and get the results back the same way. We are all about interoperability with the rapidsai ecosystem.
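To make "zero copy" a bit more concrete, here is a CPU-only analogy using Python's standard multiprocessing.shared_memory module. It is not how cudf or CUDA IPC work internally; it only sketches the idea that a consumer can read a producer's results through a shared buffer instead of receiving a serialized copy. The names publish, consume_sum and demo are made up for this illustration.

```python
import array
from multiprocessing import shared_memory

ITEM = array.array("d").itemsize  # bytes per float64 value


def publish(values):
    """Producer: write float64 results once into a named shared block."""
    data = array.array("d", values)
    nbytes = len(data) * ITEM
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    shm.buf[:nbytes] = data.tobytes()
    return shm, len(data)


def consume_sum(name, count):
    """Consumer: attach by name and read the values through a view.

    The memoryview reads the shared block in place; the values are
    not serialized or copied into a new buffer first.
    """
    shm = shared_memory.SharedMemory(name=name)
    try:
        with shm.buf[:count * ITEM] as raw, raw.cast("d") as view:
            return sum(view)
    finally:
        shm.close()


def demo():
    shm, count = publish([1.5, 2.5, 4.0])
    try:
        return consume_sum(shm.name, count)
    finally:
        shm.close()
        shm.unlink()  # free the block once everyone is done
```

In a real setup the consumer would be a separate process that only receives the block's name and length, which is the part that plays the role of the IPC handle.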
Is there any reason why a SQL format isn't in that list? I'm wondering if there's a way to join SQL sources with file storage sources. An example of this would be filtering or enrichment operations.
When you say SQL format, do you mean being able to read the output of a JDBC or ODBC driver?
If this is the case, then mostly just time. You are not the first person to ask about this, and now that there are Java bindings in cudf this might become easier to make a reality in the next few months.
Or do you mean being able to read a database's file format natively?
If this is the case there are many reasons.
1. There are many poorly documented or undocumented formats
2. Even if you decide to read some other DB's format natively, those formats change over time
3. Little control of how and where the data is laid out
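For the first interpretation (reading rows through a driver and joining them with file data), the enrichment use case from the question can be sketched on the CPU with just the Python standard library. This is not BlazingSQL or cudf code. sqlite3 stands in for whatever a JDBC/ODBC driver would return, and the table name, column names and helper functions are all hypothetical.

```python
import csv
import io
import sqlite3


def make_sql_source():
    """Stand-in for a database reached through a driver."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "alice"), (2, "bob")],
    )
    return conn


def load_customers(conn):
    """Pull the SQL side into a lookup table keyed by customer id."""
    rows = conn.execute("SELECT id, name FROM customers").fetchall()
    return {customer_id: name for customer_id, name in rows}


def enrich(csv_text, customers):
    """Join file records against the SQL lookup (inner join on id)."""
    enriched = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        customer_id = int(record["customer_id"])
        if customer_id in customers:  # drop rows with no SQL match
            enriched.append(
                {"customer": customers[customer_id],
                 "amount": float(record["amount"])}
            )
    return enriched


# A small in-memory "file" standing in for data on S3/HDFS/etc.
ORDERS_CSV = "customer_id,amount\n1,10.5\n3,99.0\n2,4.5\n"
```

Calling enrich(ORDERS_CSV, load_customers(make_sql_source())) keeps the two orders that have a matching customer and drops the one for the unknown id 3.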
I've read the website, but I couldn't find a hint that the engine is distributed. Even the Spark benchmarks compare a single instance with multiple nodes.
Is it distributed? How do I set it up in a distributed mode?
Does it support nested Parquet (something that even Spark itself struggles to support inside SQL)?
In summary, you get snappy, interactive query speeds on large data sets. I've run that locally and the results are pretty amazing compared to Postgres or even Tableau in-memory.
OmniSci transparently caches data across the memory of the CPUs and GPUs on a server, so after the initial read, it is likely that the data for subsequent queries will be in memory.
We've also optimized our storage formats and multithreaded our disk reads, such that we can easily hit many gigabytes per second on flash storage. Plus, new persistent memory technologies like Intel Optane will enable even more instant reads from "cold" storage.
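As a rough illustration of the caching behavior described above, here is a toy two-tier cache in plain Python. It is only a sketch under simple assumptions (least-recently-used eviction, fixed slot counts) and says nothing about how OmniSci actually implements its buffer management. TieredCache and its parameters are hypothetical names.

```python
from collections import OrderedDict


class TieredCache:
    """Toy model: small fast tier ("GPU"), larger tier ("CPU"), then disk."""

    def __init__(self, load_from_disk, gpu_slots, cpu_slots):
        self.load_from_disk = load_from_disk
        self.gpu_slots = gpu_slots
        self.cpu_slots = cpu_slots
        self.gpu = OrderedDict()  # oldest entry first
        self.cpu = OrderedDict()
        self.disk_reads = 0

    def get(self, key):
        if key in self.gpu:
            self.gpu.move_to_end(key)  # mark as most recently used
            return self.gpu[key]
        if key in self.cpu:
            value = self.cpu.pop(key)  # promote from the CPU tier
        else:
            value = self.load_from_disk(key)  # cold read
            self.disk_reads += 1
        self._put_gpu(key, value)
        return value

    def _put_gpu(self, key, value):
        self.gpu[key] = value
        if len(self.gpu) > self.gpu_slots:
            # Demote the least recently used chunk to the CPU tier.
            old_key, old_value = self.gpu.popitem(last=False)
            self.cpu[old_key] = old_value
            if len(self.cpu) > self.cpu_slots:
                self.cpu.popitem(last=False)  # falls back to disk only
```

The point of the sketch is the access pattern: the first query for a chunk pays for a disk read, and later queries are served from whichever memory tier still holds it.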
CUDA by itself brings easy-to-run parallel algorithms.
It's not of much value for databases unless you have a proper infrastructure set up to use it correctly. Same is true for columnar aspects, for example.
People have been building columnar databases to do analytics quickly. GPUs (with CUDA) can run analytics operations (think join, group by, math, sorting) on columnar data in a much more efficient manner. They're designed for operations on vectors, which columns are.
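A small pure-Python sketch of what "columns are vectors" means in practice. It runs on the CPU and uses no CUDA; it only shows that once data is pivoted into columns, a group-by touches just the two arrays it needs, which is the shape of work that maps well onto vector hardware. The sample rows and the helper names to_columns and group_sum are invented for this example.

```python
from collections import defaultdict

# Row-oriented records, as an OLTP-style store might hold them.
ROWS = [
    {"region": "eu", "amount": 10.0, "note": "first"},
    {"region": "us", "amount": 7.5, "note": "second"},
    {"region": "eu", "amount": 5.0, "note": "third"},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list (vector) per column."""
    columns = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)


def group_sum(keys, values):
    """GROUP BY keys, SUM(values) over two equal-length columns."""
    totals = defaultdict(float)
    for key, value in zip(keys, values):
        totals[key] += value
    return dict(totals)


columns = to_columns(ROWS)
# Only the "region" and "amount" vectors are read; "note" is never touched.
totals = group_sum(columns["region"], columns["amount"])
```

On a GPU the loop in group_sum would be replaced by a parallel reduction over the same two arrays, but the data layout is the part that makes that possible.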
We've been doing this ourselves too with SQream DB: https://sqream.com. It's an enterprise data warehouse with GPU acceleration. We use CUDA exclusively too.
PG-Strom is a GPU accelerator extension for PostgreSQL which has been around for a few years now. I have not tried it myself... http://heterodb.github.io/pg-strom/
You can try it out yourself here https://colab.research.google.com/drive/1r7S15Ie33yRw8cmET7_...
Or use dockerhub https://hub.docker.com/r/blazingdb/blazingsql/