Hacker News new | ask | show | jobs
by jonas_b 1868 days ago
Question, if I have a CSV file that I'd like to do some quick SQL queries on before moving the results into Pandas. What would be good resource to do this? Preferably compatible with the rest of the Python-dataframe ecosystem and as simple as pd.read_csv()
10 comments

SQLite is what you're looking for. If you want to use the CLI,

  .format csv
  .import <path to csv file> <table name>
Alternatively, read your data into pandas and there's extremely easy interop between a DBAPI connection from the python standard lib Sqlite3 module and Pandas (to_sql, read_sql_query, etc.).
I would second this suggestion. I was querying CSV data imported into SQLite this weekend, and it was extremely easy to get started. SQLite is pretty snappy even for tables with millions of rows.

SQLite supports defining columns without a type and will use TEXT by default, so you can take the first line of your CSV that lists the document's dimensions, put those in the brackets of a CREATE TABLE statement, and then run the .import described above (so just CREATE TABLE foo(x,y,z); if x,y,z are your column names).

After importing the data don't forget to create indexes for the queries you'll be using most often, and you're good to go.

Another suggestion for once your data is imported, have SQLite report it in table format:

    .mode column
    .headers on
If you want to try it in SQLite (pros: no need to run a server or install anything since it's in the Python standard library, cons: not nearly as many advanced statistical analysis features as PostgreSQL) my sqlite-utils CLI tool may help here: it can import from CSV/TSV/JSON into a SQLite database: https://sqlite-utils.datasette.io/en/stable/cli.html#inserti...
The sqlite3 connection object in Python allows you to register callables which you can use as scalar or aggregate functions in your SQL queries. With this, you can fill some of the gaps compared to PostgreSQL by essentially importing Python libraries. I just found this nice tutorial while looking for relevant docs:

https://wellsr.com/python/create-scalar-and-aggregate-functi...

However, I think the limited type system in SQLite means you would still want to extract more data to process in Python, whether via pandas, numpy, or scipy stats functions. Rather introducing new composite types, I think you might be stuck with just JSON strings and frequent deserialization/reserialization if you wanted to build up structured results and process them via layers of user-defined functions.

Try duckdb to query and even transform the data and then export to a pandas df.
http://harelba.github.io/q/

q "SELECT COUNT(*) FROM ./clicks_file.csv WHERE c3 > 32.3"

It uses sqlite under the hood.

In addition to the recommendation for sqlite, I've found `csvkit` to be an extremely useful set of CLI tools for CSV munging. The `csvsql` [1] entrypoint is especially handy because it allows you to issue SQL queries against your CSV file directly vs. loading then querying.

1: https://csvkit.readthedocs.io/en/latest/scripts/csvsql.html

I keep a non-commercial installation of SQLServer on my machine for this reason, but this is likely overkill for most purposes and requires a Windows machine.

It does have some nice import features for CSV data though.

Try https://bit.io/. You can drop a CSV file and query the data with SQL.
You can easily import the csv file into sqlite and you don't even have to create the table fields beforehand
Windows provides an ODBC driver for CSV files - I don't know how this would play with Pandas.
Try SQL Alchemy?