| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by jonas_b 1915 days ago
	Question, if I have a CSV file that I'd like to do some quick SQL queries on before moving the results into Pandas. What would be good resource to do this? Preferably compatible with the rest of the Python-dataframe ecosystem and as simple as pd.read_csv()

10 comments

wswope 1915 days ago

SQLite is what you're looking for. If you want to use the CLI,

  .format csv
  .import <path to csv file> <table name>

Alternatively, read your data into pandas and there's extremely easy interop between a DBAPI connection from the python standard lib Sqlite3 module and Pandas (to_sql, read_sql_query, etc.).

link

thamer 1915 days ago

I would second this suggestion. I was querying CSV data imported into SQLite this weekend, and it was extremely easy to get started. SQLite is pretty snappy even for tables with millions of rows.

SQLite supports defining columns without a type and will use TEXT by default, so you can take the first line of your CSV that lists the document's dimensions, put those in the brackets of a CREATE TABLE statement, and then run the .import described above (so just CREATE TABLE foo(x,y,z); if x,y,z are your column names).

After importing the data don't forget to create indexes for the queries you'll be using most often, and you're good to go.

Another suggestion for once your data is imported, have SQLite report it in table format:

    .mode column
    .headers on

link

simonw 1915 days ago

If you want to try it in SQLite (pros: no need to run a server or install anything since it's in the Python standard library, cons: not nearly as many advanced statistical analysis features as PostgreSQL) my sqlite-utils CLI tool may help here: it can import from CSV/TSV/JSON into a SQLite database: https://sqlite-utils.datasette.io/en/stable/cli.html#inserti...

link

saltcured 1915 days ago

The sqlite3 connection object in Python allows you to register callables which you can use as scalar or aggregate functions in your SQL queries. With this, you can fill some of the gaps compared to PostgreSQL by essentially importing Python libraries. I just found this nice tutorial while looking for relevant docs:

https://wellsr.com/python/create-scalar-and-aggregate-functi...

However, I think the limited type system in SQLite means you would still want to extract more data to process in Python, whether via pandas, numpy, or scipy stats functions. Rather introducing new composite types, I think you might be stuck with just JSON strings and frequent deserialization/reserialization if you wanted to build up structured results and process them via layers of user-defined functions.

link

rahulj51 1915 days ago

Try duckdb to query and even transform the data and then export to a pandas df.

link

vincnetas 1915 days ago

http://harelba.github.io/q/

q "SELECT COUNT(*) FROM ./clicks_file.csv WHERE c3 > 32.3"

It uses sqlite under the hood.

link

arusahni 1915 days ago

In addition to the recommendation for sqlite, I've found `csvkit` to be an extremely useful set of CLI tools for CSV munging. The `csvsql` [1] entrypoint is especially handy because it allows you to issue SQL queries against your CSV file directly vs. loading then querying.

1: https://csvkit.readthedocs.io/en/latest/scripts/csvsql.html

link

bruiseralmighty 1915 days ago

I keep a non-commercial installation of SQLServer on my machine for this reason, but this is likely overkill for most purposes and requires a Windows machine.

It does have some nice import features for CSV data though.

link

pasta_strainer 1915 days ago

Try https://bit.io/. You can drop a CSV file and query the data with SQL.

link

ithrow 1915 days ago

You can easily import the csv file into sqlite and you don't even have to create the table fields beforehand

link

zabzonk 1915 days ago

Windows provides an ODBC driver for CSV files - I don't know how this would play with Pandas.

link

Vaslo 1915 days ago

Try SQL Alchemy?

link