Hacker News new | ask | show | jobs
by anonymouzz 2733 days ago
I wish more people would publish data as SQLite databases (if the size permits, of course; usually it does). It's so much more reliable than CSVs, which have at least a few dimensions of significant differences (quoted/unquoted, comma vs semicolon vs tab vs space, headers/no headers, comments). Not to mention that initial data exploration can be done right in an SQLite explorer/browser tool.
12 comments

I keep harping on about this at work - a sqlite file carries its schema with it so you can inspect it. With foreign keys, notnull, check and other constraints you can easily make out what the data is about. There is a driver in every language it seems, and if not the docs are very good so if you are handy with FFI it is easy to build one. It can be far more compact than XML (gulp! SOAP) with the equivalent amount of data as the amount of data gets larger. Being a single file, it can be sent over the wire like any other. And you use SQL to interface with it.

Currently building software for the ATO (Single Touch Payroll) which uses SBR (https://en.wikipedia.org/wiki/Standard_Business_Reporting). The SBR project is listed as having on-going problems, and has cost the ATO ~$AUD1b to date (https://en.wikipedia.org/wiki/List_of_failed_and_overbudget_...). One of the reasons cited is that it uses XBRL (https://en.wikipedia.org/wiki/XBRL). Now imagine if it used sqlite...

There's a fairly large ecosystem of tools for creating and processing financial data in XBRL which doesn't exist for SQLite. All the Big 4 handle XBRL already - I know because I write the software they use. It's quite straightforward to take a Word document, for example, and turn it into XBRL; we even use machine learning to automatically tag the tables.

I can easily imagine how painful it is for you to process XBRL from scratch, but it's not crazy to exploit the existing infrastructure.

Of course if you give a project to IBM I wouldn't be surprised if it costs a billion dollars, especially given they know roughly nothing about XBRL...

I'd never heard of XBRL, looks very interesting. It seems it's primarily used in financial reporting environments though. Is it suitable for general purpose reporting as well?
In principle yes, it was intended for general business reporting.

Essentially each filing (called an instance) consists of a series of 'facts', each of which reports a single value and some metadata, and footnotes, which are XHTML content attached to facts. Fact metadata includes dimensions, which can specify arbitrary properties of a fact. So a fact might be e.g. 'profit' with metadata declaring it's in 2018, in the UK, and on beer, but all of those aspects would be defined by a specific set of rules called a taxonomy. You can create a taxonomy for any form of reporting you want.

There's also a language, XBRL Formula, which allows taxonomies to define validation. It allows something semantically similar to SQL queries over the facts in an instance, with the resulting rows being fed into arbitrary XPath expressions.

Unfortunately the tools for working with XBRL are mostly quite expensive, which probably limits its application outside finance. Arelle is a free and fairly standards compliant tool that will parse, validate and render XBRL and even push the data into an SQL database, but it's written in Python and isn't very performant. (Although it's probably good enough for most uses since it's used as the backend for US XBRL filing.) I'm not sure if there are any open source tools to help with creating instances.

Also creating a taxonomy itself is quite challenging. There are (expensive) tools to help, and using them it's still quite challenging. For real-world taxonomies it usually involves a six or seven figure payment to one of the few companies with the right expertise.

XBRL is the format the SEC (security and exchange commission, the government "accountants of last resort" that check publicly traded companies). This means there's regular XBRL files (every quarter) for every large publicly traded US company.

Here's Google's latest: https://abc.xyz/investor/static/documents/xbrl-alphabet-2018...

Please do keep in mind that this is a sort of XML key-value database with a number of keys standardized, and some level of rules defined that say "if a bank lends Volkswagen money and it uses 12% of that money to Google to run ads with a repo clause, you enter add A to value X and B to value Y". In other words, there's rules that define how complex financial data is entered into those standardized values. To find those rules, there's a SEC textbook that you wouldn't wish on your worst enemy, nothing about that in the files themselves.

There's a large directory at the SEC with the quarterly XBRLs for all US publicly traded companies. Used to be accessible over FTP until a little over a year ago.

Here it is: https://www.sec.gov/Archives/edgar/

XBRL files exist for all forms to be filed with the SEC. The ones you probably want are the 10-Q and 10-K ones (q = quarterly, k = no idea, but somehow means yearly)

(of course there's an entire industry of accountants essentially about hacking those rules, and therefore the meaning of those files. So let me give you a free quick 5 year experience in financial analysis: search for "GAAP vs non-GAAP", read 2 articles, decide the conspiracy theorists are less informative than the government, and just believe the SEC is at least trying. That doesn't mean nobody's lying, but GAAP vs non-GAAP is generally not what they're lying about)

As much as I love SQLite, and while it is open source, it is a single implementation that AFAIK has no published open specification. The only way to read an SQLite file is using SQLite, and in that respect, it is for many users just as closed as wrapping something in a word document.

CSV isn't perfect, but it provides a ton of flexibility, for example, CSVs can be streamed or support parallel segmented download across a network with useful work possible during the transfer. The format is so simple that it can approach almost free to parse (see e.g. my own https://github.com/dw/csvmonkey ).

CSV is also distinguished in that regular home users with spreadsheet programs can usually do most things a developer can do with the same file. For me user empowerment trumps all other goals in software, including warts. Things like JSON, XML or SQLite definitely don't fit in that category, although I guess SQLite is at least better due to the wide availability of decent GUIs for it.

Finally as a data transfer format, SQLite has the potential to be massively inefficient. Done incorrectly it can ship useless indexes that can inflate size >100%, and even in the absence of those, depending on how amenable the data is to being stored in a btree and the access patterns used to insert it, can leave tons of wasted space inside the file, or AFAIK even chunks of previously deleted data.

>"As much as I love SQLite, and while it is open source, it is a single implementation that AFAIK has no published open specification. The only way to read an SQLite file is using SQLite, and in that respect, it is for many users just as closed as wrapping something in a word document."

That's an extreme position to take, particularly since the SQLite code is public domain. Furthermore it's one of the formats recommended by the Library of Congress for archival/data preservation:

https://www.loc.gov/preservation/resources/rfs/data.html

https://www.sqlite.org/locrsf.html

> The only way to read an SQLite file is using SQLite

This part unfortunately isn't a position, it's absolute. It's hard to imagine a situation where as a developer we would not have access to a C runtime or for any reason whatsoever would not be able to use SQLite, but the hard dependency on its code is real, and represents a real hazard in the wrong environment. A super easy example would be parsing data on say, a tiny microcontroller on an IOT device. This can start to hurt quickly:

> Compiling with GCC and -Os results in a binary that is slightly less than 500KB in size

Open formats at least give you the option of implementing whatever minimal hack is necessary to finish your job without say, introducing some intermediary to do an upfront conversion, and at least for this reason SQLite cannot really be considered a perfectly universal format

>> The only way to read an SQLite file is using SQLite

> This part unfortunately isn't a position, it's absolute.

It's also false. I know of SQLJet which is a pure java implementation, there may be others. But in the end, the SQLite format being well defined and documented [1], a sure-fire way to read an SQLite file is writing the code to read an SQLite file. Since SQLite a rock-solid, public domain, portable C library, it might not be the best idea to do that, but it is completely feasible. No one stops you from "implementing whatever minimal hack is necessary to finish your job" while using the SQLite format.

[0] https://sqljet.com

[1] https://www.sqlite.org/fileformat.html

> A super easy example would be parsing data on say, a tiny microcontroller on an IOT device.

I mean, it's a point, but I don't think anybody is saying that SQLite should replace all data storage formats everywhere. If you're just storing a few dozen short text strings with keys, plain text is fine. I don't think you'd want to have a JSON parser on a tiny microcontroller either.

By 'position' I was very obviously referring to

>"as closed as wrapping something in a word document."

Sure CVS makes it trivial to waste your time reinventing the wheel, making your own parser. The situations were there are technical limitations that prevent the use of sqlite are becoming vanishingly rare. (Not to mention the resources necessary to use sqlite is unrelated to how many implementations there are or whether it's 'open' or 'closed'.)

1) For practical purposes the vast majority of platforms that are inappropriate for SQLite are not appropriate for CSV.

2) SQLite has no standard. The same is true for CSV in practice. At least SQLite has a high quality reference implementation.

3) It’s a shame that SQLite doesn’t have a standard of some sort.

It's not extreme. It's the whole reason why WebSQL failed.
Do you have a source for that claim? Mozilla's 2010 justification for dropping WebSQL was that SQLite doesn't implement a particular standard of SQL (incidentally no SQL implementation does, SQL is notorious for that, every vendor implements something slightly different with parts of the various standards excluded and other features included.)

But more to the point, Mozilla's gripe was with SQLite's API, not with the file format used by SQLite. I can't find any source for the file format used by SQLite being the problem that tripped up WebSQL.

"AFAIK has no published open specification. The only way to read an SQLite file is using SQLite"

It's documented pretty well: https://www.sqlite.org/fileformat.html

I suppose that doesn't make it an open standard, but it hasn't changed much.

Not sure what an open standard gets you above what SQLite already is. You can’t get much more open than that.
That's basically the non-subtle version of what I was trying to convey :)
I had never seen this before! Thanks
You argue bases on a good principle. But while the principle of open well documented standard is good one, it's just a principle. The goal is portability, consistence and long term stability.

SQLite is a Recommended Storage Format for datasets according to the US Library of Congress. https://www.loc.gov/preservation/digital/formats/fdd/fdd0004...

You can open a SQLite file as text and read it and be able to discern data and tables.

Not saying it's useful but you saying "only way to read SQLite is SQLite" is hyperbole.

>...it is for many users just as closed as wrapping something in a word document.

Which users? I’m curious.

Cross platform, fully documented, Fred software, multiple implementations vs closed proprietary single platform commercial software. Really?

When I open my data I usually publish both CSV dumps (for those who prefer using awk/sed and command line tools; this is usually easier) and MariaDB reconstructive files. I'll consider SQLite databases from now on, I like that idea.
That is a truly fantastic idea. Especially if Excel added file format support for it directly to open it within excel with a double click, one table per sheet within the workbook. I know, not the ideal way to use SQLite, but that would ensure easy adoption by the average person that doesn't have SQL or relational database knowledge. (And would allow them to get a bit of that relational knowledge via power query)

Some might think Access would be the more obvious choice, but for the average user Access is not an accessible tool (we don't even install it as part of the Office suite where I work) and they're much more comfortable in Excel.

I've probably said this around here before, but I wish people trade SQLite files instead of Excel workbooks. Excel (and the like) can operate on top of SQLite. This way we have data portability, data integrity, and ease of data access.
Read only, or Excel would have to enforce constraints. But, yes, would be nice
The problem is data corruption. If a CSV is corrupted, then I could at least parse part of the data. For a corrupted SQL file, I'm done. Also, diff is not working for binary format, and it is more difficult to trace change for SQL format.

In this sense, I prefer a SQL dump file.

What is the source of the corruption. Unlikely to be disks nowadays but people should take backups of things. Very unlikely to be a SQLite buggy write but again a backup could save you. Worse case I’m sure there are tools to recover a corrupted file.
There are plenty of things that can cause it: https://www.sqlite.org/howtocorrupt.html
Nice of them to publish a guide for those who enjoy corrupted data :)

I particularly liked Fake capacity USB sticks. I didn't even realize that was a think. I can't even...

Oh this goes way back. The first SSD drives from Hong Kong advertised double or quadruple their capacity - you didn't find out until you tried to write the N+1th block and it overwrote the 0th block. Back in the 90's?
Sqlite's sqldiff might be an okay replacement for diff in many cases - https://sqlite.org/sqldiff.html
Reminds me of this HN submission: https://news.ycombinator.com/item?id=16809963

Apparently CSV is actually quite hard to parse.

It's not that CSV is hard to parse. It's that there's no guarantee that you'll get proper CSV. For example, it may literally be "CSV", without quotes. And that's fatal if values contain commas. I've even seen CSV with values that contain ``","``!
RFC4180 [1] standardizes CSV, but there are many implementations that don't read this 100% and unfortunately even more (including an extremely popular spreadsheet application) that don't write it.

If you are including CSV functionality in something you work on, please read and follow this (tiny) spec!

[1] https://tools.ietf.org/html/rfc4180

Whether Excel writes standard CSV or not depends on the user's locale settings. E.g. with a German locale you get a semicolon (;) as a separator which you can only change system-wide. However, apart from the changed seperator, it's still standard CSV, which still works with Python or SQLite (.separator ; .import foo.csv foo).

A bigger issue is that Excel tends to write large numbers in scientific notation, which is a common issue handling price lists. E.g. it'll turn EAN numbers into 6.2134e+11, losing most of the number. Then you have to go back to the XLS file and change the column type into text and exporting it again as CSV. As this is lossy you can't fix it when receiving such a file.

Something like the SQL Server Import/Export Wizard but being able to write SQLite files would be very handy.

Fun fact - we noticed SQLite wasn't RFC-compliant for it's CSV output (it used native lineendings, not CRLF, which is mandatory).

It is fixed now... but I'm now wondering whether SQLite wasn't more correct in the first place...

The worst part about CSV is the locale-specific format used by some programs. For example when exporting to CSV a German Excel will use the comma as the decimal separator in numbers and a semicolon for the value separator. So you need to specify the exact format for each individual file you want to import.
Back when I worked at an NLP company, my boss used to say that he was "morally opposed to CSV".

When sharing tabular data in text format, he always preferred TSV because commas were everywhere in the material we were working with, but tab characters were really rare.

I've always been curious about the characters in ASCII for this, but I've never seen them used in the wild. Stuff like "Group Separator" (0x1D), "Record Separator" (0x1E) or "Unit Separator" (0x1F)

Is there a reason why nobody uses these? Did someone work out back in the 90s they were pure evil and we've just never used them since?

https://www.lammertbies.nl/comm/info/ascii-characters.html

Those codes were originally for just such a purpose, but as others point out there was a bootstrapping problem. At this point, if Excel doesn’t support it, it’s not going to gain traction.

About 10 years ago I had to use those for a financial system integration. I was getting files that had been created on a mainframe, and whoever wrote it originally had the foresight to use those characters. Probably because they were based out of EMEA and understood that commas weren’t useful across national borders.
The codes between x01 and x1F were designed for telecommunications, for instance binary custom formats over RS-232 or synchronous protocols.

There is an excellent description in "C Programmer's Guide to Serial Communications" by Joe Campbell.

There’s no physical key for them. You can trivially write/edit CSV or TSV on any computer using any editor.
Indeed. I used to clean up data in UltraEdit. With the option for direct disk read/write set, there was virtually no file-size limit.
When I was working as a forensic data analyst, I felt pretty much the same. If you request CSV in discovery, there's no telling what you'll get. I mean, the data may come from custom COBOL, and then get reviewed by someone using Excel.

So yes, TSV. However, I've seen TSV with spurious tabs :(

Sometimes I ended up pushing for |-delimited data. Or even fixed-width format :)

> I've even seen CSV with values that contain ``","``!

Are you sure it wasn't an injection attempt of some kind?

It could have been, I suppose.

But more likely is twisted creativity. It seems that some businesses are still using ancient systems, based on COBOL, AS/400, etc. There's resistance to changing legacy code. So when business changes require additional data fields, fields sometimes get subdivided. So a field that originally contained stuff like |foo| now contains stuff like |"foo","bar,baz"| or whatever. That works, because there's nothing like CSV in the data system. But when someone tries a CSV export, you get garbage.

Newlines in the middle of a quoted field will cause problems for a lot of tools. And Python's csv.DictReader gives an error when a delimited file has too few fields.

But one can usually work around this.

It's hard enough that you should always use the library that handles the 5 weird cases, of which you'll only think of 3.

I made the mistake once and learned.

In several companies, the interview assignment is indeed to implement a CSV parser.
I'd rather have a proper text format which I can inspect without a highly specialized tool. In 50 or 500 years it will be hard to hunt down the right tools to open an SQLite file, build SQLite from sources (if they still exist), or reverse-engineer the file format. For a text file you only need to reverse-engineer the ASCII encoding (or maybe UTF-8).
I wish webpages were 'archived' as sqlite databases. :x

I wish a lot of metadata were defined as a database schema, and sqlite lends itself so willingly to becoming the archive/header.

Does sqlite do internal gunzip compression?

I do understand we have MHTML:

https://en.wikipedia.org/wiki/MHTML

WARC is the "standard" now for web archiving.
What about hdf5? I'm starting to use it as an in-memory, parsed cache representation for data stored in like a plaintext file document database. With python's PyTables and visidata acts as the independent data explorer app.
I like HDF5 because it's compatible with everything, but it's a bit more difficult to get started with than other modern formats. Partly this is due to design by committee including everything, and the APIs follow suit. With some more modern APIs I think it could come back as an archival format.
Hdf5 files are good if the data is write once, and numeric unless things have changed over the last fives years. From my recollection, you can’t append data so you have to rewrite the whole table and text fields are fixed width.
What size restrictions are you worried about ?

*Note I work with software that regularly stores terabytes of data in sqlite.

Heh I was thinking that > 1TiB of data in an sqlite db is a bad idea :).

How do you process that? The restrictions I had in my mind come from that it's infeasible to concurrently process a large dataset in mapreduce/bigquery style.

There are many different concurrence modules for sqlite. I can't see why you would have any problems with what you describe.

The systems I work with use it for backup data. We have many readers and writes. Some of those export a iscsi daemon that represents a block device from the backup that is then booted form.

spelling :(
Reminds me I have to extract data from an 9x day windows application. It's a jetdb that refuses to load in the first viewers I could find. Made me wish it was sqlite (although it didn't exist at the time..)
An issue there is the entry level. For CVS a common tool to read and analyse the date exists: Excel.

For sqlite there's hardly an approvable tool for analysis of the data.

Excel is driving the world.

Well, my personal rule is to never touch csv with excel, because depending on the locales, most of the time excel breaks something in the file. (Of course, in theory I might make a flawless data type mapping when importing the file to excel, but unfortunately that happening is quite rare...)
This is a concern and there are lots of issues (also think about proper escaping etc.) but Excel is ubiquitous outside hackernews's demography.