Hacker News new | ask | show | jobs
by andrewstuart2 2 hours ago
I would call it clever. I'm not sure I'd call it genius.

When I'm working with data I'm working in a specific set of languages. Usually one. Yeah, other people might be working in other languages, but no individual author really needs a language-agnostic way of accessing data beyond compile time. Add to that the likely runtime boundaries that may need to be crossed instead of e.g. inlined by the compiler because it's in-language and dealing with known offsets or tags (depends on the data format of course). To the other commenter's point, am I going to have to sandbox all data access code just to be sure it's not able to do something unexpected? There's a lot of complexity here. And the inherent risk is going to slow down the operation that should be the simplest and fastest: interpreting bytes.

2 comments

A big problem with parquet, which this aims to replace, is that it's hard to add new encodings because everyone wants to stay compatible with old readers. Embedding the decoders in the file as WASM solves this problem since in theory, old readers will be able to read new files by just using the provided WASM to decode a column whose format the reader doesn't recognize.

So this is really about making a file that is forwards compatible in a way that lets you push the standards more than existing formats.

>no individual author really needs a language-agnostic way of accessing data beyond compile time.

That's so untrue! People need language-agnostic ways to access data all the time, and people work with data accessing them from multiple languages all the time!

If I have parquet files I can load them in duckdb, in pandas and polars, process them with various independent tools, and loads of other things... and people do that.

This is also why people like something like an SQL database, your data is not locked to some specific language / lib for access.