Hacker News
by chaxor 1547 days ago
One thing I would love to see from the arXiv sites is a publicly available download of an SQLite database. They have a bunch of PDFs and LaTeX source, but the real killer would be a database with just the text for each section, plus the ability to generate* the PDF in various different styles. This would save an enormous amount of space and make things far more tidy. I suppose the images could be stored in the SQLite database as blobs, but there's probably a better way with vector DBs or something.

That's what the future will probably look like: the SQLite database decentralized over IPFS or torrents, where each computer stores only the data for the queries it has run, so more popular queries load faster (more peers).

*(Or maybe an archive of a ton of zstd-compressed Parquet files for each table? Not sure yet what the best way to organize several tables in Parquet is.)
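
To make the idea above concrete, here is a rough sketch of what such a database might look like, using Python's built-in sqlite3 module. Everything in it is an assumption for illustration: the table names (paper, section, figure), the column layout, and the "demo-0001" identifier are all made up, and nothing like this is actually published by arXiv as far as this thread says.

```python
import sqlite3

# Hypothetical schema: one row per paper, one row per section of text,
# and images stored as BLOBs (one of the options floated in the comment).
SCHEMA = """
CREATE TABLE paper (
    id    TEXT PRIMARY KEY,      -- made-up identifier, not a real arXiv id
    title TEXT NOT NULL
);
CREATE TABLE section (
    paper_id TEXT NOT NULL REFERENCES paper(id),
    position INTEGER NOT NULL,   -- order of the section within the paper
    heading  TEXT NOT NULL,
    body     TEXT NOT NULL,
    PRIMARY KEY (paper_id, position)
);
CREATE TABLE figure (
    paper_id TEXT NOT NULL REFERENCES paper(id),
    name     TEXT NOT NULL,
    data     BLOB NOT NULL,      -- raw image bytes
    PRIMARY KEY (paper_id, name)
);
"""


def build_demo_db():
    """Create an in-memory database holding one placeholder paper."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO paper VALUES (?, ?)",
        ("demo-0001", "A Placeholder Paper"),
    )
    # Inserted out of order on purpose; `position` restores the order.
    conn.executemany(
        "INSERT INTO section VALUES (?, ?, ?, ?)",
        [
            ("demo-0001", 2, "Method", "Placeholder method text."),
            ("demo-0001", 1, "Introduction", "Placeholder introduction text."),
        ],
    )
    conn.execute(
        "INSERT INTO figure VALUES (?, ?, ?)",
        ("demo-0001", "fig1.png", b"\x89PNG placeholder bytes"),
    )
    conn.commit()
    return conn


def sections_in_order(conn, paper_id):
    """Return (heading, body) pairs for one paper, in reading order."""
    rows = conn.execute(
        "SELECT heading, body FROM section WHERE paper_id = ? ORDER BY position",
        (paper_id,),
    )
    return rows.fetchall()
```

Calling sections_in_order(build_demo_db(), "demo-0001") gives back the section text in reading order, which is the part a renderer would need. Whether BLOBs are the right home for figures is exactly the open question raised above; this just shows that SQLite will accept them.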

1 comment

> This would save an enormous amount of space, and make things far more tidy.

Why? The output PDF is typically smaller than the input that produces it. Serving rendered PDFs seems simple and very natural, and at worst storing them alongside the source takes twice the total space.

Instead of storing both the PDF and the source text, just store the source. The PDF is generated on demand, in whatever style you like.
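
As a rough illustration of "store the source, pick the style later", here is a minimal Python sketch that assembles LaTeX source from stored section text under a chosen style. The STYLES table and the render_latex function are hypothetical names invented for this example. It only builds the .tex text; actually producing a PDF would mean running a LaTeX engine on the result, which is not shown.

```python
# Hypothetical style table: each "style" is just a different preamble line.
STYLES = {
    "article": r"\documentclass{article}",
    "twocolumn": r"\documentclass[twocolumn]{article}",
}


def render_latex(title, sections, style="article"):
    """Build LaTeX source from (heading, body) pairs in the given style.

    Assumes the stored text is already valid LaTeX; no escaping of
    special characters is attempted here.
    """
    lines = [
        STYLES[style],
        r"\title{" + title + "}",
        r"\begin{document}",
        r"\maketitle",
    ]
    for heading, body in sections:
        lines.append(r"\section{" + heading + "}")
        lines.append(body)
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"
```

The same stored sections then yield a single-column or two-column document depending only on the style argument, e.g. render_latex("Demo", [("Introduction", "Hello.")], style="twocolumn").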

That said, I had no idea PDFs were smaller than the input; I actually thought they were substantially larger. Regardless, storing things twice is wasteful.

For the ability to modify (forge?) the contents, you need the sources.