Hacker News
by userbinator 2593 days ago
Base85 would probably be a better choice for storing binary as text, since it has a ratio of 5:4 instead of 4:3.
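To put numbers on the 5:4 versus 4:3 claim, here is a quick sketch using Python's standard `base64` module. The 1200-byte random payload is just an assumption chosen so that neither encoding needs padding; note that `b85encode` uses the Git-style Base85 alphabet, which may not be the variant the commenter had in mind.

```python
import base64
import os

# 1200 is divisible by both 3 and 4, so neither encoding pads its output.
payload = os.urandom(1200)

b64 = base64.b64encode(payload)   # 3 input bytes -> 4 output characters
b85 = base64.b85encode(payload)   # 4 input bytes -> 5 output characters

print(len(b64), len(b64) / len(payload))
print(len(b85), len(b85) / len(payload))

# Both encodings round-trip back to the original bytes.
assert base64.b64decode(b64) == payload
assert base64.b85decode(b85) == payload
```

For this payload Base64 produces 1600 characters and Base85 produces 1500, so the saving is roughly 6% of the encoded size.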

On the topic of "unusual and free large file hosting", YouTube would probably be the largest, although you'd need to find a resilient way of encoding the data since their re-encoding processes are lossy.

I like the "Linux ISO" and "1337 Docs" references ;-)

6 comments

Here is an implementation of arbitrary data storage using YouTube videos: https://github.com/dzhang314/YouTubeDrive
Wouldn’t YouTube re-encode the video and mess up the data?
If you look at the example video, the data is encoded in relatively large blocks that are easily recovered after compression.
You just need enough redundancy/error correction.

Back in the day there were systems to back up to VHS tapes and those were way more lossy than YouTube https://youtu.be/TUS0Zv2APjU

I love that this exists
I want to watch some! Got any urls?
Thank the gods for base85..
You'd be at the mercy of them potentially changing their encoding scheme unannounced and corrupting your files.
Back in the day of email gateways between different networks, there used to be terrible problems with all the tin-pot dictator IBM SYSADMINs at BITNET sites who maintained their own personal styles of ASCII<=>EBCDIC translation tables, so all the email that passed through their servers got corrupted.

EBCDIC-based IBM mainframe SYSADMINs on BITNET were particularly notorious for being pig-headed and inconsiderate about communicating with the rest of the world, and thought they knew better about the characters their users wanted to use, and that the rest of the world should go fuck themselves, and scoffed at all the unruly kids using ASCII and lower case and newfangled punctuation, who were always trying to share line printer pornography and source code listings through their mainframes.

"HARRUMPH!!! IF I AND O ARE GOOD ENOUGH FOR DIGITS ON MY ELECTRIC TYPEWRITER, THEN THEY'RE GOOD ENOUGH FOR EMAIL! NOW GET OFF MY LAWN!!!" (shaking fist in air while yelling at cloud)

It was especially a problem for source code. That was one of the reasons for "trigraphs".

https://stackoverflow.com/questions/1234582/purpose-of-trigr...

https://en.wikipedia.org/wiki/Digraphs_and_trigraphs

>Trigraphs were proposed for deprecation in C++0x, which was released as C++11. This was opposed by IBM, speaking on behalf of itself and other users of C++, and as a result trigraphs were retained in C++0x. Trigraphs were then proposed again for removal (not only deprecation) in C++17. This passed a committee vote, and trigraphs (but not the additional tokens) are removed from C++17 despite the opposition from IBM. Existing code that uses trigraphs can be supported by translating from the source files (parsing trigraphs) to the basic source character set that does not include trigraphs.

I always wondered what the purpose of trigraphs was, other than to help win obfuscated code contests haha
You'd want to make sure the coding scheme is in the identifiable visual data. Think QR codes.

Build in a bit of redundancy, and I think it would work.

It would probably work, yes. But I don't think many people want their backups powered via a service that "will probably work".
I don't think you want your backups in Google Docs either, given that Google may decide to ban you for TOS violations at any time.

I really do think videos would work, reliably, given sufficient redundancy. Again, we have QR codes already, so this is a proven idea. You can't make QR codes unreadable without removing lots of perceptual visual details. The risk, as with using Google Docs, isn't that Google will change their encoding, but that Google will just take down the videos for service misuse.

I think it would be comparatively more difficult for Google to detect this stuff in a video compared to a text document, because you expect some videos to be long and large. The entirety of the Encyclopedia Britannica comes out to less than 500 MB in a .txt document, so using any reasonable amount of space in a Google Doc should quickly raise red flags.

That would be tough if YouTube doesn't save the originals.
YouTube probably doesn't save the originals (though they could, perhaps on cold-storage tape drives). Even so, it's not difficult to imagine that at some point there may exist a compression algorithm that can be applied to existing compressed video and that flips a couple of bits in whatever encoding scheme you've chosen. Depending on the file type, that could be enough to corrupt the whole thing.

Sure you can get around this by adding ECC, but that isn't implemented here.

> Base85 would probably be a better choice

Base64 has the advantage of relative ubiquity (though Base85 is hardly rare, being used in PDF and Git binary patches). It also doesn't contain characters (quotes, angled brackets, ...) that might cause problems if naively sent via some text protocols and/or embedded in XML/HTML mark-up.

> YouTube ... you'd need to find a resilient way of encoding the data [due to lossy re-encoding]

That should be easy enough: encode the data as blocks or lines of pixels (blocks of 4x4 should be more than sufficient) using a small number of colour values. I expect you'd get away with at least 4 bits per channel per block given large enough blocks, so 4096 values per block. You should then easily be able to survive anything the re-encoding does by averaging each block and taking the closest valid value to that result.
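A minimal sketch of that block scheme, under loud assumptions: 4x4 blocks, 16 evenly spaced levels per channel, and a made-up `add_noise` function (bounded random jitter) standing in for YouTube's re-encoding. Real codec artifacts are not bounded uniform noise, so this only illustrates the averaging-and-snapping step, not actual robustness.

```python
import random

LEVELS = 16                    # 4 bits per channel
STEP = 255 // (LEVELS - 1)     # 17: levels sit at 0, 17, 34, ..., 255
BLOCK = 4                      # 4x4 pixels per block

def encode_value(value):
    """Map a 12-bit value to a flat list of BLOCK*BLOCK identical RGB pixels."""
    r, g, b = (value >> 8) & 0xF, (value >> 4) & 0xF, value & 0xF
    pixel = (r * STEP, g * STEP, b * STEP)
    return [pixel] * (BLOCK * BLOCK)

def add_noise(block, rng, amplitude=7):
    """Hypothetical stand-in for lossy re-encoding: jitter every channel."""
    return [
        tuple(min(255, max(0, c + rng.randint(-amplitude, amplitude))) for c in px)
        for px in block
    ]

def decode_block(block):
    """Average each channel over the block and snap to the nearest level."""
    value = 0
    for ch in range(3):
        mean = sum(px[ch] for px in block) / len(block)
        level = min(LEVELS - 1, max(0, round(mean / STEP)))
        value = (value << 4) | level
    return value

rng = random.Random(0)
ok = all(decode_block(add_noise(encode_value(v), rng)) == v for v in range(4096))
print("all 4096 values survived:", ok)
```

The jitter amplitude is kept below half the level spacing (7 < 8.5), which is why snapping to the nearest level always recovers the value here; a real pipeline would need to measure how far the codec actually moves block averages.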

Add some form of error detection+correction code just for paranoia's sake. You are going to want to include some redundancy in the uploads anyway, so you can combine these needs in a manner similar to RAID5/6 or the Parchive format that was (is?) popular on binary-carrying Usenet groups.
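As a rough illustration of the RAID5-style idea, here is a sketch of single-parity redundancy: XOR all chunks into one parity chunk, then rebuild any one chunk that is known to be lost. The function names are made up for this example, and it only handles one known-missing chunk of equal length; Parchive uses Reed-Solomon codes, which can recover from more losses.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(chunks):
    """Parity chunk = XOR of all data chunks (all must be the same length)."""
    parity = bytes(len(chunks[0]))
    for chunk in chunks:
        parity = xor_bytes(parity, chunk)
    return parity

def recover_missing(chunks, parity):
    """Rebuild the single chunk marked as None from the survivors plus parity."""
    missing = chunks.index(None)
    rebuilt = parity
    for i, chunk in enumerate(chunks):
        if i != missing:
            rebuilt = xor_bytes(rebuilt, chunk)
    return rebuilt

chunks = [b"abcd", b"efgh", b"ijkl"]
parity = make_parity(chunks)

# Pretend the second upload was taken down or corrupted beyond repair.
damaged = [b"abcd", None, b"ijkl"]
print(recover_missing(damaged, parity))
```

Note that this is erasure recovery, not error detection: you still need a checksum per chunk to know which one went bad.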

Would be cool to use the audio for some extra bandwidth too, get some Sinclair Spectrum-esque (albeit in stereo) bleeps to accompany the video.
A few years ago I also found a backup tool that converted backups to DV videos, so that you could write them to cheap DV cassettes. It fit something like 10 GB or more per cassette. Definitely not bad for a few years ago.
Just FYI, turn your volume way down when listening to these. Wouldn't be a good idea to blow your eardrums on this.
Why not yEnc? 1-2% overhead and it's been in use on Usenet for binary storage for a very long time.
The nice thing about yEnc is that it only has to escape NUL, LF, CR, and the escape character itself '=', so it essentially uses all but 4 of the 256 possible byte values.
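For the curious, the core byte transform can be sketched as below. This is only the escaping step as I understand it (add 42 to each byte, escape the four critical results with '=' and a further offset of 64); it deliberately leaves out line wrapping, the =ybegin/=yend header lines, and the CRC that a real yEnc file carries, and the function names are my own.

```python
# Encoded values that must never appear raw: NUL, LF, CR, and '='.
CRITICAL = {0x00, 0x0A, 0x0D, 0x3D}
ESCAPE = 0x3D  # '='

def yenc_encode(data: bytes) -> bytes:
    out = bytearray()
    for byte in data:
        enc = (byte + 42) % 256
        if enc in CRITICAL:
            # Emit the escape character, then the value shifted by 64.
            out.append(ESCAPE)
            enc = (enc + 64) % 256
        out.append(enc)
    return bytes(out)

def yenc_decode(data: bytes) -> bytes:
    out = bytearray()
    it = iter(data)
    for enc in it:
        if enc == ESCAPE:
            # The byte after '=' carries the extra offset of 64.
            enc = (next(it) - 64) % 256
        out.append((enc - 42) % 256)
    return bytes(out)

every_byte = bytes(range(256))
encoded = yenc_encode(every_byte)
print(len(every_byte), "->", len(encoded))
assert yenc_decode(encoded) == every_byte
```

On uniformly random data about 4 in 256 bytes need the extra escape character, which is where the 1-2% overhead figure comes from.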

While this works over NNTP, SMTP and IMAP (and possibly POP), I'm not sure if it will work over HTTP if any of the servers use the Transfer-Encoding header.

Just use Unicode for the optimal highest possible base 1,114,112!