Hacker News
Advice needed for backing up and hosting large amount of files
171 points by KingZil 1442 days ago
Some of you may know that Hong Kong has been attacked by China and most of the pro-democracy media companies have shut down.

A number of us have been working hard to back up whatever we can, and some overseas universities are thankfully willing to host it on their private library domains.

However, I would also like to host these files on a public domain, to ensure more people can revisit these events in the future.

I am considering starting a very basic category-based (e.g. event-based) website with text and picture files, while the video files (close to 6 TB) will be uploaded to censorship-resistant sites. Each page would also have a torrent file containing all of the files on that page.

It will be run on my own funds and donations in Monero.

What options do I have? I am guessing it wouldn't be safe for me to self-host, given the nature of these files.

If someone can give me some pointers, that would be great.

Thanks!

34 comments

Please allow me to offer you a free rsync.net account, in perpetuity, for the backup portion of your requirements.

We don’t do hosting of any kind so you’ll have to secure that elsewhere.

Just email info@rsync.net to discuss.

I just want to say, mysterious ’rsync’ person, your evangelism of your company is top notch. From this and other comments you have made on this site, rsync.net is a service that I will sign up for without hesitation the instant I have offsite backup needs.
Would it be against your ToS if the OP hosted a small $5 Hetzner server that mounted your FS with sshfs and hosted files with a simple proxy?
No.

I will say that that sounds fragile and non performant but … at the same time, the whole point of our service is to be a “dumb” primitive - in the classic Unix sense.

So, of course this is within our terms and we see plenty of folks doing things like this.

Don't use hetzner for shit like this. They're EXTREMELY strict on abuse.
Anecdata: Hetzner does not care what files are on their servers as long as you don't also serve them publicly from their servers. So if they only use it as a network storage for another server, it should be fine.
How is this abuse? You're paying hetzner for a VPS. If running a proxy server is an abuse of their TOS, then in my mind their service doesn't seem much more than a scam. I'm glad to know to avoid their service.
If this is HK/CN related stuff, I'd anticipate Hetzner to be drowned in torrential rains of fake abuse emails.
I think this is the first time I've seen a Pleroma link "in the wild" without me searching for it.

I am very happy it's catching on. Cheers!

Whoa that’s a cool trick!

It’s important to note, however, that sshfs has been archived by its creator. What this entails with regard to the future of the project is uncertain though.

I'm not sure what protocols rsync.net exposes, but if there are others besides SSH, they're probably supported by rclone.
Hi!

Is there a way to get a test account/download file to test bandwidth/latency? On top of this, are there plans for nodes closer to or preferably in Australia?

By the way, your personal website's certificate expired today :)

It's lovely to see things like this.
I am looking to wean off Dropbox. Do you know if Syncthing in combination with rsync is a good combo?
I forget - isn’t syncthing straight up SFTP ?

Should work perfectly…

Usually every node that shares a folder needs to have Syncthing installed.

The only way to make it work with rsync.net would be to mount it via sshfs on a machine running Syncthing.

Not sure if the performance will be good.

Maybe with rclone?

I have been a happy rsync customer at several companies, so this makes me happy.
Please upload to https://ipfs-gateway.cloud/

1. This hosting is based on IPFS, which means the data is immutable and cannot be censored; you can retrieve the data from another gateway in this list: https://ipfs.github.io/public-gateway-checker/. You can find more details about IPFS here: https://ipfs.io/

2. The owner of this hosting has donated his storage to the community and you can upload for free. This was originally posted on Reddit at: https://www.reddit.com/r/ipfs/comments/v0bnd1/ipfsgatewayclo.... He has approximately 13.7PB of storage =)))

After uploading, you might want to save all the CIDs of your data and share them with your people (just like a URL)

IPFS is a good solution in theory, but in my experience it works so poorly in practice that I wouldn't trust it as my only form of archival. Certainly add files to IPFS too if you want, but not exclusively.
I second this and for anything other than "need to fetch a file or two occasionally" it's essentially broken in my mind. Expect to wait anywhere from 3 - 30 seconds for first byte. When modern CDNs are doing first byte in 100ms (or whatever) over HTTPS that's an eternity for most users.

Virtually all IPFS storage and bandwidth from pinning and gateway providers is at least 2-3x what you'd expect to pay for S3, etc (because they just use S3 on the backend). The public IPFS gateways have such low request limits and especially poor performance they often struggle to load a website with more than a couple of IPFS hosted assets on it.

If you want to run your own node go-ipfs is extremely difficult if not impossible to use at anything more than toy scale. Eats RAM like crazy, uses a TON of bandwidth (not unexpected but still seems like a lot), garbage collection is broken, and much more...

Frankly, the state of IPFS is embarrassing considering it's seven-year-old tech.

For the delay do you mean via the web gateways, or directly over IPFS?

For storage, most of the Filecoin-backed services (like http://web3.storage, https://nft.storage, and https://estuary.tech) are either free or close to that for most use cases. Have you looked at those recently?

I hear you on go-ipfs, but there are now multiple implementations, including rust-ipfs, that are also getting there.

Is Dat/hypercore or other similar alternative more reliable in practice? Are torrents better? (But wouldn't tracker hosting be a "chicken vs. egg" issue?)
Hmm, I remember Dat being a bit more reliable when I tried it, but I have used IPFS way more. Torrents are good too, but you can't update them, which might be an issue.
ipfs is transport, not storage.
exactly.
Have you looked at the Internet Archive or Archive Team?

The Internet Archive provides free storage (last I knew from ~2014). https://archive.org/services/docs/api/ias3.html

Thanks, this is actually what we used for the initial backup, but what we are thinking about for the future is how to get the information out, because in a few decades the names of these media outlets may be forgotten altogether. We also need to host the video files, which can be tricky there.
If you want to stream the video... you need hosting in a neutral country like Switzerland and you will pay a lot for it, especially if it gets any attention at all and the unfriendly government starts to DDoS it. You probably want to have it as a set of torrents and mirrors, and pop it up in different locations. Don't rely on a single locale. Again, 6 TB is not a lot. Make a site with nothing but links; hold the data in mirrors and torrents. Also, only pay for hosting in crypto and use fake EVERYTHING: fake name, fake IP, fake browser when you set it up. Let paranoia guide you. If you think you're being too paranoid, you're not being paranoid enough.
>you need hosting in a neutral country like Switzerland

Switzerland is only neutral on paper and has a history of letting itself and its tech companies be influenced by powerful foreign nations (the German-US Crypto AG scandal, ProtonMail sharing info on some of its French users with the French authorities, etc.).

You might get good protection if you're a Swiss citizen living in Switzerland, but if you think Switzerland is some bastion of digital safety for foreigners, think again. If you're a foreigner, the Swiss authorities will not hesitate to throw you under the bus if you're being targeted by another powerful nation state with influence in Switzerland.

I don't think Swiss authorities would take down a site with video evidence of human rights abuses at the behest of the Chinese government. But hey, maybe times have changed.
> you need hosting in a neutral country like Switzerland and you will pay a lot for it

If you are actually interested in Swiss hosting for whatever reason, there are plenty of VPS & Dedicated offers these days, like Exoscale, that offer really fair pricing comparable to Vultr or DO

Keep a copy in AWS S3, they are extremely reliable for storage, you’ll sleep well knowing they are there. But that S3 bucket should be private and locked down - it’s your master copy - and AWS is the most expensive option for outgoing transit so you want to use it as little as possible. For routine usage keep a second copy in Backblaze B2 which is the cheapest storage you can get that isn’t running out of someone’s basement. I’d use Digital Ocean to serve files from that - doesn’t have to be anything major, just something running nginx. I’d front that with Cloudflare which is essentially free for content that doesn’t change frequently. Your DO instance should only respond to requests from Cloudflare’s backend IPs and only if those requests contain some magic header you inject. That makes it near impossible to find your DO host and access it directly, and it is the only thing that knows your backblaze B2 secrets so nobody is accessing B2 directly either.
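A minimal Python sketch of the origin-side check described above (only answer requests that come from the CDN's addresses and carry the injected header). Everything named here is an assumption for illustration: the header name `X-Origin-Token`, the secret value, and the IP ranges, which are documentation placeholder blocks and not Cloudflare's real ranges. In practice this logic would more likely live in the nginx config than in application code.

```python
import hmac
import ipaddress

# Placeholder ranges from the documentation address blocks (NOT real CDN ranges).
# A real deployment would load the CDN's currently published ranges instead.
TRUSTED_PROXY_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Hypothetical header name and secret injected by the CDN on every request.
SECRET_HEADER = "X-Origin-Token"
SECRET_VALUE = "change-me"


def is_allowed(remote_ip: str, headers: dict) -> bool:
    """Return True only if the request comes from a trusted range AND carries the secret header."""
    try:
        address = ipaddress.ip_address(remote_ip)
    except ValueError:
        # Not a parseable IP address at all: reject.
        return False
    if not any(address in network for network in TRUSTED_PROXY_RANGES):
        return False
    supplied = headers.get(SECRET_HEADER, "")
    # Constant-time comparison so the secret cannot be guessed via timing differences.
    return hmac.compare_digest(supplied.encode(), SECRET_VALUE.encode())
```

Note that the header lookup here is case-sensitive because it uses a plain dict; a real web framework normally normalizes header names for you.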

If you want to add an obfuscation layer in front of Cloudflare (maybe not needed in your case, since the content itself isn’t illegal in most of the world), you can serve through Tor to protect Cloudflare. DO is a good option again. If you go that route, your Cloudflare site should have some random unrelated name and should only serve files if the requests have a magic header. Tor is where you’re going to give back all the money you save with the Cloudflare + DO + B2 setup: you’re going to pay mostly for network usage, and Tor has a lot of overhead. You can scale Tor horizontally across multiple hidden services if you can afford it.

If using S3, don't forget to set up a lifecycle policy to transition objects to a cheaper storage class like S3 Intelligent-Tiering, and possibly Glacier. Maybe just avoid One Zone-IA for better durability.
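As a rough illustration of such a lifecycle policy, here is a Python sketch that only builds the configuration document. The rule ID and the day thresholds (30 and 180) are assumptions picked for the example, not recommendations; the dict follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument, but nothing is sent to AWS here.

```python
import json


def build_lifecycle_configuration(tiering_after_days=30, glacier_after_days=180):
    """Build an S3 lifecycle configuration that moves objects to cheaper storage classes.

    The day thresholds are illustrative assumptions; tune them to your access pattern.
    """
    return {
        "Rules": [
            {
                "ID": "archive-master-copy",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: apply to every object
                "Transitions": [
                    {"Days": tiering_after_days, "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


if __name__ == "__main__":
    # Print the document so it can be reviewed before applying it to a bucket.
    print(json.dumps(build_lifecycle_configuration(), indent=2))
```

One Zone-IA is deliberately left out of the transitions, in line with the durability concern above.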
I'm planning to look into https://www.storj.io/pricing (150GB free if I understand this correctly) for my own backup needs. You could theoretically host your own node somewhere on a cheap VPS, get credits and pay for your own storage? Not sure - I myself have to do some research.

Then you'd need a frontend capable of mounting Storj, and they seem to have pretty good documentation.

Again, unaffiliated, and not tested (yet)

Another alternative - pay google (https://one.google.com/about/plans) via a VPN in Turkey and you have 10TB for about 20USD a year. Make the files shared and there you go.

Oh and: I'm very sorry about what's happening in HK the last few years. I still hope the process is reversible.

They also provide an S3-compatible gateway meant to be used to publicly share content stored in the bucket.

https://docs.storj.io/dcs/api-reference/s3-compatible-gatewa...

Time4vps is in the EU. They have excellent prices for storage servers. I've been using one for backups for a few years.

my affiliate link (https://www.time4vps.com/?affid=1881)

You can host it anywhere and just "proxy" it from a throwaway box

So in that case you can use any commercial storage option suitable for that, like Backblaze/S3/etc., and simply put a $5 VPS (which you can rotate often) in front.

consider this alternative.

Split it into parts. Create a torrent file for each part. Get a private (home) server in the US and seed it, e.g. I have 1 Gbps upstream and would happily donate that to it 12 hours a day. Then come back to HN and other places with a website pointing to the torrent files, asking for more people to seed it. Also, ask seeders to store mirrors and put them wherever they can. 6 TB is not that much.
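The "split it into parts" step can be sketched in Python as a simple grouping of files into size-capped parts, one torrent per part. This is only a planning helper under assumptions: the function name and the input shape (a list of name/size pairs) are made up for the example, and creating the actual torrent files is left to whatever torrent tool you use.

```python
def split_into_parts(files, max_part_bytes):
    """Group (name, size_bytes) pairs into parts no larger than max_part_bytes.

    Files are taken in name order so the split is reproducible. A single file
    larger than the cap gets a part of its own rather than being cut up.
    """
    parts = []
    current, current_size = [], 0
    for name, size in sorted(files):
        # Start a new part when adding this file would exceed the cap.
        if current and current_size + size > max_part_bytes:
            parts.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        parts.append(current)
    return parts
```

Keeping the parts small enough that a volunteer can seed just one or two of them lowers the bar for people who cannot spare the full 6 TB.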

If you do a plan like this OP, also share the torrent links and explanation to r/datahoarder - those folks are always talking about interesting datasets that they think are useful to seed. Many of them explicitly do so to theoretically fight against the sort of censorship you are actually facing, and would jump at the opportunity to help.
I am surprised that among all these technically minded people, no one suggested making a BitTorrent seeder box and a tracker?
I think that works well for popular torrents (Wikileaks, etc.) but not so well for things that only a few people would download. The seedbox ends up being the primary host anyway, which is kinda what the OP planned already (i.e., a torrent yes, but in conjunction with offsite backup + http hosting).

Bittorrent isn't a guarantee of reliability, it's just a way to distribute files among multiple seeders. But that assumes there will be many seeders, which for something like this probably isn't the case.

Usually, this kind of post is preceded by "Ask HN:": https://news.ycombinator.com/newsfaq.html.
I wonder why nobody mentioned torrents over I2P: https://geti2p.net. It's distributed, anonymous and anyone can help without risking revealing their IP. It's slow of course, but might be faster if many people join.
The real issue is not making a torrent file, but finding enough people who want to seed it for years to come, whether over clearnet, Tor or I2P.
What torrent clients offer native I2P support?
HN audience should be able to install the I2P client and enter the address in browser to access the torrents.
Internet archive and IPFS pinning on Filebase and web3.storage as well as your own IPFS cluster put together with volunteers. Also can look into Sia and Filecoin.

If you can get two other engineers to volunteer and about 5 or 6 volunteers for buying storage, you can cover all of the above in less than a month.

Easiest is probably Internet Archive though. With a torrent.

Look into CAR files. Once you get those it's easy to pin on IPFS with multiple providers. Some of them free or extremely cheap.

> It will be ran by my own fund/donation with monero.

Well well, seems like cryptocurrencies have a valid use-case after all. ;)

Please don't use Hetzner for things like this, or at least proxy the public site through another VPS. Hetzner is strict on abuse, and if the content is against the law of another country it will be taken down.
Cool! I've never seen this kind of offering before.
Everyone here has been jumping in and saying "use my great service" or "use this great service".

But could I be the voice of reason here and jump in and remind the OP and everyone else to avoid putting all their eggs in one basket.

Even if you can't afford to replicate 6TB on multiple services, you should at the very least seriously consider picking, say three hosting services (preferably in different jurisdictions) and putting 2TB in each one.

In terms of offline/non-live backup itself, 6TB is really not that much these days. You could just buy a bunch of high-capacity hard drives, create multiple copies and spread them out via old-school "sneakernet" to your collaborators in different jurisdictions. Sometimes KISS is the best solution.

We've been using fpsync (http://www.fpart.org/fpsync/) to back up our production NAS. It is basically a scheduler wrapped around rsync. We use some 10G NICs for internal backups... it's fast enough that the SSDs are the bottleneck, so we are probably going to switch over to M.2s at some point.

Having said that, external transfers across the net will end up being your bottleneck, so you'll have to decide how much compression is worthwhile to reduce the time on the wire.
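To get a feel for the time on the wire, here is a back-of-the-envelope Python sketch. The 6 TB size and the 1 Gbps link are figures mentioned elsewhere in this thread; the compression ratio is a free parameter, and already-encoded video usually gains very little from general-purpose compression, so treating it as 1.0 is a reasonable starting assumption.

```python
def transfer_hours(size_bytes, link_bits_per_second, compression_ratio=1.0):
    """Estimate hours needed to push size_bytes over a link, ignoring protocol overhead.

    compression_ratio is original size divided by compressed size (1.0 = no gain).
    """
    bytes_on_wire = size_bytes / compression_ratio
    seconds = bytes_on_wire * 8 / link_bits_per_second
    return seconds / 3600


if __name__ == "__main__":
    six_tb = 6e12  # 6 TB in bytes (decimal terabytes)
    print(round(transfer_hours(six_tb, 1e9), 1), "hours at 1 Gbps")
    print(round(transfer_hours(six_tb, 100e6), 1), "hours at 100 Mbps")
```

Real transfers will be slower than this because of protocol overhead and links that rarely sustain their nominal rate.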

Another option, if available, is dump it all to drives, and snail mail it, but I'm guessing you already thought of that.

Good luck.

About video,

MEGA.nz offers e.g. an 8TB storage tier for a reasonable price. I'm not bringing them up because of their security+privacy concept –which may suit you– but because of their take on how you can put the material to work (and do work) while hosted with them.

HTML embed videos from storage into your web content and they'll be decrypted on the fly as they're watched. Link sharing for selected videos for discretionary views or downloads by receiving users, and of course team member co-op and granular access to said materials.

For starters though, prioritize getting material sent and backed up in relative safety before thinking about functionality. A simple store might fulfill your requirements in the end.

Feel free to drop me an email.

According to Kim Dotcom [0], the Chinese government has a backdoor to MEGA.nz.

[0]: https://nitter.net/KimDotcom/status/1539426611870986240

If he could produce substantial evidence to back up that claim, I'd be very interested in learning more. Until then, he doesn't have the credibility to be trusted.

Based on the assumptions I have about MEGA, Chinese intelligence and their motivations I find it unlikely they have an active, mutual conspiracy (impractical, expensive).

I assume MEGA is as viable a target for covert, low-cost try-your-luck attacks as any other western infrastructure/enterprise. In that sense, and in this post-Snowden world we live in, it seems as likely that any capable nation or Five Eyes member etc. could have* such a "backdoor".

*edit: 'has' → 'could have'

Huh. I just learned that nitter, but not twitter is blocked by the company firewall.

But yeah, I don't trust mega either.

I get daily spam from mega.nz. They can go to hell
This was the first time I heard about your issue, so I checked what they had to say on the matter.

Complaints about email spam with a MEGA email address: https://help.mega.io/security/data-protection/complaints-abo...

TLDR hypothesis - You get daily spam from a third party, using spoofed email headers, but I'm sure you'd figured that out already

Sorry, that is correct. I can't delete my comment. The mega.nz appears only in the from field. I see they use SPF.
You can possibly try to contact dang via hn@ycombinator.com to request that comment to be deleted. As a general rule, replied to comments can't be deleted by a user.
I'm interested in your project, have experience in this field and can give you details of privacy friendly hosting and perhaps guide you through setting this up anonymously+safely. Get a throwaway email/xmpp address and post it on site please
Thank you! We are still in the "back up" stage, because some platforms are at risk of closing down too, now that the new chief executive is in power. I will post a throwaway mail here if we confirm to go ahead with this plan.
ipfs + https://www.hetzner.com/storage/storage-box for about 20€/mo; you can also set up a PeerTube instance to stream those videos.
I work in communication science and I know some folks who would be interested to use this for research. If you have the bandwidth, ping me at p@atrifle.net and I can pass the data on. Regarding hosting, I probably can't offer that because public institutions are typically kind of averse to hosting data for everyone.
While I trust that the commenter above is genuine, OP seems to be highly concerned about their opsec. So I just want to post a friendly reminder for the OP to be careful about reaching out to people that comment on this post. For example, don’t reach out using an email that can be linked to your identity.
Yes I thought about that. There are a couple of different threat models. Anyone living in HK right now is in great danger and should be extremely cautious. For them, even posting here may be too risky.

Anyone outside of China may still be at risk, especially if they have family / friends there.

But to add a bit of opsec, the email above is on Protonmail, and if OP wants to reach out they can create a burner account there and email me - that would stay within the service. (cue debate about TLA access to Protonmail, but at least it's probably not the CCP)

> Some of you may know Hong Kong has been attacked by China and most of the pro democratic media companies have shut down... I also would like to host these files on a public domain, just to ensure more people can revisit on these events in the future.

I'm interested now. What happened?

I think it was a while ago. During COVID and right before, China started moving in on Hong Kong, clamping down on dissent and pro-democracy movements, shutting down protests and arresting individuals involved with media companies. It was the end of the Hong Kong free press.

https://en.wikipedia.org/wiki/Apple_Daily

https://en.wikipedia.org/wiki/Hong_Kong_national_security_la...

https://en.wikipedia.org/wiki/Internet_censorship_in_Hong_Ko...

Surprised nobody's suggested this:

10TB HDDs are cheap now. Three of them, with either lvm crypto or some other encryption system.

Three people leave the country on three different airlines to three different countries at three different times.

Then you sort out hosting the data.

Hey KingZil, this may be a little late to catch you, but you can store this on Filecoin and make it available on IPFS for essentially free (if you get up to a petabyte, it may get as much as $12 per year). I can set you up with an account at https://estuary.tech to help with this.

Also, if you contact me at danny@fil.org, we may be able to put you in contact with other archivists working in this space, such as Starling, and Open Archive.

Best,

d.

How big of an archive are you talking about? Torrent + a few cloud providers in different regions, ipfs pinning can create sufficient redundancy?

Reddit also has data hoarders who might help

> (close to 6tb)
If you want to store a lot of data, the solution is known as an object storage file system.

Check this list: https://datacadamia.com/file/edge_storage

Pricing varies a lot depending on whether or not you need to retrieve your data, but these are mostly write-once, read-many solutions.

Encrypt the files and then use Backblaze, which is reasonably priced cloud hosting. With encryption, the nature of the files is irrelevant.
I'm not sure what to make of the Monero mention. Is it a requirement that you can pay anonymously, or do you have a fund that is able to wire money? If you're avoiding the eye of China I suppose it's the former.

I believe there are services to "pin" content to the IPFS network, and I'd be surprised if there aren't torrent seedboxes you can pay for with crypto, but I'm afraid I'm short on specifics. Would the idea be that, to keep the files available, there would be a wallet that anyone can pay into to pay for hosting? It's what I assumed we would get with all these blockchain projects, but I haven't really seen it yet. There's Filecoin and Arweave, but I don't know whether they are trustworthy for the long haul, what with the market crashing.

EDIT: After reading some blogs on Filecoin it seems like they fit the bill, and I think my notion of "anyone can fund the maintenance of this specific dataset" is known as a DataDAO and maybe doesn't exist yet?

Monero would allow donations to be anonymous, so even if all things go south, it should be only my payment to VPS/hosting that would be leaked.

I want to enforce this by only accepting monero, having privacy by default, at the cost of not getting a lot of donations.

Torrent seedboxes might also be useful, as they prioritize bandwidth and (maybe?) disk space. I'm not sure if they'd be more or less secure than the typical VPS hosting, though, and I'd default to pessimism.
This may sound silly, but the people that probably have the most input are porn site hosters.

They deal with exactly these kinds of issues, all the time, and they are a pretty practical crew.

I have never personally used them but IPFS for volunteer hosting and Filecoin for paid distributed hosting might be worth looking into.
Post in r/datahoarder if you haven't already, folks there take data archiving pretty seriously.

Hopefully you're already using Tor.

I make the assumption that all the files you have are intended to be public. If they're not, only host and store encrypted versions using unique keys for each file so that you have the option to provide keys on a need-to-know basis.

I recommend having three tiers of storage (archive, online, and serving). Keep your primary backups on geographically diverse offline storage if possible; it should be enough to find a few people you trust in various countries to store complete copies since ~12TB hard drives are pretty affordable. Checksum everything, sign it, and make that signed list of checksums available for folks to verify that all their files are still intact. I don't have a ton of experience with using offline hard drives for longevity but I would expect that if everyone turns on their drive and verifies the whole archive once a year that you'll have a very low chance of losing any files. Check the Backblaze hard drive reliability posts for suggestions of good model numbers. Some individual disks will die but can be replaced and replicated from another source (probably 2nd tier). The goal of this tier is to not lose everything due to hacking or other attacks or disasters.
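A minimal Python sketch of the "checksum everything" step, using SHA-256 from the standard library. The function names and the manifest shape (relative path mapped to hex digest) are assumptions for illustration. Signing is not shown: the resulting manifest would be written to a file and signed separately, for example with a detached GnuPG signature.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-gigabyte videos do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map every file under root (by relative POSIX path) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root, manifest):
    """Return the sorted relative paths that are missing or whose contents changed."""
    root = Path(root)
    damaged = []
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(relative_path)
    return sorted(damaged)
```

Each person holding an offline copy could run `verify_manifest` during the yearly check and re-fetch only the files it reports.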

The second tier is the online storage. Cloud bucket storage (AWS, GCS, B2) is expensive at $10/TB-month or more, but it is readily available globally and can be secured with pretty good access credentials. It is probably too expensive to serve from buckets directly because of out-bound cloud network pricing. Local online storage is also fine for this if it has a fast internet connection. The goal for this tier is rapid replication of data to either the 1st or 3rd tiers to recover from data loss or spin up new mirrors.
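To put the $10/TB-month figure in context, here is a small arithmetic sketch in Python. The 6 TB archive size comes from the original post; the egress price used in the second example ($90 per TB) is an assumed placeholder and not a quote from any provider.

```python
def monthly_cost_usd(stored_tb, storage_usd_per_tb_month, egress_tb=0.0, egress_usd_per_tb=0.0):
    """Storage cost plus outbound transfer cost for one month, in US dollars."""
    return stored_tb * storage_usd_per_tb_month + egress_tb * egress_usd_per_tb


if __name__ == "__main__":
    # Holding 6 TB without serving anything from the bucket.
    print(monthly_cost_usd(6, 10.0))
    # Same bucket, but the whole archive is downloaded once in the month
    # at an ASSUMED egress price of $90 per TB.
    print(monthly_cost_usd(6, 10.0, egress_tb=6, egress_usd_per_tb=90.0))
```

Under those assumptions a single full download costs several times the month's storage bill, which is the reason for keeping this tier for replication and serving from somewhere else.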

I think for the third tier you should reach out to large CDNs and ask them to help host the large files as a public service to democracy. Failing that, set up your torrent trackers and web servers on VPSs with rate limits to avoid huge bills or getting kicked off the provider. Large public cloud instances are also a (expensive!) possibility but require hard-to-anonymize accounts and have pretty good abuse detection systems that will likely make it hard to repeatedly sign up for and host the same content again anonymously. Local/residential hosting in free countries on symmetric internet connections is also an option; plenty of people run Tor exit nodes successfully and so you might be able to get enough people to run trackers from home. This is what will cost the most money and time, but it is worth having at least two active mirrors (VPS hosts with copies of the website and files) at all times.

Get a few domain names in different TLDs based in different countries, each with a different registrar. This makes it harder for all of them to be taken offline at once. Keep a list of working mirrors visible on each mirror so folks know where else you are hosting if a domain goes down, and point each domain's DNS to a mirror (all modern web servers support SNI for hosting multiple TLS certificates per IP). This buys very cheap redundancy; setting up fully redundant serving behind a single domain requires something like cloudflare or AWS/GCP/Azure load balancers or your own custom front-ends. You could use round robin DNS to point every domain at every mirror's IP, but when one mirror goes down a fraction of users will get a long timeout until they try a working IP.
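Keeping the list of working mirrors current can be partly automated. A hedged Python sketch using only the standard library follows; the mirror URLs in the usage example are made-up placeholders, and the probe is injectable so the filtering logic can be exercised without touching the network.

```python
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if the URL answers with a 2xx or 3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers DNS failures, refused connections, timeouts, HTTP errors, bad URLs.
        return False


def working_mirrors(mirrors, probe=http_probe):
    """Return the mirrors that currently respond, preserving the given order."""
    return [url for url in mirrors if probe(url)]


if __name__ == "__main__":
    # Hypothetical mirror domains, one per TLD/registrar as suggested above.
    candidates = ["https://archive.example.org", "https://archive.example.net"]
    print(working_mirrors(candidates))
```

The output of such a check could be used to regenerate the mirror list page that each mirror displays.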

Keep your configuration files, scripts, web site source code, etc. in git or another form of version control, and make regular backups. Be careful to keep credentials out of version-controlled files. This makes it easier to spin up a new VPS web host whenever necessary, to track work done on the site, and to collaborate with other admins.

Depending on the risk you perceive, if you can trust other admins, split administrative duties up between multiple people so that no one person has administrative control (including passwords, hardware tokens, email accounts, ssh keys, etc) over all the online resources. If you have enough trusted people then shift to a cell-structured network where not all admins know how to identify each other.

Use hardware two-factor tokens wherever possible and watch out for targeted spearphishing attempts.

Good luck!

https://web3.storage

Regardless of your opinion of crypto, you can get 1TB+ free Filecoin storage per account, and have the data pinned 6 times for about 2.5 years. You can later extend the pinning, and add estuary.tech and nft.storage for redundancy.

Fast upload, global distributed backups, fast reliable access, but you'll need to encrypt before upload if you don't want them public.

https://docs.sia.tech/renting/how-to-rent-storage-on-sia How about this? I only host and get a lot of contracts. There is a free 100GB, I tested and the supported videos can be played in the browser. It has a 3 copy redundancy
And it won't last.
pCloud.com has some plans. You can even buy a 10TB lifetime plan. Not affiliated, just a happy customer. Just make sure if you have intellectual property or sensitive things to encrypt before you upload.
just sign up for https://metallic.io/
get in touch with archive.org