Hacker News
by Dunedan 2288 days ago
> “However, when end users download data from Earthdata Cloud, the agency, not the user, will be charged every time data is egressed.”

Not necessarily; it depends on how the users access the data. If users access the data through their own AWS accounts, NASA could leverage S3's "Requester Pays" feature [1] to let the user pay for downloading the data.

1: https://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPay...
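For anyone who hasn't used it, here is a minimal sketch of what a download from a Requester Pays bucket could look like with boto3. The bucket and key names are hypothetical placeholders, not real Earthdata locations; the part that matters is the `RequestPayer="requester"` argument, which signals that the caller accepts the transfer charges on their own AWS account.

```python
def requester_pays_get_args(bucket: str, key: str) -> dict:
    """Build GetObject arguments for a Requester Pays bucket.

    RequestPayer="requester" tells S3 that the caller agrees to be
    billed for the request and the data transfer.
    """
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}


def download(bucket: str, key: str, dest: str) -> None:
    """Stream one object to a local file, billed to the caller's AWS account."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3")  # uses the caller's own AWS credentials
    response = s3.get_object(**requester_pays_get_args(bucket, key))
    with open(dest, "wb") as fh:
        for chunk in response["Body"].iter_chunks():
            fh.write(chunk)


# Hypothetical usage (names are placeholders):
# download("example-earthdata-bucket", "granules/example.nc", "example.nc")
```

Requests to such a bucket have to be authenticated, which is why the user needs an AWS account of their own for this to work.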


I immediately thought about this as well; however, I seem to recall reading somewhere (and I could be entirely wrong here) that NASA has a requirement to give away its science data freely.
If there's a marginal cost for each copy of the data that's transferred to a user, I don't think asking the user to cover that cost conflicts with a requirement to "give away the data".

(If they distributed their science data in printed form, surely they'd be allowed to charge people for the cost of printing & mailing the paper copies; that's quite different from charging for the data itself.)

Why the downvotes? This isn't uncommon or unreasonable if you're downloading TBs of data. Also, the data would be freely redistributable if someone took the data and put up a torrent. Still, I'd rather see NASA host their own data. Put up an FTP server and a torrent server, and save a lot of money on hosting fees.
While proxying through a torrent system is a good idea, I doubt it would get well seeded outside a few popular datasets; the agency would end up the sole seeder of the long tail.

I’m willing to bet NASA saves a ton of money by going to a cloud provider; US government storage setups are insanely expensive. I remember a project I was on got a quote of over $10,000/TB in 2014. And there is no way egress is actually free right now: they are paying for a government-regulation-compliant internet connection one way or another.

I do worry about vendor lock-in to a degree, but I’m confident the agency and taxpayers would save money going to any major cloud provider.

Sounds like there is a bigger story there and it's probably a managed SAN.

I've operated pretty significant government shared infrastructures like this in the past... we were offering fast, flash-cached disk in 2010 for about $5,000/TB. $10k/TB is not unreasonable for highly available Tier-1 storage for something like SAP, especially in that era, when you couldn't use all-flash in most cases.

Today, cost structures can be very different. You can land high-IOPS storage for a fraction of the cost without the overhead of a big SAN. If you need capacity-focused storage, that is also much cheaper.

An agency like NASA gets hosed on services, and cloud is no different. AWS is probably a net savings for operational workloads whose characteristics are known. Backup is a no-brainer. But for a high-volume, operationally highly variable thing like a public archive of data, AWS is a square peg in a round hole because of the metered access.

I’m sure that $10k/terabyte quote was complete overkill for what we needed, but that’s what the stovepiped storage org was offering, and it killed the project we were working on.
Wow! That's good to know, if a bit disheartening. I guess I was thinking of small-startup costs, with some cheap-ish Linux RAID setups and the massive fiber taps NASA must surely already have, not government/big-business costs.
What causes a cost of $10,000/TB? Even with multiple redundant failsafes I just cannot see how the cost could run up to that.
In 2014?

You'd be buying something like an EMC VMAX that can sustain 1M+ IOPS on lots of 15K RPM spinning drives, with caching tiers on crazy expensive flash.

To support that, you need a Fibre Channel network layer and a bunch of FTEs to attend to it. Compliance requirements usually mandate segmentation of roles, which increases cost. If you're a federal government entity, those FTEs are most likely contractors billed out at $125-300/hr. Figure $3-5M/year on labor costs alone, although that may be divided out over multiple systems.
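As a rough sanity check on those labor numbers, here is a back-of-the-envelope sketch. The 2,080 billable hours per year (40 hours x 52 weeks) is my own assumption, not a figure from the comment above; the rates and budget range are the ones quoted.

```python
# Assumption: a full-time contractor bills 40 hours/week for 52 weeks.
HOURS_PER_YEAR = 40 * 52  # 2,080 hours


def annual_cost_per_fte(hourly_rate: float) -> float:
    """Yearly billed cost of one full-time contractor at the given rate."""
    return hourly_rate * HOURS_PER_YEAR


def implied_headcount(annual_budget: float, hourly_rate: float) -> float:
    """How many full-time contractors a yearly labor budget pays for."""
    return annual_budget / annual_cost_per_fte(hourly_rate)


low_rate, high_rate = 125, 300               # $/hr, as quoted
low_budget, high_budget = 3_000_000, 5_000_000  # $/year, as quoted

print(annual_cost_per_fte(low_rate))    # 260000
print(annual_cost_per_fte(high_rate))   # 624000
# Smallest team: low budget at the high rate; largest: high budget at the low rate.
print(round(implied_headcount(low_budget, high_rate), 1))   # 4.8
print(round(implied_headcount(high_budget, low_rate), 1))   # 19.2
```

So under that assumption, $3-5M/year corresponds to somewhere between roughly 5 and 19 full-time contractors, which is consistent with a storage team split across segmented roles.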

This happens in commercial business too. I had a buddy who was making about $150k in NYC to zone LUNs on a SAN. Basically he kept a spreadsheet and updated a specific configuration setting 2-3x a day and spent about 60-90 minutes/day doing that. The rest was waiting or studying for his MBA.

It's pretty wacky to compare S3 to this type of storage.

By the way, depending on where it's hosted, S3 can seed torrents automatically: https://docs.aws.amazon.com/AmazonS3/latest/dev/S3TorrentRet...
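A small sketch of how that worked, as far as I recall: you requested the object with a `?torrent` query string and S3 returned a .torrent file for it. The bucket and key below are hypothetical, and the virtual-hosted URL form is an assumption for illustration. My understanding is that this only applied to publicly readable objects under a size limit, and that AWS has since retired the feature, so treat this as historical.

```python
from urllib.parse import quote


def torrent_url(bucket: str, key: str) -> str:
    """Build the URL that asked S3 for a .torrent file for an object.

    Assumes the virtual-hosted-style endpoint; the key is percent-encoded
    but slashes are kept so the object path stays readable.
    """
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}?torrent"


# Hypothetical bucket and key, for illustration only.
print(torrent_url("example-bucket", "data/file 1.nc"))
# https://example-bucket.s3.amazonaws.com/data/file%201.nc?torrent
```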
Records departments always charge for copies, and that is the use I thought of immediately when I learned of Requester Pays. I’d be surprised if NASA couldn’t use it.
Why FTP? Torrent it all the way, perhaps with AWS acting as nodes...
> If there's a marginal cost for each copy of the data that's transferred to a user, I don't think asking the user to cover that cost conflicts with a requirement to "give away the data".

Charging the user for data, even if it is on a marginal cost basis, conflicts with a mandate to give data away freely, because “at the marginal cost of delivery” is not “free”.

(It's true that mandates commonly specify something like "at marginal cost of delivery" rather than "free"; sunshine laws providing copies of public records often work that way. But that's not the applicable mandate here. In fact, since the data would already be available on a marginal cost basis under FOIA without the separate mandate, the main reason for a separate mandate is to negate that cost.)

Do you have a citation for the "mandate to give data away freely"?

I found https://nodis3.gsfc.nasa.gov/displayDir.cfm?t=NPD&c=2230&s=1, which mentions things like "Ensure public access...", but I don't see anything there mandating such public access to necessarily be at zero cost.

Also, public access can mean that once someone gets a copy of the data they can host it for free as well. It's not as if it's under a commercial license.
While the data is free, the cost of getting the data to you can be charged. Originally, it was to cover the expense of someone pulling the data, making copies, and then mailing that data out to you. If it was photographic, you'd be charged for the prints. I'd see using Requester Pays in the same vein. They are not charging you for the data, but any fees incurred to obtain the data would be at your expense.
Isn't Requester Pays just like paying for gas to drive to my local library when I can't bike because I want to borrow so many books? The books are still free to borrow.
It's more like we both have a library, the books are free, but if I want to take some of your books I have to pay for shipping.
I'm pretty sure it's like when I buy a book, and then I pay for it.
It's required to be public domain. IMO it's comparable to FOIA requests still requiring the requester to attach a stamp to the envelope their request goes in. Or at most, include a self-addressed stamped envelope too.

Requiring you to pay S3 is little different than requiring you to have Internet access, and thus pay whichever company includes you in THAT monopoly, IMO.

To me it feels very different.

Imagine for a moment that in order to access NASA data sets you had to have a Fastmail email account. Gmail won't work, Outlook won't work, it has to be Fastmail alone.

That would be very objectionable (as much as I adore Fastmail).

Ability to pay one specific cloud provider should not be a gate for public domain government data.

I don't think this analogy works. For Fastmail, there is a cost regardless of whether you want to access government data. You have to pay for the account itself. For most cloud providers, there is zero cost for having an account. Even if they hosted this themselves, they could just as likely charge for data transfer costs...and get to choose how to collect that. They could choose PayPal and you have to create an account. Or they take credit cards...and you must have a card belonging to one of the networks they support. The barrier to entry doesn't change regardless of how many cloud providers there are, all it does is increase infrastructure costs unnecessarily.
The alternative here, though, is to get comparable distribution, durability, etc. by spending way more of the public's money upfront regardless of who wanted the data. I get the purist / idealistic argument here, but it feels a bit like cutting off one's nose to spite their face.
I'm not an expert, but most government agencies are allowed to charge reasonable fees for access to their data. I don't know if this qualifies, but it at least seems like a possibility, especially if it's transparently just passing along their costs in the form of AWS' own cost structure.
This then requires that everyone have an AWS account and billing relationship with Amazon to access public data.
I wonder if there is a problem with this because it requires you to have an Amazon account and such to do it. There is now a much higher barrier to entry for random people to access small amounts of data, and you no longer have direct HTTP links: you have to use the CLI / SDKs once Requester Pays is turned on.
And this would be an even worse outcome.
Why?
Because it allows the agency to escape from its bad design problems by pushing the (huge) cost onto its clients -- and those clients are other parts of the US Govt or funded by the US Govt.
You're asserting the design is flawed when that's in dispute.

It's useful for those agencies' budgets to reflect a portion of the cost of performing that research.

The USG needs insight into what taxpayer dollars are being spent on. Lawmakers have to explain to constituents why that money is being spent.

NASA is the first tier of information, collecting the data. Its budget ought to reflect that cost.

The consuming agencies are the second tier, processing that information. Their budgets reflect the cost of gathering their information and of processing it.

NASA doesn't know which information will be useful, so it's not helpful for them to pay the cost of egress. We want them to collect as much as possible.

It's much like a music store: 90% of its sales come from the top 10, but there's a lot of value in hosting obscure stuff.

If they have to pay to store it all rather than pay for egress, they'd have to justify the cost of storing data about which they can only say "it might be useful some time."

If the agencies that are working with the data pay for the egress, they can justify the cost by showing the specific work they do.

The missions are already funded on the basis that they will store and share the data.

But you're arguing for inter-agency billing as the correct way to weigh scientific experiments? That isn't rational.

I'm a huge fan of requester pays, and I frankly don't understand why we haven't switched more of the internet to it.

I'm also a liberal, so then I also think government should give everyone a monthly quota of internet usage allowance. Universal Basic Internet Income, or something.