Hacker News
by tonymet 876 days ago
BigQuery public datasets would be a better hosting platform for this kind of data. I worry they are not anticipating the security & budgeting issues of hosting a real-time API.

With public datasets, the account making queries pays for the queries. NPS only pays for the storage (which is minimal).

With this API, NPS has to pay for every call to the API. That’s not cheap.

11 comments

Requiring use of a private party to access public data is usually something we discourage.
I agree with this in theory, but in practice it would be unrealistic and honestly a misuse of government funds for them to reinvent and maintain a fully in-house stack for all of their digital services.
At what cost?
They’re hosting on AWS. So either the taxpayer pays for hosting, or the customer pays.
Parent thread said:

> to access public data

Keyword is access. Hosting on AWS is an implementation detail that doesn't block the end consumer from accessing the data.

There are 4-5 other assets the customer needs in order to access it. So one more wouldn’t be a big deal.
Tell that to US government agencies publishing their announcements and other news on Medium.com.
Do you need anything other than a web browser to access it? Medium is just the server that it’s hosted on.
you have to accept TOS and in some cases pay for a subscription
Paying for subscription is only when the publisher has opted into monetization. Which isn’t the case for US government agencies.

That said, I hate Medium with a passion, and I hate that things like the Netflix tech blog are hosted there.

Private parties that the customer needs to pay to access this NPS public data:

* AWS

* Comcast for their internet service

* Apple for their laptop

* A number of software providers for their development tools.

But asking the customer to pay Google to query the data is crossing the line?

This argument is nonsensical.

Site hosting is not a customer cost.

The rest you list are costs orthogonal to this service.

> But asking the customer to pay Google to query the data is crossing the line?

Yes.

Why are you arguing for a US government agency to require its citizens to pay for access to data which they have already paid for by funding said agency?

Well, no, a customer has choices for most of those, because the government isn't hosting the data exclusively with a private vendor that charges the customer for access, providing an exclusive franchise to that vendor.

That was what was suggested upthread.

Requiring the user to have certain capacities to access data, where those capacities are provided by a number of competing vendors (and some by free, gratis and/or libre sources) is a very different thing.

NPS is hosting this data on AWS, a private vendor. And NPS (ultimately taxpayers) pays for every query.

So are you OK with some Chinese app company making 50 crappy NPS-themed apps and having taxpayers pay for the backend?

> So are you OK with some Chinese app company making 50 crappy NPS-themed apps and having taxpayers pay for the backend?

I will make that trade every day of the week if it means access continues to be through a standard protocol (HTTP) and not beholden to any particular vendor.

> So are you OK with some Chinese app company making 50 crappy NPS-themed apps and having taxpayers pay for the backend?

Addressed in a comment in another subthread, which I know you are aware of since you responded to it, too: https://news.ycombinator.com/item?id=39086270

Why would someone who just wants to access the data need to pay for AWS? And the rest can be avoided by using a library PC & open source software. Or more likely, are already things almost everyone has on hand anyway.
Every request to EC2 costs money. TANSTAAFL
> TANSTAAFL

What?

There Ain't No Such Thing As A Free Lunch. It's an old saying meaning nothing is really free. If you aren't paying money, you're paying some other way.
I was hoping it was a PEMDAS for AWS costs.
What's stopping me from accessing this without an AWS account, over Frontier, on a Thinkpad?

That's the difference.

You’re thinking small. I’m thinking big. That’s the difference.
You're arguing in favor of making consumers require an account at a company that's already centralized too much of the web.

That's fundamentally a lot worse than the government paying hosting costs to one particular vendor for a commodity service.

You really aren’t. You’re talking like someone who is willfully ignorant of the decades of internet history that have preceded this conversation.

People have quite literally died over the issue of public access to public data. It’s quite an important belay point to arrest the deterioration of the spirit of open networks.

Host it as csvs as a backup
The majority of your bullet points would be circumvented by running your own server and developing on a Linux OS.
Where do you host this hypothetical server? How does it get internet access?
FYI, you can edit your posts for an hour. Instead of reposting, just add your new thoughts onto your previous comment?
Who makes servers?
The federal government pays Comcast to provide internet to low income households. And you can actually access this data on any old brand of laptop, or the desktop computers provided for use in most jurisdictions, and do not have to pay for any development tools to do so.
My second-hand laptop isn't an Apple, and my host is a Raspberry Pi, not AWS. I don't use Comcast; I have a wide choice of providers, including free ones (at my local library), and I've never paid for a development tool.
> NPS has to pay for every call to the API. That’s not cheap.

I am perfectly fine with it being considered part of the basic, taxpayer-supported functions of government agencies to be providing the public with relevant data.

If there is a concrete abuse or wildly disproportionate cost problem in particular areas, that may need to be addressed with charges for particular kinds or patterns of access.

You might be fine, but any taxpayer expense must be justified and cheaper alternatives explored. This is someone else's money, so it is very easy to feel entitled, but every penny saved here can go into other, better things like conservation, infrastructure in parks, etc.
At what cost? REST APIs are very expensive ways for the government to make CSV data available to the public.
They are a whole lot less expensive than tracking customer usage and billing for it, and a whole lot more useful to the public than having the data nominally publicly accessible but only "on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying Beware of the Leopard." [0]

[0] Douglas Adams, Hitchhiker's Guide to the Galaxy

Not in this case; that's my whole point. They are choosing the most expensive (and riskiest) way to make CSV files available to the public.
There's likely some truth that CSV would work well here, and would likely be cheaper to operate. I wouldn't be surprised if a lot of clients are or could be doing full transfers of the data and doing their own queries.

I'd be pretty happy with sqlite dumps too.
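For what it's worth, "do a full transfer and run your own queries" is only a few lines on the client side. A rough sketch, assuming a made-up CSV (the `park_code`, `name` and `state` columns and the sample rows are placeholders, not the real NPS schema), loaded into an in-memory SQLite database:

```python
import csv
import io
import sqlite3

# Hypothetical sample data; the real dataset's columns are not known here.
SAMPLE_CSV = """park_code,name,state
abcd,Example Park A,CA
efgh,Example Park B,UT
ijkl,Example Park C,CA
mnop,Example Park D,CA
"""

# Load the whole file once, then query it locally as often as needed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parks (park_code TEXT, name TEXT, state TEXT)")
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
conn.executemany(
    "INSERT INTO parks VALUES (:park_code, :name, :state)", rows
)

# Example local query: number of parks per state.
counts = conn.execute(
    "SELECT state, COUNT(*) FROM parks GROUP BY state ORDER BY state"
).fetchall()
print(counts)
```

After the initial download, every query runs on the client's own machine rather than on the publisher's servers.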

I don't really have an issue with the REST API, though. I wouldn't be surprised if this was just a standard, cheap-to-set-up Django + REST libraries stack. Yeah, the compute costs are higher than transferring static files, but I'd be shocked if this was taking enough QPS for the difference in cost to be meaningful.

I get wanting the government to be responsible, but this veers a bit too far into Brutalist architecture as an organizational principle.

Yes, and the flip side of that is you require people to have and use Google accounts to access public data? That’s not exactly ideal.
And at what cost? Why are we paying for customers to query this data in real time?

So someone can host a ripoff NPS app on the App Store and taxpayers now pay for content hosting?

The USPS provides a free service for address validation. You have to register an account and receive an access token. If they feel your token is using too much, they can handle it as necessary. Why this same concept couldn't be done in the same way is just lack of imagination.

You can access for free, but if you abuse or break the TOS your access is revoked. Done
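To make the "register, get a token, revoke on abuse" idea concrete, here is a toy sketch. It is not how USPS or NPS actually do it; `TokenRegistry`, the quota value and the token name are all invented for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class TokenRegistry:
    """Counts requests per token and revokes tokens that exceed a quota."""

    quota: int
    counts: dict = field(default_factory=dict)
    revoked: set = field(default_factory=set)

    def register(self, token: str) -> None:
        # Issue a token with a fresh request counter.
        self.counts[token] = 0

    def allow(self, token: str) -> bool:
        # Unknown or already-revoked tokens are rejected outright.
        if token not in self.counts or token in self.revoked:
            return False
        self.counts[token] += 1
        # Going over the quota revokes the token for all later requests.
        if self.counts[token] > self.quota:
            self.revoked.add(token)
            return False
        return True


registry = TokenRegistry(quota=3)
registry.register("demo-token")
results = [registry.allow("demo-token") for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

A real service would presumably reset counters per time window and apply human judgment before revoking, but the basic bookkeeping is this small.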

USPS is a corporation with profit and loss. I still consider this irresponsible but theoretically I'm not paying for their largesse
Fine, host it as CSV as a backup for the Luddites.
> BigQuery public datasets would be a better hosting platform for this kind of data. I worry they are not anticipating the security & budgeting issues of hosting a real-time API.

Then use their API to populate a BigQuery public dataset and make it available to all.
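The conversion step for that is small. A sketch, assuming the API returns JSON with a top-level `data` list (the payload shape and the `parkCode`, `fullName` and `states` field names are assumptions for illustration, and the sample records are made up), which flattens a response into CSV text that a warehouse loader could ingest:

```python
import csv
import io
import json

# Made-up response body; the field names are assumed, not confirmed.
SAMPLE_RESPONSE = json.dumps({
    "data": [
        {"parkCode": "abcd", "fullName": "Example Park A", "states": "CA"},
        {"parkCode": "efgh", "fullName": "Example Park B", "states": "UT"},
    ]
})


def json_to_csv(payload: str, fields: list) -> str:
    """Flatten the records under "data" into CSV with the given columns."""
    records = json.loads(payload)["data"]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


csv_text = json_to_csv(SAMPLE_RESPONSE, ["parkCode", "fullName", "states"])
print(csv_text)
```

Fetching the pages and uploading the result are left out, since both depend on details not given in this thread.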

Otherwise, perhaps we, as outside observers, need to consider the possibility that those who made the decision to provide this service as such did so for reasons of which we may not be aware.

OK, good suggestion. Here you go:

nps-public-data.nps_public_data

Looks like access denied? Good work!
the following tables are now available:

parks

people

feespasses

REST APIs are extremely battle-tested, easy to integrate with, and far more mainstream than BigQuery public datasets or any other niche technology that may or may not exist at some point in the future. If cost is truly an issue, perhaps the solution is to properly fund the NPS so it can make smart technology decisions.
Every REST API implementation is bespoke. What does "battle-tested" mean in this sense?

Sure, the concept of REST APIs is mature, but each implementation is untested.

At what cost?
I’m not sure if you’re serious (given the spammy nature of your posts, I’m inclined to believe not), but given that REST is the de facto standard for exchange of information between machines across the internet, I think the onus is on you to estimate how much money you think the NPS stands to save by doing it your way. Then the rest of us can evaluate whether that’s a good tradeoff.
Based on the information we have, we only know that they are sharing CSV files via REST on AWS/EC2. That’s the most expensive way to share it, and also risky.

What’s spammy about my post? I have asked people to focus on costs when they make general statements like “all govt data should have a REST api”.

Do you actually know that the only source of the data is a static CSV? Or is that just speculation?

I think we can dispense with the risky argument, because this API has existed for years without issue.

When you make the argument that "X is too expensive," the onus is on you to prove it's expensive in a relative sense, not simply in an absolute sense. Saving $100 matters if you're spending $1000; it probably doesn't matter if you're spending $10m. Feel free to convince us: make some estimates, crunch some numbers, and look at existing NPS IT spending and see if they seem ballpark reasonable. Otherwise you're just banging on about a left-field solution that almost no one wants because it's putatively cheaper (but by how much, you can't say).
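The relative-versus-absolute point is just a ratio. A quick sketch using the illustrative figures from the paragraph above (they are not real NPS numbers):

```python
def relative_saving(saving: float, total_spend: float) -> float:
    """Return a saving as a fraction of total spend."""
    if total_spend <= 0:
        raise ValueError("total_spend must be positive")
    return saving / total_spend


# Same $100 saving, measured against two very different budgets.
small_budget = relative_saving(100, 1_000)
large_budget = relative_saving(100, 10_000_000)

print(f"{small_budget:.1%}")  # 10.0%
print(f"{large_budget:.4%}")  # 0.0010%
```

The same saving is a tenth of the smaller budget and a thousandth of a percent of the larger one, which is why the comparison needs an estimate of actual spend.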

All the APIs but one are static JSON.
CSV is also a de facto standard and is more common than REST, at 1/100 the cost.
Do you have numbers to back up the cost savings estimate? I can imagine lots of REST implementations that are really inexpensive.
Hosting CSV on S3 vs. hosting EC2 instances for an Apache REST API.
Looking at the data made available by this API, I think it's safe to say this is fine.
This is a fascinating thread under this comment. Everyone is keying off of one part of the comment (querier pays) and not the more critical issue IMO - anticipating security and budgeting issues of hosting a real-time API. You suggested an alternative and everyone is pitting the status quo against that alternative instead of maybe looking for other alternatives that help address the issue.

People here clearly don’t like a querier pays model and that’s fine. But should NPS still reinvent the wheel across the SDLC to serve this data? I think there’s a compelling argument in there.

Yes, thank you for noticing that. My bigger concern is NPS paying for expensive auto-scale resources for what are basically CSV files that could be hosted cheaply and securely.

A REST API is very expensive when you include compute costs, transfer fees, and admin costs to keep it up.

Not to mention the cost to implement a bespoke API and deal with security issues.

All to make CSV available!

On the list of alarming or even questionable things our taxes pay for, this doesn't even make the top 100.
Start a thread on one of those. Let's discuss.
I'd consider it a public transit service. We wouldn't be upset about people using shuttle buses to get to the parks, would we? I think that, long term, footing the bill for an open platform with principal beneficiaries who use it is fine, so long as it provides a net benefit.
If you have to pay for a REST API or shuttle buses, which one gets funded?
With their API they have to write a bunch of boilerplate code to transform from their SQL db to REST. Authentication, throttling, threat prevention, encoding, etc etc.

With BigQuery they just copy the data in via CSV and BigQuery handles the indexing & query engine.

> With their API they have to write a bunch of boilerplate code to transform from their SQL db to REST.

Open source tools that will present a simple, read-only REST API over an SQL db with little to no custom code exist (so do proprietary tools, often from DB vendors and sometimes as part of SQL db products.) Same with NoSQL or sorta-SQL storage solutions.

The idea that they have to write a bunch of boilerplate code to do this is false. They might choose to do that, but it's definitely not necessary.

> Authentication, throttling, threat prevention, encoding, etc etc.

Again, open source canned solutions that take a little bit of configuration exist for many of those, and some of them are likely shared services that they just point at whatever service needs them.

Who says they have a SQL DB? This looks to be almost entirely static data, occasionally updated.
Whatever storage format they have, they are writing boilerplate to transform it into REST. Regardless, it will be cheaper to just ingest into BigQuery.
It’s Apache Solr. Most of the data is static, but alerts and events get frequent updates.
Same concern about unnecessary code and compute stands.
Are you suggesting the government put their public access API behind a paywall?
This data domain doesn’t need a real-time API. They could host CSVs online with some mirrors and save millions of dollars hosting this stuff.
Ah yes, I see now. Yeah makes no sense to offer a REST response for each request.

On that note, what would the processing entail? Processing the get request and packaging the entire dataset into a REST object right? Or is it a more complex API that lets you run queries against the dataset? For that matter wouldn’t downloading a CSV also have to be packaged into a REST object?

The REST API provides parameters for filtering & pagination. All of that is unnecessary. It's a few hundred MB, tops. CSV, BigQuery, anything is better than running REST on EC2.
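For illustration, filtering and pagination over a downloaded CSV can be done entirely client-side. This is only a sketch: the `query` helper, its `state`/`limit`/`start` parameters and the sample columns are hypothetical and do not mirror the actual NPS API.

```python
import csv
import io
from itertools import islice

# Hypothetical sample data standing in for a downloaded CSV file.
SAMPLE_CSV = """park_code,name,state
abcd,Example Park A,CA
efgh,Example Park B,UT
ijkl,Example Park C,CA
mnop,Example Park D,CA
"""


def query(csv_text: str, state: str, limit: int, start: int = 0) -> list:
    """Filter rows by state, then return one page of the matches."""
    rows = csv.DictReader(io.StringIO(csv_text))
    matches = (row for row in rows if row["state"] == state)
    # islice skips `start` matches and stops after `limit` more.
    return list(islice(matches, start, start + limit))


page1 = query(SAMPLE_CSV, "CA", limit=2)
page2 = query(SAMPLE_CSV, "CA", limit=2, start=2)
print([row["park_code"] for row in page1])  # ['abcd', 'ijkl']
print([row["park_code"] for row in page2])  # ['mnop']
```

The tradeoff is that each client has to download the whole file first, which is the "few hundred MB" transfer mentioned above.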