Hacker News
by yatsyk 2310 days ago
CouchDB/PouchDB looks very promising for offline-first apps, but I can’t understand how to restrict bad clients. A client could potentially insert a huge document or execute an expensive query and degrade the experience of other clients on the same server. Is there any way to prevent this?
2 comments

A couple ways:

One, you implement validation functions [1] on user databases to control what kind of data can be inserted into Couch. These functions can only be changed by database admins, not users, so they can act as a security mechanism controlling what goes in.
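
For a rough idea of what that looks like, here is a minimal sketch in Python that builds a design document carrying a `validate_doc_update` function (the function itself is JavaScript, which is what CouchDB runs). The design document name, the required `type` field and the 1 MB cap are assumptions made up for illustration, not anything from a real app.

```python
import json

# JavaScript source of the validation function. CouchDB calls it for every
# write to the database and rejects the write when the function throws.
# The "type" check and the 1 MB cap are illustrative assumptions.
VALIDATE_DOC_UPDATE = """
function (newDoc, oldDoc, userCtx, secObj) {
  if (newDoc._deleted) { return; }
  if (typeof newDoc.type !== 'string') {
    throw({forbidden: 'doc.type is required'});
  }
  if (JSON.stringify(newDoc).length > 1048576) {
    throw({forbidden: 'document too large'});
  }
}
"""


def build_validation_ddoc(ddoc_name: str = "validation") -> dict:
    """Return a design document body to PUT at /{db}/_design/{ddoc_name}."""
    return {
        "_id": f"_design/{ddoc_name}",
        "validate_doc_update": VALIDATE_DOC_UPDATE.strip(),
    }


if __name__ == "__main__":
    # Print the JSON an admin would upload; no network call is made here.
    print(json.dumps(build_validation_ddoc(), indent=2))
```

Note that the function only runs once the request body has already reached the server, so it controls what gets stored, not how much traffic a client can send.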

As mentioned by others, you can also implement a proxy. This doesn't have to interfere with sync functionality; you just have to make sure you proxy all the endpoints in the replication protocol [2]. Envoy [3] is one such proxy that essentially applies document-level permissions to a CouchDB database without interfering with sync.

If the goal is just to limit document size, or throttle clients trying to hammer the API, this doesn't even have to be a custom proxy; any reverse proxy with the needed control knobs (such as NGINX) will do. You can of course combine this with validation functions, using validations to ensure that everything that comes in is the right "shape" and using NGINX and its ilk to apply throttling and sane request limits.
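
To show what those two knobs amount to, here is a small Python sketch of a per-client token bucket plus a body-size cap. In NGINX you would reach for `client_max_body_size` and `limit_req` rather than write this yourself; the class names and the numbers below are assumptions for illustration only.

```python
import time
from dataclasses import dataclass

MAX_BODY_BYTES = 1_000_000  # assumed cap on request body size
RATE_PER_SECOND = 10.0      # assumed token refill rate per client
BURST = 20.0                # assumed bucket capacity per client


@dataclass
class TokenBucket:
    rate: float
    capacity: float
    tokens: float
    updated: float

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, then try to spend one token.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class Gatekeeper:
    """Decides whether a request may be forwarded to the database."""

    def __init__(self) -> None:
        self.buckets = {}

    def check(self, client_id: str, content_length: int, now=None) -> int:
        """Return an HTTP status: 200 forward, 413 too large, 429 throttled."""
        now = time.monotonic() if now is None else now
        if content_length > MAX_BODY_BYTES:
            return 413
        bucket = self.buckets.get(client_id)
        if bucket is None:
            # A new client starts with a full bucket.
            bucket = TokenBucket(RATE_PER_SECOND, BURST, BURST, now)
            self.buckets[client_id] = bucket
        return 200 if bucket.allow(now) else 429
```

A proxy would call `Gatekeeper.check(client_id, content_length)` before forwarding and answer with the returned status when it is not 200.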

At scale there's a decent chance you want a proxy in front of your Couch instance anyway, since Couch is truly multi-master, meaning you probably want to balance your clients across all your nodes.

[1] https://docs.couchdb.org/en/stable/ddocs/ddocs.html#validate...
[2] https://docs.couchdb.org/en/stable/replication/protocol.html...
[3] https://github.com/cloudant-labs/envoy

Thank you for pointing me to validation, I'll check it. It's not completely clear whether it is possible to limit not only particular document but database, or how to handle a conflict if a document is changed in Pouch but rejected by the CouchDB server.

I'm not sure about the current state, but previously it was a problem that the CouchDB file grew until it hit some filesystem limit and CouchDB just crashed.

Start of the envoy readme: it's not battle tested or supported in any way. Also it doesn't do any validation apart from limiting permissions for different users.

It's easier to reimplement CouchDB than to create a smart proxy that will estimate whether a query is expensive or not.

I'm not talking about a rate-limiting proxy or load balancing across different backends, which could be implemented with nginx or something else.

I'm not clear what you mean by limiting "not only particular document but database". As for a document being changed in Pouch and rejected on the server, that's one of two scenarios.

1) The client you wrote is buggy and generated bad data. This scenario can occur just as easily using Postgres and an application server. What does your app server do if a client tries to send bad data? (Answer: Whatever you told it to do. Most likely throwing a 500 when your database refuses the incoming data.)

As for what will happen when Pouch syncs to Couch, the server will let everything else sync, but not the bad document. The return value from the API call will tell you which documents didn't sync.
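
As a rough illustration, here is a Python sketch that picks the rejected documents out of a `_bulk_docs`-style response, where entries that failed carry an `error` and a `reason`. The helper name and the sample payload (ids, rev, reason text) are made up for the example.

```python
from typing import Any, Dict, List, Tuple


def rejected_docs(response: List[Dict[str, Any]]) -> List[Tuple[str, str, str]]:
    """Return (id, error, reason) for every entry the server refused."""
    rejected = []
    for entry in response:
        # Accepted entries have no "error" key; refused ones do.
        if "error" in entry:
            rejected.append(
                (entry.get("id", ""), entry["error"], entry.get("reason", ""))
            )
    return rejected


# Made-up sample: one accepted write, one refused by a validation function.
SAMPLE_RESPONSE = [
    {"ok": True, "id": "recipe:1", "rev": "1-abc"},
    {"id": "recipe:2", "error": "forbidden", "reason": "doc.type is required"},
]

if __name__ == "__main__":
    for doc_id, error, reason in rejected_docs(SAMPLE_RESPONSE):
        print(f"{doc_id}: {error} ({reason})")
```

On the client side, I believe PouchDB reports the same thing through the `denied` event on the replication object, so the app can decide what to do with the local copy.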

2) Someone is intentionally trying to shove bad data into your database. In this case it's worked as advertised and rejected the bad data. What do you care if a malicious client breaks?

What kind of "expensive" query are you envisioning? Mango queries don't support joins, only simple filters, so in general the worst thing someone could do is send a query that doesn't use an index. But why are you letting the client query the server in the first place? Just have the client sync and query client side. Or don't allow access to the _find endpoint and restrict them to the map/reduce views you handwrote.

If you must let them send arbitrary queries (which to me implies a relatively trusted user, but let's pretend they're not), then run the query with a limit of 1 or 0, examine the execution stats to see if they are using an index, and check their query to see if their limit is reasonable. But at this point you've entered into a scenario that's going to be very difficult with a custom API too.
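
A sketch of that kind of check in Python, with one swap: instead of running the query and reading execution stats, it inspects the body returned by the `_explain` endpoint for the same query. The function name, the limit ceiling, and the rule that a fallback to the built-in `_all_docs` index (reported with type `special`) counts as "no index" are my assumptions.

```python
from typing import Any, Dict, Optional

MAX_LIMIT = 100  # assumed ceiling on how many rows a client may request


def vet_mango_query(query: Dict[str, Any], explain: Dict[str, Any]) -> Optional[str]:
    """Return a rejection reason, or None if the query looks acceptable.

    `query` is the Mango query the client sent; `explain` is the JSON body
    obtained from POST /{db}/_explain for that same query.
    """
    limit = query.get("limit")
    valid_int = isinstance(limit, int) and not isinstance(limit, bool)
    if not valid_int or limit < 1 or limit > MAX_LIMIT:
        return f"limit must be between 1 and {MAX_LIMIT}"

    index = explain.get("index", {})
    # Assumption: falling back to _all_docs means no user-defined index
    # matched, so the query would scan every document.
    if index.get("type") == "special" or index.get("name") == "_all_docs":
        return "query does not use an index"
    return None
```

A proxy would call `_explain` first and only forward the query to `_find` when `vet_mango_query` returns `None`.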

> I’m not clear what you mean by limiting "not only particular document but database".

I’ve limited document size to 10 MB and rate-limited updates to 10 per second. A client starts updating a document with random data at 10 requests per second. As far as I understand, Couch stores all versions for at least some time. This means that this one client could fill space on my server at 100 MB/s. There are no such issues with PostgreSQL, and no one allows clients to execute raw queries on the database without an application server. The document is only 10 MB but the database is huge.
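
To make the arithmetic concrete, a back-of-the-envelope sketch in Python. The 500 GB of free disk is an assumed figure, and it assumes every old version stays on disk, i.e. no compaction runs in the meantime.

```python
MB = 1_000_000  # decimal megabyte, in bytes


def fill_rate_bytes_per_s(max_doc_bytes: int, updates_per_s: int) -> int:
    """Worst-case growth when every update writes a maximum-size document."""
    return max_doc_bytes * updates_per_s


def seconds_to_fill(free_bytes: int, rate_bytes_per_s: int) -> float:
    """Time until the free space is used up at the given growth rate."""
    return free_bytes / rate_bytes_per_s


# 10 MB documents at 10 updates per second, as in the scenario above.
RATE = fill_rate_bytes_per_s(10 * MB, 10)
# Assumed: 500 GB of free disk.
SECONDS = seconds_to_fill(500_000 * MB, RATE)

if __name__ == "__main__":
    print(f"{RATE / MB:.0f} MB/s, disk full after {SECONDS / 60:.0f} minutes")
```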

> What kind of "expensive" query are you envisioning?

I have never used Couch, so I don’t know what could be expensive. Maybe some lookup without an index or something like that.

Sorry for my ignorance, is it true that if I limit Couch to replication only, there will not be any non-indexed lookups?

It looks like implementing a secure system with Couch is very hard, and I can’t find any best practices, mostly only authentication and basic validation.

> I’ve limited document size to 10 MB and rate-limited updates to 10 per second. A client starts updating a document with random data at 10 requests per second. As far as I understand, Couch stores all versions for at least some time. This means that this one client could fill space on my server at 100 MB/s. There are no such issues with PostgreSQL, and no one allows clients to execute raw queries on the database without an application server. The document is only 10 MB but the database is huge.

Ah! Now we are getting somewhere! You're concerned about someone filling your disk.

OK, let's modify your scenario a little. Instead of updating an existing document, they create a new document. This is a malicious client; why do updates that'll get cleaned up in a few minutes when I can make it permanent?

So, CouchDB allows these writes, and now your disk is full.

What does Postgres with a custom API do? Allows these writes, and now your disk is full.

You're allowing 10MB documents because that makes sense for your application, right? So your Postgres table is going to have a binary column or some other column meant to hold bulk data, and your API is going to accept it.

If it doesn't make sense, lower the max document size. Apply validations to limit what fields can be written to, and how big they can be. In Postgres this is called your "schema". Couch being "schemaless", it's now your validation function. Couch is no different from any other schemaless database such as Mongo, RethinkDB and FoundationDB in this regard.

Also, your rate limiting here is weak. If I can post to your server at 100Mb/s, I can saturate a 1Gb link with only 10 clients. It doesn't matter if you reject my posts; if I can send them to the server, I can DoS you pretty easily.

The main thing Postgres gives you here is that it requires you to define your schema upfront (unless you use JSON columns, in which case it joins the schemaless club above). Couch will happily let you not. So someone wants to write a record of their car maintenance into your recipe book app? Couch is good with that. But take a step back. What actually stops them from putting that in the "description" column of your Postgres recipe app? Not much. So you have to think about what's important. Do I actually need to make sure these are all the same "shape"? If so, I need a validation function. If I can just shrug and say "garbage in, garbage out", then I just need controls around how much data they can insert, but hey, I needed that for Postgres anyway.

> Sorry for my ignorance, is it true that if I limit Couch to replication only, there will not be any non-indexed lookups?

Correct (enough). The entirety of CouchDB is built around efficient replication. While it's not going to use a formal "index", getting all of the changes after a specific rev is an efficient operation.

It’s trivial to limit the number of created documents in Postgres, CouchDB or an application server through validation. I’m talking about updating a document, not creating a new one. In Postgres, if I update a 1 MB document, the used space will not always grow. In CouchDB the situation is different. With a relational DB you have an application server with custom logic and validations; CouchDB, on the other hand, is accessible from outside.

My point is that it’s very hard to create a safe CouchDB-based system, and most recommendations are limited to setting up an nginx proxy and authenticating users, which is not enough.

Also, as far as database size goes, I don't believe there is a hard limit. I think you might be thinking of when MongoDB would silently corrupt databases larger than 2GB on its 32-bit version.

As far as I remember it was a filesystem limit, not a CouchDB limit. The problem was that the file always grew and CouchDB crashed when the limit was exceeded. I can't find the particular issue, but googling shows some issues [1] that make me think we should be very careful with DB size.

[1] https://stackoverflow.com/questions/40752578/couchdb-views-c...

You resolve that issue the same way you would resolve the same issue if you were using Postgres - you introduce some back-end.

For your example specifically I'd use a proxy.

A custom backend means no synchronisation and no advantages over Postgres.

Do you propose to create a proxy that parses the query and estimates its complexity? I think this task is at least as hard as implementing CouchDB myself (actually harder).

Is there any secure open source code with pouchdb/couchdb integrations?

Your backend can be a reverse proxy that authenticates requests then passes them off to CouchDB (or PouchDB, since that also runs on the server). I have an example up @ https://github.com/daleharvey/noted. The server is 200 lines and does signup / email authentication etc.

This server can't prevent an authenticated user from uploading a huge document or running an expensive query.

Any reverse proxy can limit the size of a document upload. Even just plain NGINX can do that. Just set the client max body size.

As for queries, it kind of depends on your model. Mango queries are pretty limited (no joins, no arbitrary filters), so it's not necessarily as easy as you think to write one that hoses performance. A client could of course write one that doesn't use an index, which may or may not be a concern.

An easy option, if it is a concern, is to just not expose the `_find` endpoint, which effectively limits your users to the map/reduce queries you've written (unless you give them admin, they don't have the ability to create their own).

A popular model is for the clients to run the queries locally; the server doesn't need to expose any query endpoints, only the ones necessary for replication.
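
As a rough sketch of "only the ones necessary for replication", here is a Python allowlist a proxy could check before forwarding a request. The endpoint list is my reading of the replication protocol plus `_bulk_get`; it is an assumption, probably not exhaustive, and should be checked against the protocol docs and your client's actual traffic. It also assumes the database name is the first path segment with no URL prefix.

```python
import re

DB = r"[^_/][^/]*"   # database segment; names starting with "_" are refused
DOC = r"[^_/][^/]*"  # ordinary document id (no _design or _local documents)

# (method, path pattern) pairs assumed to be needed for replication.
_ALLOWED = [
    ("HEAD", rf"^/{DB}/?$"),
    ("GET", rf"^/{DB}/?$"),
    ("GET", rf"^/{DB}/_local/[^/]+$"),
    ("PUT", rf"^/{DB}/_local/[^/]+$"),
    ("GET", rf"^/{DB}/_changes$"),
    ("POST", rf"^/{DB}/_changes$"),
    ("POST", rf"^/{DB}/_revs_diff$"),
    ("POST", rf"^/{DB}/_bulk_docs$"),
    ("POST", rf"^/{DB}/_bulk_get$"),
    ("POST", rf"^/{DB}/_ensure_full_commit$"),
    ("GET", rf"^/{DB}/{DOC}$"),
    ("PUT", rf"^/{DB}/{DOC}$"),
]
_COMPILED = [(method, re.compile(pattern)) for method, pattern in _ALLOWED]


def is_replication_request(method: str, path: str) -> bool:
    """True if the method and path match the replication allowlist."""
    path = path.split("?", 1)[0]  # ignore the query string
    method = method.upper()
    return any(
        method == allowed and pattern.match(path) is not None
        for allowed, pattern in _COMPILED
    )
```

With this in place `POST /userdb/_find` and any `_design/.../_view/...` request are refused, while `_changes`, `_revs_diff` and `_bulk_docs` pass through.
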
I wasn't worried about that since it is a basic proof of concept; adding that would make it ~210 lines of code.

There are plenty of proxies, like nginx, that do that with some config. Even if you were using a relational database with a backend you’d still have to solve the same problem.

If I use a backend I can put all the validation logic in the application server. But in this case there is no automatic synchronisation.

One of the major selling points of CouchDB is its replication protocol for client-server data syncing. When you design a product with Postgres, you don't allow clients to execute raw SQL queries without an application server. But it looks like that is the recommended way to update data in the CouchDB world if you want synchronisation. I can't understand how this architecture can be secure.

CouchDB has options for controlling which documents are replicated. This may help depending on your use case.
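
For example, a replication request can carry a `selector` or a list of `doc_ids` to narrow what gets copied. Here is a minimal Python sketch that builds such a request body; the helper name, URLs and selector are hypothetical.

```python
from typing import Any, Dict, Iterable, Optional


def build_replication_request(
    source: str,
    target: str,
    selector: Optional[Dict[str, Any]] = None,
    doc_ids: Optional[Iterable[str]] = None,
) -> Dict[str, Any]:
    """Build a body for POST /_replicate (or a _replicator document)."""
    body: Dict[str, Any] = {"source": source, "target": target}
    if selector is not None:
        # Only documents matching this Mango selector are replicated.
        body["selector"] = selector
    if doc_ids is not None:
        # Only the listed document ids are replicated.
        body["doc_ids"] = list(doc_ids)
    return body


# Hypothetical example: copy only recipe documents to a second database.
EXAMPLE = build_replication_request(
    "http://localhost:5984/recipes",
    "http://localhost:5984/recipes_public",
    selector={"type": "recipe"},
)
```
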
nginx couldn't solve the "execute expensive query" problem though, right? It can only limit max size. I guess you could do a request timeout + blacklist, but that would also be hard to do right, since at heavy load some proper clients might get blacklisted.