	by erik_seaberg 2504 days ago
	A lot of people who believe only one app (or one language) accesses their org's datastore are mistaken. You have to take extreme measures to prevent ad hoc uses from popping up.

philwelch 2504 days ago

Yes, yes, yes.

Why is this the case?

1. If you are doing anything interesting, people are going to ask questions about what you are doing, and the best way to answer those questions is going to be by querying your database.

2. One day you might want to rewrite some of your service/s, split them into microservice/s, etc. At that point, there will be a minimum of two services talking to your datastore: the legacy service and whatever you're replacing it with. I suspect any alternative to this arrangement will be an even worse idea, e.g. taking a deliberate outage to perform a likely-irreversible migration.

zbentley 2504 days ago

> One day you might want to rewrite some of your service/s, split them into microservice/s, etc. At that point, there will be a minimum of two services talking to your datastore.

You should not do this. It removes almost all of the benefits of extracting things into a separate service (services should own their data and the only means of accessing it should be via their APIs). That's not utopian; that's one of the main reasons you do a service extraction in the first place.

philwelch 2504 days ago

Right, so let’s suppose you already segmented the data to two different backing datastores, and your monolith is now connecting to both of them instead of just the one. Now you can do the service migration, at which point you still run into the situation I’m discussing.

zbentley 2503 days ago

Cutovers are hard, to be sure. Ideally they should also be short (the time a service undergoing mitosis spends talking to both the old and new locations should be measured in days, hours, or less).

Don't choose general data access patterns for the infrequent occurrence of cutover. Cutover is when you break a few rules and then immediately stop doing so. Build for everyday access patterns instead (which should be through the API of whatever owns the data--SQL is a powerful language and a really shitty API).
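To make "through the API of whatever owns the data" concrete, here is a minimal sketch, assuming an in-memory SQLite database as a stand-in for the real datastore. The `AccountService` class, the `accounts` table, and both methods are hypothetical names invented for illustration; nothing in the thread describes an actual schema.

```python
import sqlite3
from typing import Optional


class AccountService:
    """Hypothetical owner of the accounts data; callers never see SQL."""

    def __init__(self) -> None:
        # Private connection: nothing outside this class holds a handle to it.
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"
        )

    def create_account(self, email: str) -> int:
        """Insert an account and return its id."""
        with self._db:  # commits on success, rolls back on exception
            cur = self._db.execute(
                "INSERT INTO accounts (email) VALUES (?)", (email,)
            )
        return cur.lastrowid

    def get_email(self, account_id: int) -> Optional[str]:
        """Return the email for an account, or None if it does not exist."""
        row = self._db.execute(
            "SELECT email FROM accounts WHERE id = ?", (account_id,)
        ).fetchone()
        return row[0] if row else None


service = AccountService()
new_id = service.create_account("a@example.com")
print(service.get_email(new_id))
```

Since callers only ever reach the table through those two methods, the owner can reshape the schema without coordinating with every consumer, which is roughly the benefit being argued for here.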

philwelch 2503 days ago

Stored procedures are a better API than arbitrary SQL. You may even be able to enforce it by granting EXECUTE permissions but not SELECT permissions.

vips7L 2504 days ago

The simple solution to 1 is to never allow direct database access. Api only.

pnako 2504 days ago

Of course. But surely you don't let anyone access your API, and you put it behind another API, right? Just in case you need to change that first API without breaking all the users.

xwolfi 2504 days ago

Never even tell anyone you have one, or else the founder will pat one of your most junior devs on the back and ask if he can give DB access to that other team who needs to make money :D

philwelch 2504 days ago

So you do all your analytics by running a series of service calls and then writing a script to collate them into the needed results? Seriously?

zbentley 2503 days ago

I'm not the GP, but yes, absolutely. There are plenty of things that make this less than awful:

- The existence of tools that allow structured access to multiple APIs (GraphQL is a nice middle ground between "YOLO any queries you want" and "you only get row-by-row access exposed by the web APIs").

- The existence of data on multiple internal data stores. Analytics folks usually are not prepared to engage with the complexity of data being stored across handfuls or more of different stores with different schemas. The owner of the application knows how to join that stuff better than they do.

- Building intermediate/denormalized stores isn't frowned upon just because analytics shouldn't run ad hoc queries on the main production DBs. Expose change streams or bulk ("too much" data) endpoints and make it easy to load their results into a reporting system, which can be raw SQL. It's not redundant; if you don't do this, the following conversation starts to happen often: Q: "I'm running raw analytics queries on production and it's not quite working, can we just make $substantial_schema_change so my report works/is fast?" A: "No, we explicitly chose not to structure the DB/index/whatever like that because it seriously fucks up a real user access pattern."
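As a rough illustration of the change-stream idea in the last bullet, here is a sketch, assuming the owning service publishes simple upsert/delete events and that the reporting system is a separate SQL store (an in-memory SQLite database here). The event shape, the `orders_flat` table, and `apply_change` are all hypothetical.

```python
import sqlite3

# Hypothetical change events, as the owning service might publish them.
changes = [
    {"op": "upsert", "order_id": 1, "customer": "alice", "total_cents": 1200},
    {"op": "upsert", "order_id": 2, "customer": "bob", "total_cents": 500},
    {"op": "upsert", "order_id": 1, "customer": "alice", "total_cents": 1500},
    {"op": "delete", "order_id": 2},
]

# Separate reporting store: analysts can run raw SQL here, not on production.
reporting = sqlite3.connect(":memory:")
reporting.execute(
    "CREATE TABLE orders_flat ("
    "order_id INTEGER PRIMARY KEY, customer TEXT, total_cents INTEGER)"
)


def apply_change(db: sqlite3.Connection, event: dict) -> None:
    """Apply one change event to the denormalized reporting table."""
    if event["op"] == "upsert":
        db.execute(
            "INSERT OR REPLACE INTO orders_flat VALUES (?, ?, ?)",
            (event["order_id"], event["customer"], event["total_cents"]),
        )
    elif event["op"] == "delete":
        db.execute(
            "DELETE FROM orders_flat WHERE order_id = ?", (event["order_id"],)
        )
    else:
        raise ValueError(f"unknown op: {event['op']!r}")


for event in changes:
    apply_change(reporting, event)
reporting.commit()

rows = reporting.execute(
    "SELECT customer, total_cents FROM orders_flat ORDER BY order_id"
).fetchall()
print(rows)
```

The reporting table can be indexed or reshaped however the analysts like, since no user-facing access pattern depends on it.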

philwelch 2503 days ago

Forcing analytics to go through the API doesn’t actually reduce load on the production DB, it just increases load on the API itself. Step 1 should probably be a dedicated read replica and step 2 should probably be an ETL process.
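A minimal sketch of those two steps, assuming SQLite connections stand in for both the read replica and the analytics warehouse. The `users`/`orders` schema, the `revenue_by_country` table, and `run_etl` are hypothetical; a real setup would point `replica` at an actual replica rather than populating it inline.

```python
import sqlite3

# Stand-in for a read replica of production (populated here only for the demo).
replica = sqlite3.connect(":memory:")
replica.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY, user_id INTEGER, total_cents INTEGER
    );
    INSERT INTO users VALUES (1, 'US'), (2, 'DE');
    INSERT INTO orders VALUES (10, 1, 1000), (11, 1, 250), (12, 2, 400);
    """
)

# Stand-in for the warehouse that analysts query.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE revenue_by_country ("
    "country TEXT PRIMARY KEY, total_cents INTEGER)"
)


def run_etl(source: sqlite3.Connection, target: sqlite3.Connection) -> int:
    """Extract from the replica, aggregate, and fully reload the target table."""
    rows = source.execute(
        "SELECT u.country, SUM(o.total_cents) "
        "FROM orders o JOIN users u ON u.id = o.user_id "
        "GROUP BY u.country"
    ).fetchall()
    with target:  # one transaction: readers never see a half-loaded table
        target.execute("DELETE FROM revenue_by_country")
        target.executemany(
            "INSERT INTO revenue_by_country VALUES (?, ?)", rows
        )
    return len(rows)


loaded = run_etl(replica, warehouse)
report = dict(
    warehouse.execute(
        "SELECT country, total_cents FROM revenue_by_country"
    ).fetchall()
)
print(loaded, report)
```

The expensive join runs against the replica on a schedule, so ad hoc analyst queries only ever touch the warehouse table.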

yowlingcat 2503 days ago

Ding ding ding. Dedicated read replica and an ETL gets you to a point where queries don't bring down prod. If you have an analyst org running wild making bad decisions about data that they think says things it doesn't -- that's probably a good sign that it's time for a dedicated data engineering team, and potentially a BI flavored data science team as well.

troxwalt 2504 days ago

What do you mean by this? What other way, other than digging right into the data, is there to access the database? Isn't it all through APIs?
