I'm not the GP, but yes, absolutely. There are plenty of things that make this less than awful:
- The existence of tools that allow structured access to multiple APIs (GraphQL is a nice middle ground between "YOLO any queries you want" and "you only get row-by-row access exposed by the web APIs").
- The existence of data on multiple internal data stores. Analytics folks usually are not prepared to engage with the complexity of data being stored across handfuls or more of different stores with different schemas. The owner of the application knows how to join that stuff better than they do.
- Building intermediate/denormalized stores isn't frowned upon just because analytics shouldn't run ad hoc queries on the main production DBs. Expose change streams or bulk ("too much" data) endpoints and make it easy to load their results into a reporting system, which can be raw SQL. It's not redundant; if you don't do this, the following conversation starts to happen often: Q: "I'm running raw analytics queries on production and it's not quite working, can we just make $substantial_schema_change so my report works/is fast?" A: "No, we explicitly chose not to structure the DB/index/whatever like that because it seriously fucks up a real user access pattern."
Forcing analytics to go through the API doesn’t actually reduce load on the production DB, it just increases load on the API itself. Step 1 should probably be a dedicated read replica and step 2 should probably be an ETL process.
Ding ding ding. Dedicated read replica and an ETL gets you to a point where queries don't bring down prod. If you have an analyst org running wild making bad decisions about data that they think says things it doesn't -- that's probably a good sign that it's time for a dedicated data engineering team, and potentially a BI flavored data science team as well.
Analytics queries bringing down prod seems . . . pretty amateur hour. I'm more interested in whether or not analytics queries actually get the data they're interested in when they want it. The reporting team is likely not better versed in what means what than the developers who work on the application databases. What about multiple internal DBs that reporting wants to analyze as if they were one? What about schemas that change over time, obsoleting the analytics team's assumptions? Reliable, versioned data access APIs address both of those families of problems. Yes, it's harder than "YOLO query prod". It also works for longer without breaking, and jives with the scale out plan (usually discrete APIs, sharding, and then maybe microservices and more families of APIs if you're mature enough).
> Reliable, versioned data access APIs address both of those families of problems.
They only address it in so far as they push it downstream to the analyst, who as you mentioned, "is likely not better versed in what means what than the developers who work on the application databases."
There's a reason why datalakes exist, and having used them at past N companies, I think this is why data engineering of the BI flavor becomes necessary at the point that reporting becomes critical. An API is strictly worse than a datalake, and it's not hard to set up and maintain the latter. API versioning and communication are for frontend integrations, paid partner integrations and potentially (although I'd probably lean more on gRPC and the ilk) microservice to microservice interactions. But, I like to avoid building unnecessary API surface when I can.
- The existence of tools that allow structured access to multiple APIs (GraphQL is a nice middle ground between "YOLO any queries you want" and "you only get row-by-row access exposed by the web APIs").
- The existence of data on multiple internal data stores. Analytics folks usually are not prepared to engage with the complexity of data being stored across handfuls or more of different stores with different schemas. The owner of the application knows how to join that stuff better than they do.
- Building intermediate/denormalized stores isn't frowned upon just because analytics shouldn't run ad hoc queries on the main production DBs. Expose change streams or bulk ("too much" data) endpoints and make it easy to load their results into a reporting system, which can be raw SQL. It's not redundant; if you don't do this, the following conversation starts to happen often: Q: "I'm running raw analytics queries on production and it's not quite working, can we just make $substantial_schema_change so my report works/is fast?" A: "No, we explicitly chose not to structure the DB/index/whatever like that because it seriously fucks up a real user access pattern."