Hacker News new | ask | show | jobs
by hannofcart 1061 days ago
One of the better decisions we took at my firm was to not allow direct access to any production DB to analytics visualization tools like Metabase and Redash.

Always write your analytics data to a separate DB in a periodically run job. Only store aggregated anonymized data in the analytics DB you expose to internal stakeholders via tools like Metabase.

5 comments

Also your production database is optimized for different workloads than your analytics database.

Usually production is used for fetching and updating a small number of records at a time (think updating a shopping cart), and has strict latency requirements whereas analytics involves reading a large amount of data in columns (think count group by one or two columns), and can be done in batches where the results can get a more and more stale until the next batch runs.

How do you batch write the results (say updating shopping carts) when frontend has to reflect whats in the database?
They're talking about moving data between two different back end databases. Your production database optimized for your application/latency.

Then you have your warehouse database that you updated once a day with information from prod.

That's a great idea and it articulates something I have thought about the whole "use boring tech" things (which I support). It doesn't preclude letting people use the shiny new thing. You can always let them plug it in and use it. But the core of the system should be as simple as possible and based on thoroughly understood tech (from the point of view of the team in question/accessible labor market).
I tend to discuss things in terms of the trunk, branch, and leaves.

Mostly in that the leaves of your system (parts that nothing else connects to or builds on) are generally a low risk place to try new things sometimes. If you do run into any intractable issues, it’s also an easy spot to pluck it off and replace it.

Worth pointing out that we recently discovered an RCE in RestrictedPython that affects Redash: https://github.com/zopefoundation/RestrictedPython/security/...

This should further emphasize the need to isolate these tools and ensure they are only accessible to people who need them.

Exactly right -- we do all of that, and even then tightly control and audit who has access to the anonymized, aggregated, read-only data cube.
What kind of tooling do you/people use for that? Or just custom scripts?
Look up OLTP vs OLAP data stores to get an idea. There are a lot of common patterns for the specifics of implementing this. Usually you run a regularly scheduled job that dumps data representing some time period (e.g. daily jobs). There are some considerations for late arriving data, which is a classic DE interview question, but for the most part, big nightly dumps of the last day’s data/transactions/snapshots to date-partitioned columnar stores using an orchestration engine like Airflow is sufficient for 99% of use cases.
Tangent: I hate OLTP and OLAP as acronyms. They're only one letter/word off and completely obscure the relevant meaning lots of semantic noise. Just say transactional vs analytical processing. (They are still good search key terms because lots of existing literature/resources use the terms)
(not the person you're replying to)

I can't recommend any specific tools without knowing a lot about the environment, but if you're looking for terms to google: ELT (Extract, Load, Transform) and CDC (Change Data Capture) will give you a sense of the landscape.

edit: the sibling comment that mentions Airflow is a good answer for an example of an ELT workflow.

Don't Maria, Postgres, etc make replication pretty easy?