by tang8330 1053 days ago
Thanks for your feedback!

> Something McKinley doesn't address is that it's quite advantageous if the values in your data warehouse don't change intra-day because this lets business users reach consensus. Whereas if Bob runs a report and gets $X, and Alice runs the same report 5 minutes later and gets $Y, that creates confusion (much more than you would expect). I recall a particular system I built that refreshed every 6 hours (limited by upstream), that eventually Marketing asked me to dial back to every 24 hours because they couldn't stand things changing in the middle of the day.

If they want to see a consistent view of the report, you could bound the query by date so that results only change at the day boundary.

1/ SELECT * FROM FOO WHERE DATE_TRUNC('day', updated_at) < DATE_TRUNC('day', DATEADD(day, -1, CURRENT_DATE()));

If your dataset doesn't contain kv, you can turn on `artie_updated_at`, which adds a column with the updated_at timestamp to support incremental ingestion.

2/ If you had stateful data, you could also explore creating a Snowflake task and leveraging the Time Travel feature to create a "snapshot" if your workload depends on it.

3/ Also, if you _did_ want this to be more lagged, you can actually increase the flushIntervalSeconds [1] to 6h, 24h, whichever time interval you fancy. You as the customer should have maximum flexibility when it comes to when to flush to DWH.

4/ You can also choose to refresh the analytical report on Looker / Mode to be daily. [2]
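
To make the day-boundary idea in 1/ concrete, here is a rough Python sketch of the same filter. It is only an illustration: the rows, ids, and the `visible_rows` helper are made up, and it mirrors the query as written above, so only rows last updated before yesterday are visible.

```python
from datetime import datetime, timedelta

def visible_rows(rows, now):
    """Keep rows whose updated_at day is strictly before yesterday.

    Hypothetical helper mirroring:
    DATE_TRUNC('day', updated_at) < DATE_TRUNC('day', DATEADD(day, -1, CURRENT_DATE()))
    """
    cutoff = now.date() - timedelta(days=1)
    return [r for r in rows if r["updated_at"].date() < cutoff]

# Made-up sample data; row 4 lands between Bob's and Alice's runs.
rows = [
    {"id": 1, "updated_at": datetime(2023, 5, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2023, 5, 2, 23, 59)},
    {"id": 3, "updated_at": datetime(2023, 5, 3, 8, 30)},
    {"id": 4, "updated_at": datetime(2023, 5, 4, 10, 2)},
]

bob = visible_rows(rows, datetime(2023, 5, 4, 10, 0))
alice = visible_rows(rows, datetime(2023, 5, 4, 10, 5))

# Both runs see the same rows even though the table changed in between.
print([r["id"] for r in bob], [r["id"] for r in alice])
```

Because the cutoff only moves at midnight, two runs on the same day return the same rows no matter how fresh the underlying table is.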

> Now of course I see you're targeting more real-time use cases like fraud detection. That's great! But why you would run a fraud detection process out of your data warehouse, which likely doesn't even have a production-grade uptime SLA? Run it out of your production database, that's what it's for!

You can certainly do this in the production DB (that was our original hypothesis as well!). However, after talking to more companies, it has become clearer to us that folks running fraud algos actually want to join this data across various data sets. Further, a DWH provides a nice visualization layer on top.

Of course, you could go with something even more bespoke by utilizing real-time DBs such as Materialize / Rockset / RisingWave. That just comes with trade-offs, such as increased architectural complexity.

There are also plenty of additional use cases this can unlock. Given that the DWH is a platform, any post-DWH application, such as reverse ETL, can benefit from less lag.

[1] https://docs.artie.so/running-transfer/options

[2] https://mode.com/help/articles/report-scheduling-and-sharing...

2 comments

I think you missed the parent's point - your USP is real-time replication. So everything you're proposing makes it not real time. Your USP is now worthless (in that context) and your competitors are numerous.
Hm, perhaps I wasn't being clear, apologies for that.

What I am proposing above are ways to provide a view to teams that do not want real-time data while keeping your underlying dataset real-time.

Huh? The parent's point was your underlying dataset is always in real-time. There's no issue querying a data warehouse when all you're doing is looking for a simple transactional report.
I think their point is they have a real-time warehouse that can also be used in “stale snapshot” mode.
I agree with the parent's point. I also don't think DWH is the primary use case for your platform.

I have seen architectures where databases are siloed within departments and data has to be replicated across the departments' physical databases, on the same network or on different ones, mostly in banks, insurers, and other old-school industries. In this scenario, a daily batch job would replicate and populate the tables and kick-start business processes. A platform like this would make sense there. Another use case I can think of is reverse ETL, but there are many tools custom-made for that.

As for fraud analysis, there are many vendor tools that do exactly that; asking people to visualize and implement a full-blown use case is hard.

I might be naive, but from a distance I don't see the USP of Artie compared with Airbyte, Hevo Data, Fivetran, Stitch, and others.