| There's a lot more to ML training data generation than snapshots or timestamped columns. We often have windowed aggregations that need to be computed as of precise intra-day timestamps in order to achieve parity between training data (backfilled in batch) and the data that is served online in real time (with streaming aggregations computed in real time). Standard OLAP solutions right now are really good at "What's the X day sum of this column as of this timestamp", but when every row of your training data has a precise intra-day timestamp that you need windowed aggregations to be accurate as-of, this is a different challenge. And when you have many people sharing these aggregations, but with potentially different timestamps/timelines, you also want them sharing partial aggregations where possible for efficiency. All of this is well beyond the scope that is addressed by standard OLAP data solutions. Not to mention the fact that the offline computation needs to translate seamlessly to power online serving (i.e. seeding feature values and combining them with streaming real-time aggregations), and the need for online/offline consistency measurement. That's why a lot of teams don't even bother with this, and basically just log their feature values from online to offline. But this limits what kind of data they can use, and also how quickly they can iterate on new features (you need to wait for enough log data to accumulate before you can train). |
As long as your OLAP table/projection/materialized view is sorted/clustered by that timestamp, it will be able to efficiently pick only the data in that interval for your query, regardless of the precision you need.
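To make that concrete, here is a minimal Python sketch of why sort order is what matters: with events sorted by timestamp, a windowed sum as of any timestamp is two binary searches plus a prefix-sum lookup. This is a toy in-memory stand-in, not how any particular OLAP engine implements it. The names `build_index` and `window_sum_as_of`, integer-second timestamps, and the `(as_of - window, as_of]` boundary convention are all assumptions for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def build_index(events):
    """Sort (timestamp, value) pairs and precompute prefix sums of the values.

    Timestamps are assumed to be integer seconds (hypothetical convention).
    """
    ordered = sorted(events)
    timestamps = [ts for ts, _ in ordered]
    prefix = [0] + list(accumulate(value for _, value in ordered))
    return timestamps, prefix


def window_sum_as_of(timestamps, prefix, as_of, window):
    """Sum of values whose timestamp lies in (as_of - window, as_of].

    The half-open boundary is an assumption; a training pipeline may
    prefer to exclude the as_of instant itself to avoid leakage.
    """
    lo = bisect_right(timestamps, as_of - window)  # first ts > as_of - window
    hi = bisect_right(timestamps, as_of)           # first ts > as_of
    return prefix[hi] - prefix[lo]


# Each training row can carry its own as-of timestamp.
timestamps, prefix = build_index([(10, 1), (20, 2), (30, 4), (40, 8)])
row_features = [window_sum_as_of(timestamps, prefix, t, 20) for t in (30, 35, 40)]
```

The lookup cost depends only on the size of the sorted index, not on how fine-grained the as-of timestamps are.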
> And when you have many people sharing these aggregations, but with potentially different timestamps/timelines, you also want them sharing partial aggregations where possible for efficiency.
> All of this is well beyond the scope that is addressed by standard OLAP data solutions.
I think StarRocks, the open-source OLAP DB, supports this through a query rewrite mechanism that speeds up queries by reading from materialized views. It can build UNION queries to handle date ranges [1].
[1] https://docs.starrocks.io/docs/using_starrocks/query_rewrite...
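As a rough illustration of the general idea (not of StarRocks's actual implementation), here is a Python sketch where whole-hour partial sums are computed once and shared, and each query stitches them together with raw rows at the window edges, which is conceptually what a UNION of a pre-aggregated range and a base-table range does. The hourly `BUCKET` width and the names `build_buckets` and `window_sum` are hypothetical.

```python
from collections import defaultdict

BUCKET = 3600  # assumed bucket width in seconds (hourly partial aggregates)


def build_buckets(events):
    """Pre-aggregate (timestamp, value) pairs into per-bucket partial sums."""
    buckets = defaultdict(int)
    for ts, value in events:
        buckets[ts // BUCKET] += value
    return dict(buckets)


def window_sum(events, buckets, start, end):
    """Sum of values with start <= ts < end.

    Whole buckets inside the window come from the shared partial sums;
    only the ragged head and tail are read from raw events.
    """
    first_full = -(-start // BUCKET)  # ceiling division: first bucket fully inside
    last_full = end // BUCKET         # exclusive: first bucket not fully inside
    if first_full >= last_full:
        # No whole bucket fits in the window, so fall back to raw rows.
        return sum(v for ts, v in events if start <= ts < end)
    middle = sum(buckets.get(b, 0) for b in range(first_full, last_full))
    head_end = first_full * BUCKET
    tail_start = last_full * BUCKET
    # A linear scan keeps the sketch short; a real store would range-scan.
    edges = sum(
        v for ts, v in events
        if start <= ts < head_end or tail_start <= ts < end
    )
    return middle + edges


events = [(1000, 1), (2000, 2), (4000, 4), (7000, 8), (8000, 16), (9500, 32)]
buckets = build_buckets(events)
# Two consumers with different timelines reuse the same hourly partials.
sum_a = window_sum(events, buckets, 1800, 9000)
sum_b = window_sum(events, buckets, 500, 7300)
```

The partial sums are independent of any one consumer's timestamps, so they can be shared; only the edge scans differ per query.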