| Let's try an example: `average page views in the last 1, 7, 30, 60, 180 days`. You need these values accurate as of ~500k timestamps for 10k different page ids, with significant skew for some page ids. So you have a "left" table with 500k rows, each with a page id and timestamp. Then you have a `page_views` table with many millions or billions of rows that need to be aggregated. Sure, you could backfill this with SQL and fancy window functions. But let's look at what you would need to do to actually make this work, assuming you wanted it served online with realtime updates (from a `page_views` Kafka topic that is the source of the `page_views` table):

For online serving:
1. Decompose the batch computation to SUM and COUNT and seed the values in your KV store
2. Write the streaming job that does realtime updates to your SUMs/COUNTs.
3. Have an API for fetching and finalizing the AVERAGE value.

For backfilling:
1. Write your verbose query with windowed aggregations (I encourage you to actually try it).
2. Often you also want a daily front-fill job for scheduled retraining. Now you're also thinking about how to reuse previous values. Maybe you reuse your decomposed SUMs/COUNTs above, but if so you're now orchestrating these pipelines.

For making sure you didn't mess it up:
1. Compare logs of fetched features to backfilled values to make sure that they're temporally consistent.

For sharing:
1. Let's say other ML practitioners are also playing around with this feature, but with different timelines (i.e. different timestamps). Are they redoing all of the computation? Or are you orchestrating caching and reusing partial windows?

So you can do all that, or you can write a few lines of Python in Chronon. Now let's say you want to add a window. Or say you want to change it so it's aggregated by `user_id` rather than `page_id`. Or say you want to add aggregations other than AVERAGE. You can redo all of that again, or change a few lines of Python. |
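To make the online-serving steps above concrete, here is a minimal sketch of the hand-rolled version: AVERAGE is decomposed into per-day SUM and COUNT partials, a streaming handler updates them, and a fetch call finalizes the averages. Everything here is an assumption for illustration: the `WindowedAverageStore` class and its methods are hypothetical, a dict stands in for the KV store, and windows are aligned to whole days rather than exact to the millisecond. This is not Chronon's API.

```python
from collections import defaultdict

DAY_MS = 24 * 60 * 60 * 1000
WINDOWS_DAYS = (1, 7, 30, 60, 180)


class WindowedAverageStore:
    """Hypothetical in-memory stand-in for a KV store of [sum, count] partials."""

    def __init__(self):
        # page_id -> day index -> [sum of views, number of events]
        self._buckets = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))

    def seed(self, page_id, day, total, count):
        """Step 1: load a batch-computed SUM/COUNT partial for one day."""
        self._buckets[page_id][day] = [float(total), int(count)]

    def update(self, page_id, ts_ms, views):
        """Step 2: apply one streaming event to its daily partial."""
        bucket = self._buckets[page_id][ts_ms // DAY_MS]
        bucket[0] += views
        bucket[1] += 1

    def fetch(self, page_id, as_of_ms):
        """Step 3: merge partials per window and finalize AVERAGE = SUM / COUNT."""
        today = as_of_ms // DAY_MS
        partials = self._buckets.get(page_id, {})
        result = {}
        for days in WINDOWS_DAYS:
            total, count = 0.0, 0
            for day, (s, c) in partials.items():
                # Day-aligned window: the last `days` day buckets, including today.
                if today - days < day <= today:
                    total += s
                    count += c
            result[f"avg_{days}d"] = total / count if count else None
        return result
```

Even this toy version leaves out everything else on the list: the backfill query, the consistency check between fetched and backfilled values, and reuse of partials across users.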
Isn’t this just a table with 5bn rows of timestamp, page_type, page_views_t1d, page_views_t7d, page_views_t30d, page_views_t60d, and page_views_t180d? You can even compute this incrementally or in parallel by timestamp and/or page_type.
What’s the magic Chronon is doing?
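For reference, the table described in this reply can be sketched naively as a point-in-time computation over in-memory data. This is only a sketch under stated assumptions: `backfill_averages` and its row shapes are hypothetical, events are keyed by page id, windows cover events at or after `ts - window` and strictly before `ts`, and "average" means the mean of a per-event `views` value.

```python
from bisect import bisect_left
from collections import defaultdict
from itertools import accumulate

DAY_MS = 24 * 60 * 60 * 1000
WINDOWS_DAYS = (1, 7, 30, 60, 180)


def backfill_averages(left_rows, page_views):
    """left_rows: (page_id, ts_ms) pairs; page_views: (page_id, ts_ms, views) triples."""
    by_page = defaultdict(list)
    for page_id, ts, views in page_views:
        by_page[page_id].append((ts, views))

    # Per page: sorted event times plus prefix sums of views for O(log n) lookups.
    index = {}
    for page_id, events in by_page.items():
        events.sort()
        times = [ts for ts, _ in events]
        prefix = [0] + list(accumulate(v for _, v in events))
        index[page_id] = (times, prefix)

    out = []
    for page_id, ts in left_rows:
        times, prefix = index.get(page_id, ([], [0]))
        hi = bisect_left(times, ts)  # only events strictly before the query time
        row = {"page_id": page_id, "ts": ts}
        for days in WINDOWS_DAYS:
            lo = bisect_left(times, ts - days * DAY_MS)
            n = hi - lo
            row[f"avg_{days}d"] = (prefix[hi] - prefix[lo]) / n if n else None
        out.append(row)
    return out
```

This covers the backfill only; it says nothing about realtime updates or checking that served values match the backfilled ones.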