Hacker News new | ask | show | jobs
by pragmacoders 3011 days ago
Let's say we'd like to know the median time users spend on a given page.

If we leave these at an event level, in a database, we'd need to:

- Loop through all events

- Group them by user

- Figure out which ones represent the beginning of a page view

- Figure out which ones represent the end of a page view

- Subtract these two events

- Aggregate these for all users in the database

- Find the median

This can take a prohibitive amount of time for a pretty simple question. It'd be much faster if the data was already stored in a manner that is useful for such an exploration.

As questions become more complex - so do queries. They become hard do think about and the time it takes to process them explodes.

One related concepts is "Entity-Centric Indexing", which talks about this problem in terms of ElasticSearch.

Anywho - that's a long-winded reason why I'm taking this approach!

1 comments

I answer these types of questions literally every day on Redshift. My data set is >1B events and more than doubling every year. You are underestimating the ability of a database to do its job.

Here's 18 lines of SQL to save you a week of writing a data pipeline in Go.

   WITH page_timings AS (
     SELECT page_type
          , event_timestamp_utc
          , LEAD(event_timestamp_utc, 1)
              OVER(PARTITION BY session_id
                       ORDER BY event_timestamp_utc)
            AS next_pageview
       FROM events
      WHERE event_type = 'page_view'
    )
     SELECT page_type
          , MEDIAN(
              DATEDIFF(sec,
                       event_timestamp_utc,
                       next_pageview)
            ) AS median_sec_on_page
       FROM page_timings
      GROUP BY page_type
;
Alright - Bad example. Nice SQL query!

Can we then take that raw event data, like this query, and identify users with anomalous usage patterns, automatically alerting administrators of odd activity in a timely manner?

Your answer might be "Yes". If so, I applaud your ingenuity and Redshift's ability to process data in a timely manner. It's pretty awesome, I admit!

But I believe that the best way to answer these kind of questions is to track behavioral analytics in a format that is easier to think about and easy to analyze.

Events are too deep in the weeds, I think. I may be wrong about that - But the tools I build with this will either speak for themselves or fall by the wayside as a failed experiment in history!

It's been a while since I've done database stuff, so maybe this is a naive question, but how do you find the reusability of using primarily SQL/redshift to answer these sorts of questions.

I remember getting these sorts of questions and having to write the sort of queries that you just wrote, but then having to write like 10 other ones to solve related questions with subtle differences(such as the event_type would change or something).

Hello, you may like to try out what we've been working on at Holistics (www.holistics.io)! We make it easy for you to retrieve and share your SQL data.

We help in 2 use-cases to address the "duplicative" overhead of writing the same SQL query.

1. Where you just need to adjust specific parameters/values in your query With Holistics, You can insert custom variables in your SQL query to make them dynamic to address the subtle differences without repeating yourself. See an example here: https://docs.holistics.io/guides/adding-filters/

2. Where you have chunks of text that are reused across multiple questions This can be long chunks of `CASE-IFs` or CTE/sub-queries that you use across multiple reports. You can insert a template snippet (chunks of SQL syntax for example) that can be reused across multiple questions. You can find more information here: https://docs.holistics.io/query-templates/

Thanks for the response, I'm more on the research side of things now, so I don't have a current use case, but thinking about moving back to industry, and it's something that I remember having trouble with when I was last in industry.
dashboard tools like Chartio and Looker make self-service analytics "easy" so you don't need to write 10 different flavors of the same query. We see this happening every day - our biggest customer has 1,200 analysts on Looker.

The key is how you set up your data warehouse, e.g. Amazon Redshift. What we see happening that data engineering teams provide "data services" to their company, via a set of common schemas / tables.

At the very high level:

- set up two different schemas. A (1) raw schema in which you dump all your event level data. only data engineers are allowed to access that schema. and then (2) an ad-hoc schema that analysts can use to run their queries.

- You move data from the raw to the ad-hoc schema with scheduled transformations / aggregations. Airflow, Luigi, Pinball, dbt are popular tools for that purpose. The tables in the ad-hoc schema need to be well documented so analysts can understand what data is available.

- Give every analysts a dashboard seat and access to the ad-hoc schema. Give them access to "SQL playbooks" that analysts can re-use. If you're adventurous, allow them to create their own tables.

This approach scales from a few GBs to TBs and more.

Yes, like what scapecast mentioned, the key to writing and executing SQL well is to organize and transform your data. You won't want to be running queries across large events records.

We've just published a blog post last week on some tips for analysts starting their first data warehouse project.

https://blog.holistics.io/the-analyst-guide-to-designing-a-m...

Set up a “SQL-in-charts-out” tool. There’s a number of open source ones now. Apache Superset is one example, thought I don’t necessarily recommend it specifically. This is very low cost to do and it solves this problem exactly.

The next layer of the cake is more complex and costly - setting up a SaaS BI tool like Periscope or Chartio or Looker. This only really becomes useful when you need dynamic pivoting and grouping.

I've compiled a list of all kinds of available charting/ dashboard tools in case anyone is interested in getting an overview: https://github.com/thenaturalist/awesome-business-intelligen...