Hacker News
by yonl 2802 days ago
I'm glad you asked this.

So here is the thing with our DataModel. Every time you perform an operation on a DataModel, it creates another instance. Performing multiple such operations creates a DAG in which each node is an instance of DataModel and each edge is an operation.
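A minimal sketch of that idea, for readers who want something concrete. `TinyModel` and its methods are hypothetical names made up for illustration; this is not the muze DataModel API, only one way an "every operation returns a new node" design could look.

```javascript
// Hypothetical illustration, NOT the muze DataModel API.
// Each operation returns a new instance and records the edge from its parent.
class TinyModel {
  constructor(rows, parent = null, op = null) {
    this.rows = rows;       // data held by this node
    this.parent = parent;   // node this one was derived from
    this.op = op;           // label of the edge (operation) that produced it
    this.children = [];
    if (parent) parent.children.push(this);
  }

  // Row filter: returns a new node, leaves this one untouched.
  select(predicate) {
    return new TinyModel(this.rows.filter(predicate), this, 'select');
  }

  // Column subset: returns a new node, leaves this one untouched.
  project(fields) {
    const rows = this.rows.map(r =>
      Object.fromEntries(fields.map(f => [f, r[f]])));
    return new TinyModel(rows, this, 'project');
  }

  // Operations on the path from the root to this node.
  lineage() {
    const ops = [];
    for (let node = this; node.parent; node = node.parent) ops.unshift(node.op);
    return ops;
  }
}

const root = new TinyModel([
  { day: 'Mon', user: 'a', tweets: 3 },
  { day: 'Tue', user: 'b', tweets: 5 },
  { day: 'Tue', user: 'a', tweets: 1 },
]);
const tuesday = root.select(r => r.day === 'Tue');
const tuesdayUsers = tuesday.project(['user']);
const busy = root.select(r => r.tweets > 2);
```

Here `root` ends up with two outgoing edges (to `tuesday` and `busy`), and `tuesdayUsers.lineage()` walks back up the edges to recover the operations that produced it.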

We have auto interactivity, which propagates a data (dimensions) pulse along the network. Any node that is attached to a visualization receives those pulses and changes the visual.
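As a rough sketch of what such a pulse could look like (an assumption for illustration, not how muze implements it): the helper names `makeNode`, `link`, and `propagate` are hypothetical, and the sketch assumes the pulse travels from a source node to everything downstream of it, visiting each node once.

```javascript
// Hypothetical illustration of pulse propagation over a DAG of nodes.
function makeNode(name) {
  return { name, children: [], listeners: [] };
}

function link(parent, child) {
  parent.children.push(child);
  return child;
}

// Walk every node reachable from `source` exactly once and hand the pulse
// to any listener (for example a chart) attached to that node.
function propagate(source, pulse) {
  const visited = new Set();
  const stack = [source];
  while (stack.length > 0) {
    const node = stack.pop();
    if (visited.has(node)) continue;   // a node may be reachable by two paths
    visited.add(node);
    node.listeners.forEach(listener => listener(pulse));
    stack.push(...node.children);
  }
  return visited.size;
}

const rootNode = makeNode('root');
const a = link(rootNode, makeNode('a'));
const b = link(rootNode, makeNode('b'));
const c = link(a, makeNode('c'));
link(b, c);                            // c is reachable through both a and b

const received = [];
b.listeners.push(pulse => received.push(['b', pulse.day]));
c.listeners.push(pulse => received.push(['c', pulse.day]));

const visitedCount = propagate(rootNode, { day: ['Tue'] });
```

Because of the `visited` set, the work is linear in the number of nodes and edges, and a node attached to a chart is notified once even when several paths lead to it.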

So far I have not found any relational interface that exposes this DAG and its API to the user. Hence we thought of building this.

Having said that, we might use some established relational interface and do the propagation ourselves.


The implementation you are discussing sounds pretty elegant. I am most familiar with Power BI from a data viz perspective, but have used most of the enterprise viz tools out there.

The thing that always struck me about Power BI (and also Qlik) is that it is very much a model-first tool. Visualization is secondary to the model, to the extent that much of the friction I see in new users has been treating it as a reporting/layout/visualization tool when, in fact, it is a data modeling tool with a visualization engine strapped on.

One of the big drawbacks with Power BI is that it has a terribly inefficient implementation for propagating filter contexts for visual interactions (this is their translation of your "auto interactivity, which propagates data (dimensions) pulse along the network"). I do not know the internal implementation, but I am relatively certain that visual interactions are ~O(N!) in the number of visual elements on a report page, based on my experience of performance scaling across a wide range of reports. Regardless, one of the best practices is to limit a Power BI report page to a small number of visualizations (recommendations of the cutoff value vary, and types of visuals can also impact this).

If I understand you correctly, you are calculating the minimum set of recalculations/re-renderings necessary, based on the data element that a user has interacted with. This should be something much closer to O(N) in the number of visuals to propagate user selections to other visuals. I am making an assumption that most visuals should interact, as typically the scope of a single report should have a high degree of intersection of dimensionality across all report elements.

I do not know of any analytics engine that exposes the sort of DAG and associated API you are discussing, either. The reason for my initial question was simply that this sort of engine is a product in and of itself. There are plenty of columnstore databases (and databases following other paradigms, but optimized for OLAP workloads) out there. It seems like biting off a lot to tackle both the data engine and the visualization tier at the same time.

The big reason that I ask is that this sort of approach to visualization seems to me to benefit greatly from a data model that supports transaction-level detail. The type of interactivity that you expose is extremely powerful. I have seen interactive tools hamstrung by data models that do not allow sufficient interaction. As soon as you put interactivity in front of users, in my experience, they want to do more with the data. If you are limited to datasets that can live comfortably in the browser, that seems a showstopper to me, as it will require pre-aggregation to fit most of the datasets I've seen; pre-aggregation negates many benefits of interactive data exploration.

I'll be taking a much further dive into your product either this weekend or next. I'm very interested.

You are absolutely correct: the propagation for us is O(n), as the graph is directed. But the problem there is multifold. Once a node receives a propagation pulse, it tries to figure out the affected subset using the dimensions received in the pulse. This requires joining, hence a chance of building an O(mn) cartesian product. If you see the https://www.charts.com/muze/examples/view/crossfiltering-wit... example, drawing the contribution bars when the first chart is dragged requires a join followed by a groupBy.

Which is why performing this in a browser environment, even for a small amount of data (say 10k), is a nightmare. There are ways you can address this, but in the browser you hit the limit pretty soon.
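To make the join-then-groupBy cost concrete, here is a small sketch under stated assumptions. The function names and sample rows are hypothetical, and the hashed variant is just one common way to avoid pairwise comparison; nothing here claims to be what muze does internally.

```javascript
// Hypothetical illustration of "find the affected subset, then group it".

// Naive join: may compare every data row with every pulse row (up to m * n).
function affectedNaive(rows, pulseRows, dims) {
  let comparisons = 0;
  const out = rows.filter(row =>
    pulseRows.some(p => {
      comparisons += 1;
      return dims.every(d => p[d] === row[d]);
    }));
  return { out, comparisons };
}

// Hashed semi-join: index the pulse once, then do one lookup per data row.
function affectedHashed(rows, pulseRows, dims) {
  const keyOf = r => JSON.stringify(dims.map(d => r[d]));
  const keys = new Set(pulseRows.map(keyOf));
  return rows.filter(r => keys.has(keyOf(r)));
}

// groupBy with a sum aggregate over one measure.
function groupBySum(rows, dim, measure) {
  const totals = new Map();
  for (const r of rows) {
    totals.set(r[dim], (totals.get(r[dim]) || 0) + r[measure]);
  }
  return totals;
}

const tweetRows = [
  { day: 'Mon', user: 'a', tweets: 3 },
  { day: 'Tue', user: 'b', tweets: 5 },
  { day: 'Tue', user: 'a', tweets: 1 },
  { day: 'Wed', user: 'b', tweets: 2 },
];
const pulseRows = [{ day: 'Tue' }, { day: 'Wed' }];   // e.g. a brushed range

const naive = affectedNaive(tweetRows, pulseRows, ['day']);
const hashed = affectedHashed(tweetRows, pulseRows, ['day']);
const contribution = groupBySum(hashed, 'user', 'tweets');
```

Both variants select the same rows; the hashed one does roughly m + n work instead of up to m * n, although it still has to touch every row held by the node on each pulse.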

We wanted the concept to be validated first, hence we have built it for the browser only. But we would love to hear / learn / discuss with you on this before we go ahead and build the data model on the server.

Another ambiguity with the interaction is the visual effect of interaction. Questions like: do you really want all your charts to be cross-connected? An in-house survey showed us there is no certainty about the answer. And what kind of visual effect should happen on interaction differs from person to person and is a function of the use case. Which is why we have chosen to go for user-chosen behaviour, like

```
muze.ActionModel.for(...canvases) // for all the chart in page
    .enableCrossInteractivity() // allow default cross interactivity
    .for(tweetsByDay, tweetsByDate) // but for first two canvas in the example
    .registerPropagationBehaviourMap({ select: 'filter', brush: 'filter' }) // if selection using mouse click or brushing happens filter data
```
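For readers unfamiliar with the pattern, a behaviour map like this can be thought of as a per-canvas lookup from user action to visual effect. The sketch below is a hypothetical stand-in, not muze internals: `createBehaviourRegistry`, the canvas names as plain strings, and the `'highlight'` default are all assumptions made for illustration.

```javascript
// Hypothetical illustration of a per-canvas action -> behaviour lookup.
// The default value below is an assumption, not muze's documented default.
const DEFAULT_BEHAVIOUR = 'highlight';

function createBehaviourRegistry(canvasNames) {
  // One (initially empty) override map per canvas.
  const maps = new Map(canvasNames.map(name => [name, {}]));
  return {
    // Override behaviours for the named canvases only.
    register(names, behaviourMap) {
      for (const name of names) Object.assign(maps.get(name), behaviourMap);
      return this;
    },
    // Look up what should happen on `canvasName` when `action` occurs.
    resolve(canvasName, action) {
      return maps.get(canvasName)[action] || DEFAULT_BEHAVIOUR;
    },
  };
}

const registry = createBehaviourRegistry(['tweetsByDay', 'tweetsByDate', 'contribution'])
  .register(['tweetsByDay', 'tweetsByDate'], { select: 'filter', brush: 'filter' });

const onFirstChart = registry.resolve('tweetsByDay', 'brush');
const onOtherChart = registry.resolve('contribution', 'brush');
```

The point of the design is that the cross-connection stays on for every canvas, while the effect of a given action can differ per canvas and per use case.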

We are still writing docs for this. We hope to finish all of these docs in two weeks' time.

I'm happy to continue this discussion in further detail and share my experience. You can get in touch with me at the email address in my profile if you'd like.

You're hitting a very important question in your fourth paragraph about ambiguity of desired effect from interaction. I often catch myself thinking I've heard every use case and built most of them in various viz tools. But I have learned that I am always wrong when I think that. I frequently encounter people asking for new things and it is always a toss-up whether what they want is trivial and novel or impossible and obvious.

I tend to be a data-guy much more than a viz-guy, but I fully understand the value of viz for actually presenting knowledge. Like I said, I'm interested in trying out your tool more.

Out of interest, what size of dataset are you talking about? Thousands of records? Millions?
Customers I've worked with that have small datasets would typically range into the 10M order of magnitude for a primary fact, though we had smaller outliers. Additionally, it would be common to have wide dimensions that could be KBs/record, which can add up quickly.