| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by aba_sababa 4638 days ago

How about having a table where each row corresponds to a duration of time?

time | total | mice_ids (a json string. or use mongo)

And then, run through all your data and assign each mouse to a certain duration (ie, minute one, minute two, etc), updating the total counts every time as well as the mice_ids. Then, you could query by minute and see which minutes had meetings in them (ie, select mice where total > 2)

1 comments

georgismilyanov 4638 days ago

Thank you! But I am thinking, could I be able to obtain easily the duration of each meeting using this setup? This is what I would ultimately need (to construct the empirical cdf, social network, etc.)

link

aba_sababa 4637 days ago

It also doesn't strike me as especially expensive. A dumb, brute force query (that is, something in 0(N^2)) wouldn't take so long with only a few million rows. So I would just try doing through each row, doing a full table scan to find meetings, and just let that cook for a little while.

It's not like you're ingesting gobs and gobs of data in real time :)

link

aba_sababa 4637 days ago

Hmm. How are you defining a meeting? Is a meeting a pair of mice? Or is it a pair of mice with a time component? So a pair of mice + the duration of time they were together?

link

georgismilyanov 4637 days ago

Exactly.

There are four types of meeting: mouse 1 entered -> mouse 2 entered -> mouse 2 left -> mouse 1 left; mouse 1 entered -> mouse 2 entered -> mouse 1 left -> mouse 2 left; mouse 2 entered -> mouse 1 entered -> mouse 1 left -> mouse 2 left; mouse 2 entered -> mouse 1 entered -> mouse 2 left -> mouse 1 left.

Then the meeting record contains the times the two were together and the duration of this.

The dumb approach is very slow. Would it be a good idea to read all stay results in Scala (or whatever) and try to do the searching in memory, completely eliminating th db? But then the comparisons would be more expensive...

link