Ask HN: Storing and Processing Large Amounts of Temporal-Spatial Data

Y	Hacker News new \| ask \| show \| jobs

2 points by khandelwal 5572 days ago

As part of our research group, we're collecting large amounts of location data. Our data essentially looks like (user id, lat/long co-ordinates, timestamp). There's other metadata involved too, but that's not relevant here.

We're collecting about 2-3 million records a week, and expect to collect about a year's worth of data in due time.

I'd really like some advice on techniques on storing and processing this data. We'd like to be able to answer queries similar to:

(1) For a given location, who was near that location (within a specified distance) over a specified period of time?

(2) Which locations are near each other?

That's the general idea. We don't need a real-time response, but what are good databases (or other data storage software)? I've come across people talking about k-d trees, does that work at this scale? What kind of hardware do I need? I'm hoping to get pointers towards general strategies. How do we store this data? Does it even make sense to store it all in a database? Which data/software/packages lend themselves well to distance/radius calculations?

We're most familiar with Python/Linux, would prefer to stay away from Java and prefer open source/free software. We're new to all this, pointers to books and papers would also be useful. All and any advice would be greatly useful.

2 comments

notaddicted 5572 days ago

Both MySQL and Postgres support some GIS features, I would definitely recommend taking a look. I bookmarked this book about postGIS but I haven't read it: http://www.postgis.us/ .

If the databases can handle your queries efficiently then it could save you a lot of work and you won't have to do anything nasty.

From a quick back-of-the-envelope calculation, you should have under 20GB of data, so if you need to perform an inefficient computation with low latency storing it all in RAM isn't out of the question.

link

khandelwal 5572 days ago

Postgres seems like a good option. Thanks!

link

bartonfink 5572 days ago

I had an interview with Boeing, and found out that they did work with Sybase to develop some DB extensions for just that purpose. I think they license them out, but that's not going to work well with your FOSS desires. Postgres seems to have extensions to allow temporal work - check here. http://pgfoundry.org/projects/temporal/

link