| As part of our research group, we're collecting large amounts of location data. Each record is essentially (user id, lat/long coordinates, timestamp). There's other metadata too, but it's not relevant here. We're collecting about 2-3 million records a week and expect to end up with about a year's worth of data. I'd really like some advice on techniques for storing and processing this data. We'd like to be able to answer queries such as: (1) For a given location, who was near that location (within a specified distance) over a specified period of time? (2) Which locations are near each other? That's the general idea. We don't need real-time responses, but what are good databases (or other data storage software) for this? I've come across people talking about k-d trees; do they work at this scale? What kind of hardware do we need? I'm hoping to get pointers towards general strategies. How should we store this data? Does it even make sense to store it all in a database? Which software or packages lend themselves well to distance/radius calculations? We're most familiar with Python and Linux, would prefer to stay away from Java, and prefer open-source/free software. We're new to all this, so pointers to books and papers would also be useful. Any and all advice would be greatly appreciated. |
If a database can handle your queries efficiently, it could save you a lot of work, and you won't have to build anything nasty yourself.
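To make that concrete, here is a minimal sketch of query (1) from the question, using Python's built-in `sqlite3` purely as a stand-in for whatever database you end up choosing. The table name `pings`, its columns, the index, and the helper `users_near` are all hypothetical. The idea is to let the database do a cheap bounding-box and time-range filter, then apply an exact great-circle distance check to the few candidates that remain.

```python
import math
import sqlite3

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def users_near(conn, lat, lon, radius_km, t_start, t_end):
    """Return the set of user ids seen within radius_km of (lat, lon) during [t_start, t_end]."""
    # Approximate bounding box around the query point. Adequate for small radii;
    # it ignores the poles and wraparound at the +/-180 degree meridian.
    dlat = math.degrees(radius_km / EARTH_RADIUS_KM)
    dlon = dlat / max(math.cos(math.radians(lat)), 1e-9)
    rows = conn.execute(
        "SELECT user_id, lat, lon FROM pings "
        "WHERE ts BETWEEN ? AND ? AND lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?",
        (t_start, t_end, lat - dlat, lat + dlat, lon - dlon, lon + dlon),
    )
    # Exact distance check on the candidates that survived the box filter.
    return {u for u, la, lo in rows if haversine_km(lat, lon, la, lo) <= radius_km}


# Hypothetical schema and a few made-up rows, just to exercise the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pings (user_id INTEGER, lat REAL, lon REAL, ts INTEGER)")
conn.execute("CREATE INDEX idx_pings_ts_lat_lon ON pings (ts, lat, lon)")
conn.executemany(
    "INSERT INTO pings VALUES (?, ?, ?, ?)",
    [
        (1, 40.7128, -74.0060, 1000),   # at the query point, inside the time window
        (2, 40.7130, -74.0050, 1500),   # roughly 90 m away, inside the time window
        (3, 34.0522, -118.2437, 1200),  # far away
        (4, 40.7128, -74.0060, 9000),   # at the query point, outside the time window
    ],
)

nearby = users_near(conn, 40.7128, -74.0060, radius_km=1.0, t_start=0, t_end=2000)
```

A database with real spatial indexing would replace the bounding-box trick with a proper radius query, but the shape of the question you ask it stays the same.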
From a quick back-of-the-envelope calculation, you should have under 20 GB of data after a year, so if you need to run an inefficient computation with low latency, storing it all in RAM isn't out of the question.
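For reference, the estimate can be reproduced roughly as follows. The record rate and duration come from the question; the per-record sizes are my own assumptions (a compact 32-byte layout, and a padded 128-byte one that leaves generous room for the extra metadata and storage overhead), so treat the result as an order-of-magnitude check rather than a measurement.

```python
RECORDS_PER_WEEK = 3_000_000  # upper end of the 2-3 million quoted in the question
WEEKS = 52                    # "about a year's worth of data"


def estimate_gb(bytes_per_record, records_per_week=RECORDS_PER_WEEK, weeks=WEEKS):
    """Approximate dataset size in gigabytes (1 GB = 10**9 bytes)."""
    return records_per_week * weeks * bytes_per_record / 1e9


# Assumed record layouts (not from the original post):
#   32 bytes  = 8-byte user id + two 8-byte floats (lat, long) + 8-byte timestamp
#   128 bytes = the same record plus generous room for metadata and overhead
total_records = RECORDS_PER_WEEK * WEEKS
compact_gb = estimate_gb(32)
padded_gb = estimate_gb(128)

print(f"{total_records:,} records")      # 156,000,000 records
print(f"compact: {compact_gb:.2f} GB")   # compact: 4.99 GB
print(f"padded:  {padded_gb:.2f} GB")    # padded:  19.97 GB
```

Even with the padded assumption the total stays just under 20 GB, which fits in the RAM of a single well-specified machine.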