|
|
|
|
|
by Jerry2
2674 days ago
|
|
I'm writing an IoT library for devices with tiny microprocessors and have been sending data as JSON or BSON (binary JSON). On the backend, I've been storing reports from IoT devices into a database (MariaDB on AWS). How crazy would it be to just store all the data as JSON files on disk (or S3 bucket) and then batch process them when I need to perform data analysis on them? If a million devices sends dozens of status reports per day, that's going to be a crapton on files... but that might be faster to process than querying the database. If you or anyone else has some opinions on this, please let me know! I'd really like to learn how people do this type of analysis at scale. |
|
One thing locally is each file takes up a full block. So even if you only need 500 bytes of data in a file, and a block is 4kb, youve wasted 3.5kb of space and IO. Multiply that by a million and youre wasting gigabytes of space.
In S3, listing 12 million files takes 12 thousand http(max return is 1000 items). So that would take two minutes if you assume its 10ms per round trip. Let's say you wanted to read each file, and again each read takes 10ms.. youre looking at 1.4 days. Obviously this can be parallelized, but when you look at the raw byte size this is a huge overhead, and this is just to read one day of data.
If you concatenate the files together to get a reasonable size and number of files, raw json on s3 is really powerful. Point athena at it, and you just write sql and it handles the rest, and is serverless. But it does make single row lookups more expensive(supplementing with dynamodb could keep it serverless if single row lookups are frequent).
lots of optimizations will get improvements, like parquet that tobilg mentioned(binary format and columnar), but anything with a decent file size will work.