Hacker News new | ask | show | jobs
by Jerry2 2674 days ago
I'm writing an IoT library for devices with tiny microprocessors and have been sending data as JSON or BSON (binary JSON). On the backend, I've been storing reports from IoT devices into a database (MariaDB on AWS). How crazy would it be to just store all the data as JSON files on disk (or S3 bucket) and then batch process them when I need to perform data analysis on them? If a million devices sends dozens of status reports per day, that's going to be a crapton on files... but that might be faster to process than querying the database.

If you or anyone else has some opinions on this, please let me know! I'd really like to learn how people do this type of analysis at scale.

9 comments

reading lots of small files on s3 or local filesystems is tricky. a million devices with one dozen files, so lets say 12 million files.

One thing locally is each file takes up a full block. So even if you only need 500 bytes of data in a file, and a block is 4kb, youve wasted 3.5kb of space and IO. Multiply that by a million and youre wasting gigabytes of space.

In S3, listing 12 million files takes 12 thousand http(max return is 1000 items). So that would take two minutes if you assume its 10ms per round trip. Let's say you wanted to read each file, and again each read takes 10ms.. youre looking at 1.4 days. Obviously this can be parallelized, but when you look at the raw byte size this is a huge overhead, and this is just to read one day of data.

If you concatenate the files together to get a reasonable size and number of files, raw json on s3 is really powerful. Point athena at it, and you just write sql and it handles the rest, and is serverless. But it does make single row lookups more expensive(supplementing with dynamodb could keep it serverless if single row lookups are frequent).

lots of optimizations will get improvements, like parquet that tobilg mentioned(binary format and columnar), but anything with a decent file size will work.

Yeah, this is what Kinesis Firehose is for. Send all of your messages there and it will batch them to S3.
You may enjoy this:

The best way to not lose messages is to minimize the work done by your log receiver. So we did. It receives the uploaded log file chunk and appends it to a file, and that's it. The "file" is actually in a cloud storage system that's more-or-less like S3. When I explained this to someone, they asked why we didn't put it in a Bigtable-like thing or some other database, because isn't a filesystem kinda cheesy? No, it's not cheesy, it's simple. Simple things don't break.

https://apenwarr.ca/log/20190216

We‘re using AWS Kinesis delivery streams to batch incoming JSON messages from IoT devices to Parquet files in S3. Those can directly be read by different AWS services like Redshift, EMR or Athena...
We use Athena for all our robotics data, which we ETL into JSON. It's fantastic for queries that are simple time-slice queries, as most are because sensor data is inherently time-series. When more complicated joins are necessary, the performance is there across terabytes, and the cost is very very low, $5 per terabyte scanned (storage costs are another thing).
What bothers me about Kinesis is that it is prohibitively expensive at scale if you don't compress your data before putting it to Kinesis.

But if you want to use the nice features like parquet conversion your data can't be compressed.

If it could handle compressed data at the same price I would use a lot more of it.

You’ve kinda just described AWS Athena.
This comment needs to be higher up; Amazon has a service for doing just this, dumping 'dumb' files (like json, csv, etc) into S3 buckets and performing SQL queries on them. No need to have to think about how to store things for future querying.
I've used Athena really effectively to solve similar problems. If your data storage is relatively small and/or your queries relatively infrequent, JSON can be a good fit. As one of those dimensions expands, you can decrease costs/increase performance by converting to Parquet and compressing.
I am replying to you as an engineer at an IoT company that provides SaaS in AWS for the data our devices produce. To solve this problem, we transmit our data in a proprietary "raw" binary format that then gets parsed into a protobuf. All data for a given UTC day is appended to this protobuf file and hosted in S3. Retrieving data requires downloading the protobuf file from S3, unmarshalling the protobuf, and finding the entry you care about.
If you are considering using plain files instead of a DB server, you could try a compromise and use an embedded key-value store like RocksDB, LevelDB, BadgerDB etc.

It's local storage only, limited query capabilities depending on the DB, but should be extremely fast.

Why not use a timeseries database, like http://btrdb.io?
well if you need indexed lookups, then use a database

if you're doing "table scan" processing of entire datasets, sure just-a-bunch-of-files would work too.

Databases can be surprisingly fast for things like that, since high performance file i/o is full of tricky/annoying stuff that databases have already optimized for.

Depending on your size / budget / needs Snowflake may interest you. https://www.snowflake.com/product/architecture/.

I haven't used it but have been given a presentation by them on it, and it was very very good.

They store data in S3 and use FoundationDB for indexes. You can feed it JSON and it'll index it and let you query it on a massive scale shockingly fast.

Obviously they are not aimed at small hobby projects but if your project has money / serious product depending on your needs it's well worth looking at.

On the S3 cheaper / smaller end you can batch up data daily / weekly etc. So the landing bucket acts as a queue that gets processed creating daily batch files from the small files aggregated together. You can then take the daily batches to create weekly batches etc etc, essentially partitioning. This will reduce the total number of files needed to query. If you use deterministic names based on how you plan to query this can also reduce the number of files you need to list / parse. When batching / re-partitioning the data you can also use the Apache Parquet format to compress a little better + also import in some of the querying tools out there.