Hey, great questions!

Query Performance: Right now, we've got test machines deployed with 8 500GB disks for packets + 1 indexing disk (all 15K RPM spinning disks). They stay about 90% full, or roughly 460GB/disk, about 1K files/disk. Querying over the entire corpus (~4TB of packets) for something innocuous like 'port 65432' takes 25 seconds to return ~50K packets (that's after dropping all disk caches). The same query run again takes 1.5 sec, with disk caches in place. Of course, the number of packets returned is a huge factor in this... each packet requires a seek in the packets file. Searching for something that doesn't exist (host 0.0.0.1) takes roughly 5 seconds. Note that time-based queries, like "port 4444 and after 3h ago and before 1h ago", only query the files that can overlap the requested window, taking advantage of the fact that we name files by microsecond timestamp and we flush files every minute.

A big part of query performance is actually over-provisioning disks. We see disk throughput of roughly 160-180MB/s. If we write 160MB/s, our read throughput is awful. If we write 100MB/s, it's pretty good. Who would have thought: disks have limited bandwidth, and it's shared between reads and writes. :)

We actually don't use LevelDB... we use the SSTables that underlie LevelDB. Since we know we're write-once, we use https://github.com/google/leveldb/blob/master/include/leveld... directly for writes (and its Go equivalent for reads). I'm familiar with the file format (they're used extensively inside Google), so it was a simple solution. That said, it's been very successful... we tend to have indexes in the 10s of MBs for 2-4GB files. Of course, index size/compressibility is directly correlated with network traffic: more varied IPs/ports would be harder to compress. The built-in compression of LevelDB tables is also a boon here... we get prefix compression on keys, plus snappy compression on packet seek locations, for free.

We currently do no compression of packets.
Doing so would definitely increase our CPU usage per packet, and I'm really scared of what it would do to reads. Consider that reading packets from compressed storage would require decompressing every block a packet is in. On the other hand, if someone wanted to store packets REALLY long term, they could easily compress the entire blockfile+index before uploading to more permanent storage. I expect this would be better than having to do it inline. Even if we did build it in, we'd probably do it tiered (initial write uncompressed, then compress later as resources allow).

AF_PACKET is no better than PF_RING+DNA, but I also don't think it's any worse. They both have very specific trade-offs. The big draw for me for AF_PACKET is that it's already there... any stock Linux machine will already have it built in and working. Thus steno should "just work", while a PF_RING solution has a slightly higher barrier to entry. I think PF_RING+DNA should give similar performance to steno... but libzero currently probably gives better performance because packets can be shared across processes. This is a really interesting problem that I'm wondering if we could also solve with AF_PACKET... but that's a story for another day. Short story: I wanted this to work on stock Linux as much as possible.
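
To make the time-based file pruning mentioned above concrete, here's a rough sketch (not steno's actual code). It assumes each file is identified by its starting timestamp in microseconds and that a file holds packets from its own start up to the start of the next file; `files_for_window` and its arguments are hypothetical names made up for illustration.

```python
from bisect import bisect_right


def files_for_window(file_start_us, after_us, before_us):
    """Pick the files that could hold packets with after_us <= ts <= before_us.

    file_start_us: iterable of file start timestamps in microseconds
    (an assumption: the file name encodes when the file was started).
    """
    names = sorted(file_start_us)
    # The last file that started at or before `after_us` may still contain
    # packets inside the window, so step back one file from the insert point.
    lo = max(bisect_right(names, after_us) - 1, 0)
    # Files that start after `before_us` cannot contain matching packets.
    hi = bisect_right(names, before_us)
    return names[lo:hi]


if __name__ == "__main__":
    minute = 60 * 1_000_000
    files = [i * minute for i in range(5)]  # one file per minute
    print(files_for_window(files, 70_000_000, 130_000_000))
```

Only the selected files' indexes then need to be consulted for the rest of the query (e.g. the port match), which is why a narrow time window is so much cheaper than a full-corpus search.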
I'm interested because I wrote an app-specific indexer that required "interactive" query response times over a couple TB, for multiple users. But that was years ago (before LevelDB and Snappy, and Kyoto Cabinet had far too much overhead per kv), on small CPUs and a single 7200rpm disk. I got compression ratios of 5 to 6 using QuickLZ; a non-trivial gain.
I was looking at this problem space again and considering a delta+int compression approach to offsets, given they're just incremental. (And there are cool SIMD algorithms for 'em.) But it sounds like SSTable + fscache is fast enough, wow, that's pretty cool!
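
For anyone who hasn't seen the delta+int idea, here's a minimal scalar sketch of it, assuming the offsets are non-negative and sorted. It stores the gap between consecutive offsets as a LEB128-style varint (7 bits per byte, high bit meaning "more bytes follow"). The SIMD schemes are cleverer than this; `encode_offsets` and `decode_offsets` are just illustrative names.

```python
def encode_offsets(offsets):
    """Delta-encode sorted, non-negative offsets as variable-length ints."""
    out = bytearray()
    prev = 0
    for off in offsets:
        delta = off - prev
        if delta < 0:
            raise ValueError("offsets must be sorted and non-negative")
        prev = off
        # Emit 7 bits at a time, low bits first; set the high bit on
        # every byte except the last one of this value.
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)


def decode_offsets(data):
    """Inverse of encode_offsets: rebuild the absolute offsets."""
    offsets = []
    current = 0   # running sum of deltas
    delta = 0     # value being assembled
    shift = 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # more bytes follow for this value
        else:
            current += delta
            offsets.append(current)
            delta = 0
            shift = 0
    return offsets


if __name__ == "__main__":
    offs = [0, 1500, 3000, 3064, 1_000_000]
    enc = encode_offsets(offs)
    print(len(enc), "bytes instead of", 8 * len(offs))
    assert decode_offsets(enc) == offs
```

The win comes from the gaps being small (roughly a packet length each), so most of them fit in one or two bytes instead of a fixed 8-byte offset.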
The decompression of blocks in some apps doesn't have to be much of a penalty if there's a reasonable amount of clustering going on in the sample set. Instead of just splitting blocks on time, I segmented them based on flow and time, so there's great locality for most queries. I did L7 inspection, and an old quad-core Core2 could handle 1Gbps, so 10Gbps is probably achievable nowadays, certainly for L4 flows.
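
Roughly, the flow+time segmentation looks like the sketch below. This is a simplified reconstruction of the idea, not the original code: the packet tuple layout, the 60-second bucket, and the names `flow_key` and `segment` are all assumptions for illustration.

```python
from collections import defaultdict


def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Direction-independent L4 flow key: both directions map to one key."""
    lo, hi = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (proto, lo, hi)


def segment(packets, bucket_seconds=60):
    """Group packets by (time bucket, flow) instead of by time alone.

    Each packet is assumed to be a tuple:
    (ts_seconds, src_ip, src_port, dst_ip, dst_port, proto, payload)
    """
    segments = defaultdict(list)
    for pkt in packets:
        ts, src_ip, src_port, dst_ip, dst_port, proto, _payload = pkt
        bucket = int(ts // bucket_seconds)
        key = (bucket, flow_key(src_ip, src_port, dst_ip, dst_port, proto))
        segments[key].append(pkt)
    return segments


if __name__ == "__main__":
    pkts = [
        (10.0, "10.0.0.1", 40000, "10.0.0.2", 80, "tcp", b"GET /"),
        (10.5, "10.0.0.2", 80, "10.0.0.1", 40000, "tcp", b"200 OK"),
        (11.0, "10.0.0.3", 40001, "10.0.0.2", 53, "udp", b"query"),
    ]
    for key, group in segment(pkts).items():
        print(key, len(group))
```

Writing each segment out contiguously means a query for one flow touches a handful of neighboring blocks, so whatever decompression is needed gets amortized over many matching packets.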
Further, the real cost is the seek, and transferring a few more sectors won't cost as much. If you're using mmap'd IO for reading, you might be able to compress pages and not pay any IO penalty, right? In fact, it might even reduce the number of seeks, since more packets end up clustered onto the same page. And I think some of the fastest compression algorithms only look back a very small amount, like 16K or 64K, anyway? Although this is probably easier done by just using a compressed filesystem, because the cache management code is probably nontrivial.
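
Here's a toy sketch of what per-block compression with random access could look like, just to show the shape of it. It uses zlib from the Python standard library as a stand-in for a fast codec like Snappy or QuickLZ, keeps everything in memory, and ignores caching entirely; the 64K block size and the names `compress_blocks` and `read_range` are assumptions.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # assumed block size, matching a small lookback window


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress fixed-size blocks independently and build a block index.

    Returns (blob, index) where index[i] = (offset in blob, compressed length)
    for the i-th uncompressed block.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        comp = zlib.compress(data[start:start + block_size], 1)  # fastest level
        index.append((len(blob), len(comp)))
        blob += comp
    return bytes(blob), index


def read_range(blob, index, offset, length, block_size=BLOCK_SIZE):
    """Read `length` uncompressed bytes at `offset`, touching only the
    blocks that overlap the requested range."""
    if length <= 0:
        return b""
    first = offset // block_size
    last = min((offset + length - 1) // block_size, len(index) - 1)
    chunks = []
    for i in range(first, last + 1):
        pos, clen = index[i]
        chunks.append(zlib.decompress(blob[pos:pos + clen]))
    joined = b"".join(chunks)
    skip = offset - first * block_size
    return joined[skip:skip + length]


if __name__ == "__main__":
    data = bytes(range(256)) * 1000
    blob, index = compress_blocks(data)
    # A "packet" that straddles the first block boundary.
    assert read_range(blob, index, 65530, 20) == data[65530:65550]
    print(len(data), "->", len(blob), "bytes in", len(index), "blocks")
```

A packet read costs one index lookup plus decompressing one block (two if it straddles a boundary), so how much this hurts depends on how many of the packets a query wants share a block.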