Hacker News new | ask | show | jobs
by shaeqahmed 1240 days ago
Matano is completely serverless and stores all data in ZSTD compressed parquet files in dirt-cheap object storage, allowing you to bring your own analytics stack for queries on large amounts of data for things like investigations and threat hunts. Since we store data in a columnar format and plug in query engines like Snowflake that are optimized for analytical processing the queries on specific columns will run much faster than they would run if executed on a search engine database like Elasticsearch which would require maintenance to scale.

I think it's important to understand that search engines and OLAP/data warehouse query engines have fundamental architectural differences that offer pros/cons for different use cases.

For enterprise security analytics on things like network or endpoint logs which can hit 10-100TB+/day, using anything other than a data lake is simply not a cost-effective option. Apache Iceberg was created as a big data table format for exactly this type of use case at companies like Netflix and Apple.

1 comments

You are not wrong, but I do think realtime and olap have been converging a bit for a while.

Stateless elasticsearch and opensearch are actually moving to a similar model as what Matano proposes. They both have made announcements for stateless versions of their mutual forks. Data at rest with that will live in s3 and there are no more clusters, just auto scaling ingest and query nodes that coordinate via s3 and that allow you to scale your writes and reads independently. Internal elasticearch and opensearch data formats are of course heavily optimized and compact as well. Recent versions have e.g. added some more compression options and sparse colunn data support.

But they are also optimized for good read performance. There's a tradeoff. If you write once and read rarely, you'd use more heavy compression. If you expect to query vast amounts of data regularly, you need something more optimal because it takes CPU overhead to de-compress.

For search and aggregations, you either have an index or you basically need to scan through the entirety of your data. Athena does that. It's not cheap. Lambda functions still have to run somewhere and receive data. They don't run locally to buckets. Ultimately you pay for compute, bandwidth, and memory. Storing data is cheap but using it is not. That's the whole premise of how AWS makes money.

Splunk and Elasticsearch are explicitly aimed at real-time use cases (dashboards, alerts, etc.), which is also what Matano seems to be targeting. But it can also deal with cold storage. Index life cycle management allows you to move data from hot, warm, and cold storage. Cold here means snapshot in S3 that can be restored on demand for querying. It also has rollovers and a few other mechanisms to save a bit on storage. So, it's not that black and white.

Computing close to where the data lives is a good way to scale. Having indexing and caching can cut down on query overhead. That's hard with lambdas, athena, and related technology. But those are more suited for one off queries where you don't care that it might take a few seconds/minutes/hours to run. Different use case.