by jillesvangurp 1248 days ago
Interesting project.

A few remarks though.

- Doing real-time data processing on terabytes or petabytes involves a lot of IO, which is a significant part of the cost in AWS. Things like Athena are simply not cheap to run at that scale.

- With time series data, the emphasis is usually on querying recent data, not all of the data. You retain older data for auditing for some time. But this can essentially be cold storage.

- Alerting-related querying in particular is effectively against recent data only. There's no good reason for this to be slow.

- People tend to scale Elasticsearch for the whole data set instead of just recent data. However, with suitable data stream and index life cycle management policies, you can contain the cost quite effectively.

- Elastic Common Schema is nice but also adds a lot of verbosity to your data and queries, bloating individual log entries to a KB or more. Parquet is a nice option for sparsely populated, column-oriented data of course. The online disk storage is probably not massively different from a well-tuned Elastic index.

- Elastic and OpenSearch have both announced stateless as their next goal. So, architecturally similar to this and easier to scale horizontally.

- SIEM is just one use case. What about APM, log analytics, and other time series data? Security events usually involve looking at all of that.

1 comments

Matano is completely serverless and stores all data in ZSTD-compressed Parquet files in dirt-cheap object storage, allowing you to bring your own analytics stack for queries on large amounts of data for things like investigations and threat hunts. Since we store data in a columnar format and plug in query engines like Snowflake that are optimized for analytical processing, queries on specific columns will run much faster than they would on a search engine database like Elasticsearch, which would also require maintenance to scale.

I think it's important to understand that search engines and OLAP/data warehouse query engines have fundamental architectural differences that offer pros/cons for different use cases.

For enterprise security analytics on things like network or endpoint logs which can hit 10-100TB+/day, using anything other than a data lake is simply not a cost-effective option. Apache Iceberg was created as a big data table format for exactly this type of use case at companies like Netflix and Apple.

You are not wrong, but I do think real-time and OLAP have been converging a bit for a while.

Stateless Elasticsearch and OpenSearch are actually moving to a model similar to what Matano proposes. Both projects have announced stateless versions. With those, data at rest will live in S3 and there are no more clusters, just auto-scaling ingest and query nodes that coordinate via S3 and allow you to scale your writes and reads independently. Internal Elasticsearch and OpenSearch data formats are of course heavily optimized and compact as well. Recent versions have, e.g., added some more compression options and sparse column data support.

But they are also optimized for good read performance. There's a tradeoff. If you write once and read rarely, you'd use heavier compression. If you expect to query vast amounts of data regularly, you need something more optimized for reads, because decompressing takes CPU.
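To make that tradeoff a bit more concrete, here is a rough sketch using only Python's standard library. It compares a light codec setting (`zlib` level 1) with a heavy one (`lzma` preset 6) on synthetic log lines. The sample data and the `make_log_sample` and `measure` helpers are made up for illustration; neither codec is what Elasticsearch or Matano actually uses, and the timings will vary by machine.

```python
import lzma
import time
import zlib


def make_log_sample(n_lines: int = 20000) -> bytes:
    """Build synthetic, fairly repetitive log lines (hypothetical data)."""
    lines = [
        f"2023-01-01T00:00:{i % 60:02d}Z host-{i % 50} GET /api/items/{i} 200 {i * 37 % 1000}ms"
        for i in range(n_lines)
    ]
    return "\n".join(lines).encode()


def measure(name, compress, decompress, raw):
    """Compress once, then time a single decompression of the result."""
    packed = compress(raw)
    start = time.perf_counter()
    restored = decompress(packed)
    elapsed = time.perf_counter() - start
    if restored != raw:
        raise ValueError(f"{name}: round trip failed")
    return {"codec": name, "ratio": len(raw) / len(packed), "decompress_s": elapsed}


RAW = make_log_sample()
RESULTS = [
    measure("zlib level 1 (light)", lambda b: zlib.compress(b, 1), zlib.decompress, RAW),
    measure("lzma preset 6 (heavy)", lambda b: lzma.compress(b, preset=6), lzma.decompress, RAW),
]

for r in RESULTS:
    print(f"{r['codec']:<24} ratio={r['ratio']:.1f}x decompress={r['decompress_s'] * 1000:.1f} ms")
```

The interesting comparison is ratio versus decompression time: the heavier codec typically wins on size and loses on read cost, which is the write-once-read-rarely versus read-often choice described above.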

For search and aggregations, you either have an index or you basically need to scan through the entirety of your data. Athena does that. It's not cheap. Lambda functions still have to run somewhere and receive data. They don't run locally to buckets. Ultimately you pay for compute, bandwidth, and memory. Storing data is cheap but using it is not. That's the whole premise of how AWS makes money.
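A toy sketch of the index-versus-scan point, in plain Python. The `EVENTS` data and the `scan`, `build_index`, and `lookup` helpers are hypothetical and say nothing about how Athena or Elasticsearch work internally; the only point is how many records each approach has to touch to answer the same query.

```python
from collections import defaultdict

# Hypothetical events: 10,000 records spread evenly over 100 hosts.
EVENTS = [
    {"id": i, "host": f"host-{i % 100}", "status": 500 if i % 97 == 0 else 200}
    for i in range(10_000)
]


def scan(events, host):
    """Full scan: every record is read, whether it matches or not."""
    touched = 0
    hits = []
    for event in events:
        touched += 1
        if event["host"] == host:
            hits.append(event["id"])
    return hits, touched


def build_index(events):
    """Pay once at write time: map each host to the positions of its records."""
    index = defaultdict(list)
    for pos, event in enumerate(events):
        index[event["host"]].append(pos)
    return index


def lookup(events, index, host):
    """Indexed query: only the matching records are read."""
    positions = index.get(host, [])
    return [events[p]["id"] for p in positions], len(positions)


INDEX = build_index(EVENTS)
scan_hits, scan_touched = scan(EVENTS, "host-7")
index_hits, index_touched = lookup(EVENTS, INDEX, "host-7")
print(f"scan touched {scan_touched} records, index touched {index_touched}")
```

Both paths return the same ids, but the scan's cost grows with the size of the data set while the lookup's cost grows with the size of the result. The index is not free either: it costs compute at ingest time and storage afterwards.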

Splunk and Elasticsearch are explicitly aimed at real-time use cases (dashboards, alerts, etc.), which is also what Matano seems to be targeting. But Elasticsearch can also deal with cold storage. Index life cycle management allows you to move data between hot, warm, and cold storage. Cold here means a snapshot in S3 that can be restored on demand for querying. It also has rollovers and a few other mechanisms to save a bit on storage. So, it's not that black and white.
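For anyone who hasn't seen one, this is roughly what such a life cycle policy looks like, built here as a Python dict in the shape of the body you would send to Elasticsearch's `PUT _ilm/policy/<name>` endpoint. The ages, the shard size, and the `my-s3-repo` repository name are placeholder assumptions, not recommendations. The `phase_for_age` helper is purely illustrative: real ILM measures `min_age` from rollover and evaluates phases server-side.

```python
import json

# Sketch of an ILM policy body; all thresholds and names are hypothetical.
POLICY = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Start a new backing index daily or at 50 GB per primary shard.
                    "rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}
                }
            },
            "warm": {
                "min_age": "7d",
                "actions": {"forcemerge": {"max_num_segments": 1}},
            },
            "cold": {
                "min_age": "30d",
                # Back the index with a snapshot in an (assumed) S3 repository.
                "actions": {"searchable_snapshot": {"snapshot_repository": "my-s3-repo"}},
            },
            "delete": {"min_age": "90d", "actions": {"delete": {}}},
        }
    }
}


def phase_for_age(policy, age_days):
    """Illustrative only: pick the last phase whose min_age (in days) has passed."""
    current = "hot"
    for name, phase in policy["policy"]["phases"].items():
        min_age = phase.get("min_age", "0d")
        if age_days >= int(min_age.rstrip("d")):
            current = name
    return current


print(json.dumps(POLICY, indent=2))
for age in (3, 10, 45, 120):
    print(f"{age:>3} days old -> {phase_for_age(POLICY, age)}")
```

With something like this in place, only the most recent week or so of data sits on expensive hot nodes, which is the cost containment argued for earlier in the thread.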

Computing close to where the data lives is a good way to scale. Having indexing and caching can cut down on query overhead. That's hard with Lambdas, Athena, and related technology. But those are more suited to one-off queries where you don't care that it might take a few seconds/minutes/hours to run. Different use case.