| Interesting project. A few remarks though. - Doing real-time data processing on terabytes or petabytes involves a lot of IO, which is a significant part of the cost in AWS. Things like Athena are simply not cheap to run at that scale. - With time series data, the emphasis is usually on querying recent data, not all of the data. You retain older data for auditing for some time, but this can essentially be cold storage. - Alerting-related querying in particular is effectively against recent data only. There's no good reason for this to be slow. - People tend to scale Elasticsearch for the whole data set instead of just recent data. However, with suitable data stream and index lifecycle management policies, you can contain the cost quite effectively. - Elastic Common Schema is nice but also adds a lot of verbosity to your data and queries, bloating individual log entries to a KB or more. Parquet is a nice option for sparsely populated, column-oriented data of course. The online disk storage is probably not massively different from a well-tuned Elastic index. - Elastic and OpenSearch have both announced stateless as their next goal. So, architecturally similar to this and easier to scale horizontally. - SIEM is just one use case. What about APM, log analytics, and other time series data? Security events usually involve looking at all of that. |
I think it's important to understand that search engines and OLAP/data warehouse query engines are architecturally different in fundamental ways, and each comes with its own trade-offs depending on the use case.
For enterprise security analytics on things like network or endpoint logs which can hit 10-100TB+/day, using anything other than a data lake is simply not a cost-effective option. Apache Iceberg was created as a big data table format for exactly this type of use case at companies like Netflix and Apple.
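To put those ingest rates in perspective, here is a rough back-of-the-envelope sketch of how much data ends up retained. The retention windows (30 and 365 days) are assumptions picked for illustration, and the numbers ignore compression, replication, and indexing overhead, so treat them as orders of magnitude only.

```python
# Back-of-the-envelope retained volume for a given daily ingest rate.
# Assumptions (not from any vendor): decimal units (1 PB = 1000 TB),
# no compression, no replicas, no index overhead.

TB_PER_PB = 1000


def retained_pb(daily_tb: float, retention_days: int) -> float:
    """Raw data retained, in PB, for a steady ingest rate in TB/day."""
    return daily_tb * retention_days / TB_PER_PB


if __name__ == "__main__":
    # 10 and 100 TB/day are the ends of the range mentioned above;
    # the retention windows are hypothetical.
    for daily_tb in (10, 100):
        for days in (30, 365):
            print(f"{daily_tb} TB/day over {days} days: "
                  f"{retained_pb(daily_tb, days):.2f} PB")
```

Even at the low end, a year of retention lands in multi-petabyte territory, which is why the storage layer dominates the cost discussion.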