by ismail 3576 days ago

Speaking from experience with DPI: we did a POC project on analysing DPI data with Hadoop, Spark and other big data tech.

You are right about the volumes, but wrong about it being impractical.

The volumes with a relatively small opco:

- ~7m subs

- 250 GB just for the protocol classification. *

- Then you also have URL logs etc.
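To put those volumes in perspective, here is a rough back-of-envelope calculation. It only uses the two figures above (about 7m subs, 250 GB per day for protocol classification) and assumes decimal units; the function names are made up for illustration.

```python
# Back-of-envelope figures from the numbers quoted above (decimal units assumed).
SUBSCRIBERS = 7_000_000   # roughly 7m subs
GB_PER_DAY = 250          # protocol classification only, per day


def raw_volume_tb(days: int) -> float:
    """Unaggregated protocol-classification volume after `days` days, in TB."""
    return GB_PER_DAY * days / 1000


def kb_per_sub_per_day() -> float:
    """Average protocol-classification data per subscriber per day, in KB."""
    return GB_PER_DAY * 1_000_000 / SUBSCRIBERS


if __name__ == "__main__":
    print(f"1 year raw: {raw_volume_tb(365):.2f} TB")
    print(f"per sub per day: {kb_per_sub_per_day():.1f} KB")
```

So a full year of raw protocol classification data is on the order of 90 TB before any aggregation, and URL logs come on top of that.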

Key factors that reduce the costs and investment:

- commodity hardware (with Hadoop etc.)

- distributed

- query patterns

- you do not need to store every single record. The older the data is, the more coarsely it can be aggregated: hourly, then daily, then monthly

This is what we did: the data was aggregated, which significantly reduced storage.
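A minimal sketch of that kind of age-based rollup, in plain Python rather than any of the Hadoop tooling. The record layout (timestamp, protocol, bytes) and the 7-day and 90-day cut-offs are assumptions for illustration only, not what the POC actually used.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical retention tiers; the real cut-offs are not given above.
HOURLY_MAX_AGE = timedelta(days=7)
DAILY_MAX_AGE = timedelta(days=90)


def bucket_for(ts: datetime, now: datetime) -> tuple:
    """Return (granularity, bucket start); older records get coarser buckets."""
    age = now - ts
    if age <= HOURLY_MAX_AGE:
        return "hour", ts.replace(minute=0, second=0, microsecond=0)
    if age <= DAILY_MAX_AGE:
        return "day", ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return "month", ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


def roll_up(records, now: datetime) -> dict:
    """Sum bytes per (granularity, bucket start, protocol)."""
    totals = defaultdict(int)
    for ts, protocol, nbytes in records:
        granularity, start = bucket_for(ts, now)
        totals[(granularity, start, protocol)] += nbytes
    return dict(totals)
```

Each group of raw records collapses into one row per bucket and protocol, so the row count for old data depends on the number of buckets and protocols rather than on the number of raw records.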

We tested various options: Hive, HBase, Druid.

Edit: * per day