by Wronnay 1390 days ago
One file has a size of nearly 5 GB... Every day two of these files get released... So nearly 10 GB every day.

So if we download the raw data every day for one year, we would have 3650 GB just for one year...

It would be interesting to know how much smaller the processed data is than the raw data. You say that you have 50 GB of data spanning multiple years. How many years exactly?

2 comments

Ah, this is a nuanced point I totally left off the README. Each raw file is ~5GB, but the raw files are a dump of network traffic from the firehose feed that tracks not just trades, but also updates to orders that do not result in trades. If you skip all of the algorithmic bots' constant updates to their bid/offer spreads and look at just the trades that clear, you can store years of raw trades in under 100GB.
Yes, datasets of this type are massive. We use TAQ for research work but almost never use the raw data as is.

> So if we download the raw data every day for one year, we would have 3650 GB just for one year.

A small correction - the stock market is open only about 250 days a year, so by your calculation the raw data would come to about 2500 GB per year. Still massive.
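
For anyone who wants to rerun the numbers, here is a quick sketch of the arithmetic in Python. The per-file size and files-per-day figures are the rough ones quoted in this thread, 250 trading days is an approximation that varies by year, and the function name `yearly_raw_gb` is made up for illustration.

```python
# Back-of-the-envelope estimate using the rough figures quoted in this thread.
GB_PER_FILE = 5        # "nearly 5 GB" per raw file (approximate)
FILES_PER_DAY = 2      # two raw files released per day
CALENDAR_DAYS = 365    # if a file pair appeared every calendar day
TRADING_DAYS = 250     # approximate number of days the market is open


def yearly_raw_gb(days: int) -> int:
    """Raw data volume in GB accumulated over the given number of days."""
    return GB_PER_FILE * FILES_PER_DAY * days


print(yearly_raw_gb(CALENDAR_DAYS))  # 3650
print(yearly_raw_gb(TRADING_DAYS))   # 2500
```

Either way the raw feed runs to terabytes per year, which is why keeping only the cleared trades makes such a difference.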