by eth0up 714 days ago

Question: I started with a deliberately convoluted PDF which, after much effort, I filtered, sorted and reorganized, transferring the 18,000 useful lines to a CSV. These lines are simple: each has a date, an indicator and the corresponding numbers.

The purpose is to statistically analyze the numbers for anomalies or any signs of deviation from expected randomness. I do this all in Python 3 with various libraries. It seems to be working, but...

What is a more efficient format than csv for this kind of operation?

Edit: I have also preserved all leading zeros by converting the values to strings -- CSV readers don't care much for leading zeros and simply drop them, but quoting fixes that.
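For anyone following along, here is a minimal sketch of the leading-zero point using only the standard library. The column names and sample values are made up for illustration; they are not from the real file. Python's `csv` module returns every field as a string, so the zeros survive until something converts the value to a number.

```python
import csv
import io

# Hypothetical sample data: column names and values are invented for illustration.
sample = io.StringIO(
    'date,indicator,number\n'
    '2022-01-03,A,"00417"\n'
    '2022-01-04,B,"07250"\n'
)

# csv.DictReader yields every field as a str, so "00417" stays "00417".
rows = list(csv.DictReader(sample))
numbers = [row["number"] for row in rows]
print(numbers)   # ['00417', '07250']

# Converting to int is the step that drops the zeros,
# so do it only where arithmetic is actually needed.
as_ints = [int(n) for n in numbers]
print(as_ints)   # [417, 7250]
```

Keeping the string form alongside the numeric form may be useful if the digits themselves (not just the magnitudes) are part of the randomness check.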

2 comments

brunokim 714 days ago

18k lines is very small; CSV is fine as a storage option.

My rule of thumb is that anything that fits into Excel (approx. 1M rows; the worksheet limit is 1,048,576) is "small data" and can be analysed with pandas in memory.
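A rough sketch of what "in memory with pandas" could look like for a file of this shape. The column names (`date`, `indicator`, `number`) and the sample rows are assumptions, not the poster's actual data. pandas infers numeric types by default, which would turn `00417` into `417`, so the sketch pins that column to `str`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real 18,000-line CSV.
sample = io.StringIO(
    "date,indicator,number\n"
    "2022-01-03,A,00417\n"
    "2022-01-04,B,07250\n"
    "2022-01-05,A,00093\n"
)

# dtype keeps the number column as text (leading zeros intact);
# parse_dates turns the date column into real timestamps.
df = pd.read_csv(sample, dtype={"number": str}, parse_dates=["date"])

print(df["number"].tolist())
print(df.memory_usage(deep=True).sum(), "bytes in memory")

# One simple check among many: how often each final digit occurs.
last_digit_counts = df["number"].str[-1].value_counts().sort_index()
print(last_digit_counts)
```

With a real file, replace `sample` with the file path. At 18k rows the whole frame should occupy a few megabytes at most, so a faster storage format is unlikely to make a noticeable difference.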

eth0up 714 days ago

Hey, thanks for taking the time to reply. I won't be reaching 1M anytime soon, so good to know!