by chollida1 3068 days ago
People who know how to clean and work with large disparate datasets.

I guess it's not surprising, given that Jim Simons of Renaissance Technologies fame indicated that of his first 10 employees, almost half were cleaning data.

It's a lot more difficult than people realize, and you know right away that someone has no real idea of the scale and difficulty of the problem when they suggest that a shell script can solve most of the data issues.

I think Renaissance Technologies actually illustrates just how much of a competitive advantage a good data cleaning and backtesting platform is.

A couple of former RenTech people left for Millennium Partners and for a couple of years.

Even though these employees were good enough to work at RenTech and had insights into the strategies employed there, they weren't able to be successful on their own without the huge backtesting and data cleaning framework at RenTech.

1 comments

What are some of the types of problems that needed to be solved when cleaning data that required heavy tooling?
Lots of data typically means streams of data, which means processes running 24/7 moving data and files around. Streams, connectivity, and processes can cut out periodically, which means you need some logic to reconcile and fill the gaps and also to restart the processes. You will also need some data QA, as it is perfectly reasonable to get 'extra' data, either as duplicates or as metadata bleeding into content.
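To make the gap and duplicate problem concrete, here is a minimal sketch, not anything RenTech or any real shop is known to run. It assumes records arrive as (timestamp, payload) tuples with hashable payloads and that the feed has a known expected interval; the function name and record shape are made up for illustration.

```python
from datetime import datetime, timedelta

def dedupe_and_find_gaps(records, expected_interval):
    """Drop exact duplicate records and report holes in the stream.

    records: iterable of (timestamp, payload) tuples, possibly out of
             order and containing duplicates (hypothetical shape).
    expected_interval: timedelta; any larger spacing counts as a gap.
    Returns (clean_records, gaps) where gaps is a list of
    (last_seen, next_seen) timestamp pairs to backfill.
    """
    seen = set()
    clean = []
    # Sort on the timestamp only, so payloads never need to be comparable.
    for ts, payload in sorted(records, key=lambda r: r[0]):
        key = (ts, payload)
        if key in seen:
            continue  # exact duplicate, e.g. replayed after a reconnect
        seen.add(key)
        clean.append((ts, payload))

    gaps = []
    for (prev_ts, _), (next_ts, _) in zip(clean, clean[1:]):
        if next_ts - prev_ts > expected_interval:
            gaps.append((prev_ts, next_ts))
    return clean, gaps

if __name__ == "__main__":
    t0 = datetime(2020, 1, 1)
    feed = [
        (t0, "a"),
        (t0 + timedelta(minutes=1), "b"),
        (t0 + timedelta(minutes=1), "b"),  # duplicate
        (t0 + timedelta(minutes=5), "c"),  # three minutes are missing
    ]
    clean, gaps = dedupe_and_find_gaps(feed, timedelta(minutes=1))
    print(len(clean), gaps)
```

In practice the hard part is what happens after a gap is found (re-requesting data, restarting the process), which this sketch does not attempt.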

If your data is from disparate sources then you may need to normalize timestamps across records from different sources, you may be dealing with different languages, identical tokens that mean different things depending on the source, different formatting of numeric fields, etc.
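A rough sketch of the timestamp and numeric normalization, under some loud assumptions: each source documents its own timestamp format, a fixed UTC offset (so no daylight saving handling), and its decimal and thousands separators. The helper names and parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal

def to_utc(text, fmt, utc_offset_hours):
    """Parse a source-local timestamp string into an aware UTC datetime.

    Assumes the source uses a fixed UTC offset; real feeds with
    daylight saving changes need a proper time zone database.
    """
    source_tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(text, fmt).replace(tzinfo=source_tz)
    return local.astimezone(timezone.utc)

def parse_number(text, decimal_sep=".", thousands_sep=","):
    """Parse a numeric field given the source's separators.

    Thousands separators are removed first, then the decimal
    separator is rewritten to "." before handing off to Decimal.
    """
    cleaned = text.strip().replace(thousands_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)

if __name__ == "__main__":
    # Same instant reported by two hypothetical sources.
    a = to_utc("2020-01-01 09:30:00", "%Y-%m-%d %H:%M:%S", -5)
    b = to_utc("01/01/2020 15:30", "%d/%m/%Y %H:%M", 1)
    print(a == b)
    # Same value in US-style and European-style formatting.
    print(parse_number("1,234.56") == parse_number("1.234,56", ",", "."))
```

The per-source parameters are the point: the parsing itself is trivial, but someone has to find out and maintain what each source actually sends.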

This is an incomplete list; the GP probably has a more exhaustive list of problem types...

i.e., a subset of the data engineering role