Ask HN: Which CS topics should I study to be able to work with data efficiently?
2 points
by curious16
1015 days ago
Let me be clear upfront about what I mean by working with data. Generally, working with data means that you have some form of data and you do some kind of analysis on it. Once I have very clean, structured data available, data analysis becomes easy. But what if I have data that is not clean, in the form of files or databases? Then I have to read the files, clean them up and organise the contents into data structures according to my needs (is THIS called parsing?). I am talking about this preprocessing part.

What CS or programming subjects should I study to become somewhat of an expert in cleaning, preprocessing and structuring large numbers of files in batches? I am also interested in the second part of the pipeline, where I analyse the data and produce output, both as good visualisations and as output data to be stored in files. Any books, courses or other resource pointers will be appreciated.

P.S.: Files can be anything. They are just streams of bytes: images, audio, video, text, CSV.
There are probably useful free data sets out on the internet. Learning Python is useful.
I know AWS has a heap of services catering to data pipelines; maybe see if there’s a free tier on anything.
I’ve most commonly heard the fixing of bad data called “data cleansing” or “data scrubbing”.
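
To make "data cleansing" a bit more concrete, here is a minimal Python sketch of the read, clean and structure steps from the question, using only the standard library. The sample data, the column names (`name`, `age`) and the `clean_rows` helper are all made up for illustration; real data would need its own rules, so treat this as one possible shape rather than the way to do it.

```python
import csv
import io


def clean_rows(raw_text):
    """Parse CSV text and return a list of cleaned dict records.

    Hypothetical rules, chosen only for this example:
    - header names are trimmed and lower-cased
    - values are trimmed of surrounding whitespace
    - rows without a name are dropped
    - ages that are not integers become None
    """
    reader = csv.DictReader(io.StringIO(raw_text))
    cleaned = []
    for row in reader:
        # Normalise keys and values; skip the None key that DictReader
        # uses for surplus fields, and treat missing values as "".
        record = {
            k.strip().lower(): (v or "").strip()
            for k, v in row.items()
            if k is not None
        }
        # Drop rows that have no usable name.
        if not record.get("name"):
            continue
        # Convert age to an int, marking unparseable values as missing.
        try:
            record["age"] = int(record["age"])
        except (KeyError, ValueError):
            record["age"] = None
        cleaned.append(record)
    return cleaned


# Made-up messy input: padded headers, a row with no name, a non-numeric age.
SAMPLE = "Name , Age\n Alice , 30 \n,41\nBob,unknown\n"
records = clean_rows(SAMPLE)
print(records)
```

Turning the raw text into rows and fields is the part usually called parsing; deciding what to do with blank names or bad ages is the cleansing part. For large batches of files, the same function could be applied per file in a loop.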