by jmspring
1849 days ago
"They lack the structure to make the data they contain readily accessed programatically, so step one is miserable screen scraping and data cleanup. That’s where a lot of people give up." I've been contemplating the idea of curated data and data provenance. Some data sources are easy to use, but others, as mentioned, need cleanup, and then you run into the question of original source vs. "cleaned up". Curated data would link the original source and include any tools/scripts used to clean it up (harder when manual intervention is needed). Maybe even a pipeline setup where original + tools = cleaned outcome. It's a pipe dream, and many data sources are still too unstructured for such automation.
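To make the "original + tools = cleaned outcome" idea a bit more concrete, here is a rough sketch of what a provenance record might look like. Everything in it is an assumption for illustration: the `clean` step, the `build_record` and `verify` helpers, the field names, and the example URL are all hypothetical, and manual cleanup steps are not covered.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Hex SHA-256 digest of a UTF-8 encoded string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def clean(raw: str) -> str:
    """Hypothetical cleanup step: trim each line and drop blank lines."""
    lines = [line.strip() for line in raw.splitlines()]
    return "\n".join(line for line in lines if line)


def build_record(source_url: str, raw: str, tool_script: str):
    """Clean the raw data and describe how the result was produced.

    tool_script is the text of the cleanup script, supplied by the caller
    (an assumption: in practice it might be a file or a commit hash).
    """
    cleaned = clean(raw)
    record = {
        "source_url": source_url,                 # link back to the original
        "original_sha256": sha256_hex(raw),       # pins the exact input
        "tool_sha256": sha256_hex(tool_script),   # pins the cleanup tool
        "cleaned_sha256": sha256_hex(cleaned),    # pins the output
    }
    return cleaned, record


def verify(raw: str, record: dict) -> bool:
    """Re-run the cleanup and check input and output against the record."""
    return (
        sha256_hex(raw) == record["original_sha256"]
        and sha256_hex(clean(raw)) == record["cleaned_sha256"]
    )


raw_data = "  a \n\n b  \n"
cleaned_data, provenance = build_record(
    "https://example.org/data.txt",  # placeholder URL
    raw_data,
    "strip whitespace; drop blank lines",
)
```

Anyone holding the original and the record could then call `verify` to confirm that the published "cleaned up" version really is what the tools produce from that source.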