by hooande
6647 days ago
I've done my share of screen scraping, gathering all kinds of data: movies, sports, finance, you name it. Here are three things I can tell you:

1. Take the time to get very familiar with regular expressions. If you think you know your regex pretty well, go to the docs or get a book, find three things you don't understand, and learn them fully. Then find three more.

2. The data doesn't have to be perfect. In most cases you can clean it after you've stored it. It's generally better to grab more than you think you might need (in terms of data or the HTML/formatting around the data) and then go back and clean it later.

3. Generally, my most successful data mining algorithms involve a lot of hacks. There are very few clean formulas. Usually I have to play with the data for a while and fix a lot of one-offs and special cases, and then it ends up coming out OK.
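
To make points 1 and 2 concrete, here is a rough sketch of the "grab loosely, clean later" approach using Python's standard `re` module. The HTML fragment, the field names (`title`, `year`, `gross`), and the helpers `scrape` and `clean` are all made up for illustration; real pages will need their own patterns and their own pile of special cases.

```python
import re

# Hypothetical markup and made-up values, for illustration only.
SAMPLE_HTML = """
<tr class="movie"><td>Example Movie A</td><td> 1995 </td><td>$1,234,567</td></tr>
<tr class="movie"><td>Example Movie B</td><td>1998</td><td>n/a</td></tr>
"""

# Capture loosely: non-greedy named groups, no validation at scrape time.
ROW_RE = re.compile(
    r'<tr class="movie">\s*'
    r'<td>(?P<title>.*?)</td>\s*'
    r'<td>(?P<year>.*?)</td>\s*'
    r'<td>(?P<gross>.*?)</td>\s*</tr>',
    re.DOTALL,
)


def scrape(html):
    """Return raw records, keeping the matched HTML so it can be re-parsed later."""
    return [dict(m.groupdict(), raw_html=m.group(0)) for m in ROW_RE.finditer(html)]


def clean(record):
    """Clean an already-stored record; one-offs and special cases live here."""
    gross_digits = re.sub(r"[^\d]", "", record["gross"])
    return {
        "title": record["title"].strip(),
        "year": int(record["year"].strip()),
        # Values with no digits at all (such as "n/a") become None.
        "gross": int(gross_digits) if gross_digits else None,
    }


raw_records = scrape(SAMPLE_HTML)          # store these as-is
cleaned = [clean(r) for r in raw_records]  # clean in a separate pass
```

Keeping `raw_html` next to each record means that when a new special case turns up, the cleaning step can be rerun without fetching the page again.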