| HN Mirror



	by w0rd-driven 2908 days ago
	I crawl a specific site, somewhere up to 50 unique URLs a day. I store both the unparsed full HTML as one file and the JSON I'm looking for as a separate file. The idea is that if something breaks, instead of taking the hit of making the call again, I already have the data and can just reprocess it. It's come in extremely handy when a site redesign changed the DOM and broke the parser. I do the same at $dayJob, where I'm parsing the results of an internal API: instead of making a call later that may not return the same data, I store the JSON and just process that. Treating network requests as an expensive operation, even though they're not really, helped me come up with some clever ideas I'd never had before. It's a premature optimization considering I've had something like a 0.000001% failure rate, but being able to replay that one breakage made debugging an esoteric problem waaaaaay simpler than it would've been otherwise.
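
A minimal sketch of the store-then-replay idea described above, in Python. Everything named here is an assumption for illustration and not the poster's actual code: the `scrape` and `cache_paths` functions, the SHA-256 file naming, and the `fetch` and `parse` callables that the caller supplies.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Tuple


def cache_paths(cache_dir: Path, url: str) -> Tuple[Path, Path]:
    """Return (raw HTML path, parsed JSON path) for a URL (hypothetical naming scheme)."""
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{key}.html", cache_dir / f"{key}.json"


def scrape(
    url: str,
    fetch: Callable[[str], str],
    parse: Callable[[str], dict],
    cache_dir: Path,
) -> dict:
    """Fetch a page at most once, keep the raw HTML, and (re)parse from the stored copy."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    html_path, json_path = cache_paths(cache_dir, url)

    if html_path.exists():
        # Replay: reuse the stored response instead of hitting the network again.
        html = html_path.read_text(encoding="utf-8")
    else:
        html = fetch(url)
        # Store the unparsed response before parsing, so a parser crash loses nothing.
        html_path.write_text(html, encoding="utf-8")

    data = parse(html)
    # Store the extracted data as a separate file next to the raw HTML.
    json_path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

With this shape, when a redesign breaks the parser, you can fix `parse` and call `scrape` again for the same URLs: the stored HTML is reprocessed and no new request is made.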


pdimitar 2907 days ago

Off-topic: I so wish I worked for a company where my work involves scraping and storing and analyzing data. :(


hoju 2907 days ago

Now is a good time to work in this field, since data science is hot and companies need web scrapers to provide the data for these models. At least that has been my experience in finance. Try applying!


pdimitar 2906 days ago

I have zero experience in data science though. I am a pretty solid and experienced programmer and can learn it all but... don't know. Maybe I should just try indeed.

Do you have any recommendations for places and/or interview practices?
