|
Great questions. I worked on similar problems in the weather/ag space for a few years, trying to minimize the time between when data was acquired and when it was ready to inform a decision. We threw every rule out the window in the name of performance _when fetching raw data from external sources_. So we had weather station networks, NOAA forecast runs, and NASA satellite data in a workable schema in our shop way faster than average. Mix of C, PowerShell, Perl, and the nonstandard parts of T-SQL, highly parallelized, tricky but fast. After the "workable schema" was established, the rules came back and we acted more responsibly. Smart instead of clever. Ran this stuff all day long, getting every piece of data ASAP. For things that can only be calculated with a full day of data, we poked and prodded the meteorologists to express them as "partial aggregates", which to me were just like the map steps before an EOD reduce. Took a lot of mutual understanding and iterating, but it was worth it in the end. By the time the ultimate data source (satellite or radar site for us) posted its last hour of data, we were 95% done with the day's computation work. We'd do our last step, publish our numbers, and bam, our ag clients had this stuff a day earlier than they were used to. |
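
To make the "partial aggregates" idea concrete, here is a minimal sketch in Python (not the C/PowerShell/Perl/T-SQL stack described above). It assumes a simple daily mean/min/max over temperature readings; the names `PartialAgg`, `map_hour`, and `reduce_day` are hypothetical and only there for illustration. Each hour's batch is folded into a small mergeable summary as soon as it arrives, so the end-of-day step only has to combine summaries that already exist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartialAgg:
    """Mergeable summary of the readings seen so far (hypothetical shape)."""
    count: int = 0
    total: float = 0.0
    low: float = float("inf")
    high: float = float("-inf")

    def merge(self, other: "PartialAgg") -> "PartialAgg":
        # Merging is associative and commutative, so hourly batches
        # can be combined in any order and in parallel.
        return PartialAgg(
            self.count + other.count,
            self.total + other.total,
            min(self.low, other.low),
            max(self.high, other.high),
        )


def map_hour(readings: list[float]) -> PartialAgg:
    """'Map' step: summarize one hour of readings as soon as it arrives."""
    agg = PartialAgg()
    for value in readings:
        agg = agg.merge(PartialAgg(1, value, value, value))
    return agg


def reduce_day(partials: list[PartialAgg]) -> dict[str, float]:
    """'Reduce' step: run once, after the last hour has been posted."""
    day = PartialAgg()
    for partial in partials:
        day = day.merge(partial)
    if day.count == 0:
        raise ValueError("no readings for the day")
    return {"mean": day.total / day.count, "min": day.low, "max": day.high}


# Illustrative data only: three hourly batches of temperature readings.
hours = [[10.0, 12.0], [14.0], [8.0, 16.0]]
partials = [map_hour(batch) for batch in hours]  # done throughout the day
result = reduce_day(partials)                    # cheap end-of-day step
```

The point of the design is that the summary carries just enough state (count, sum, min, max) to finish the calculation later, which is the kind of reformulation the meteorologists had to agree to. Statistics that do not decompose this cleanly would need a different summary or an approximation.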