| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by CrazyStat 104 days ago
	It's not necessarily the number of lines that motivates these tools. Say you're running an NLP pipeline where you want to do sentiment analysis on a large text corpus (tweets, for example) and then relate sentiment over time to some other variables. Each of those steps might only be a dozen lines of code, but the sentiment analysis might take a nonnegligable amount of time. If you can avoid rerunning it when only the later analysis has changed that can save you considerable time while iterating on the second step of the analysis. The old fashioned way to do this in R is to use the REPL and only rerun the lines of the script that have changed, with the earlier part staying in the environment. But it's easy to make mistakes doing it manually that way; having the computer track what has changed and needs to be rerun is much less error-prone.

1 comments

tylermw 104 days ago

Yes, the main benefit is caching and reproducibility: with targets (or any other DAG-based approach), you only recompute what needs to be recomputed and you are assured that no stale inputs or temporary analysis artifacts end up in the final product. If you don't own the underlying data sources and those sources can change at any point, a DAG-based approach helps ensure that.

link