I've developed PipeFunc, a new Python library designed to simplify the creation and execution of DAG-based computational pipelines, specifically targeting scientific computing and data analysis workflows. It's built for speed and ease of use, with a focus on minimizing boilerplate and maximizing performance.

Key features:

• Automatic Dependency Resolution: PipeFunc determines the execution order of functions from their dependencies, so you never have to manage that order by hand. You define the relationships, and PipeFunc figures out the order.
• Ultra-Low Overhead: The library adds minimal overhead, measured at around 15µs per function call, which makes it suitable for performance-critical applications.
• Effortless Parallelism: PipeFunc automatically parallelizes independent tasks and is compatible with any `concurrent.futures.Executor`. This lets you use multi-core processors or distribute computation across a cluster (e.g., using SLURM).
• Built-in Parameter Sweeps: The `mapspec` feature provides a concise way to define and execute N-dimensional parameter sweeps, which are often crucial in scientific experiments, simulations, and hyperparameter optimization. It uses an index-based approach to run the sweep in parallel with minimal overhead.
• Advanced Caching: Multiple caching options help avoid redundant computations, saving time and resources.
• Type Safety: PipeFunc uses Python's type hints to validate the consistency of data types across the pipeline, reducing the risk of runtime errors.
• Debugging Support: An `ErrorSnapshot` feature captures detailed error state information, including the function, arguments, traceback, and environment, to simplify debugging and error reproduction.
• Visualization: PipeFunc can generate visualizations of your pipeline to aid in understanding and debugging.

Comparison with existing tools:

• vs. Dask: PipeFunc provides a higher-level, declarative approach to pipeline construction. It handles task scheduling and execution automatically based on function definitions and `mapspec`s, whereas Dask requires more explicit task definition.
• vs. Luigi/Airflow/Prefect/Kedro: These tools are primarily designed for ETL and event-driven workflows. PipeFunc, in contrast, is optimized for scientific computing and computational workflows that require fine-grained control over execution, resource allocation, and parameter sweeps.

Use Cases:

• Scientific simulations and data analysis
• Machine learning pipelines (preprocessing, training, evaluation)
• High-performance computing (HPC) workflows
• Complex data processing tasks
• Any scenario involving interconnected functions where performance and ease of use are important

I'd appreciate any feedback, especially regarding performance, usability, and potential applications in different scientific domains.

Links:

Documentation: https://pipefunc.readthedocs.io
Source Code: https://github.com/pipefunc/pipefunc
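
For readers who want a feel for the two core ideas above (resolving execution order from function dependencies, and handing independent tasks to a `concurrent.futures.Executor`), here is a minimal standard-library sketch. It is not PipeFunc's implementation or API. The helper `run_pipeline` and the example functions `add`, `scale`, and `combine` are hypothetical names made up for illustration, and the sketch assumes that each function's parameter names identify the values it needs.

```python
import inspect
from concurrent.futures import Executor, ThreadPoolExecutor
from graphlib import TopologicalSorter  # Python 3.9+


def run_pipeline(funcs, inputs, executor: Executor):
    """Run `funcs` (output name -> function) in dependency order.

    Illustrative only: a parameter name is looked up either in `inputs`
    or among the outputs of the other functions.
    """
    params = {name: list(inspect.signature(fn).parameters)
              for name, fn in funcs.items()}
    # Each node maps to the set of functions whose output it consumes.
    graph = {name: {p for p in ps if p in funcs} for name, ps in params.items()}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle

    results = dict(inputs)
    while sorter.is_active():
        # Everything returned here has all dependencies satisfied,
        # so the whole batch can be submitted at once.
        ready = sorter.get_ready()
        futures = {
            name: executor.submit(
                funcs[name],
                **{p: results[p] for p in params[name] if p in results},
            )
            for name in ready
        }
        # Simplification: wait for the whole batch before asking for more work.
        for name, future in futures.items():
            results[name] = future.result()
            sorter.done(name)
    return results


def add(a, b):
    return a + b


def scale(a):
    return a * 10


def combine(c, d):
    return c + d


# "c" and "d" do not depend on each other, so they are submitted together;
# "e" runs only after both have finished.
funcs = {"c": add, "d": scale, "e": combine}
with ThreadPoolExecutor() as pool:
    results = run_pipeline(funcs, {"a": 1, "b": 2}, pool)

print(results["e"])  # 13
```

Because the helper only relies on the `Executor` interface, swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` or another compatible executor changes where the work runs without touching the pipeline definition, which is the design point the feature list describes.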