Show HN: Graphsignal – ML profiler to speed up training and inference (github.com)
16 points by dmitrim 1452 days ago
Hi, Graphsignal founder here. We launched Graphsignal earlier this year to make machine learning profiling practical and easy to use. Basically, it enables the profile-optimize-benchmark loop: for example, making inference faster by optimizing an ML model while still maintaining accuracy.

We've made a lot of progress that I wanted to share.

The profiler now natively supports TensorFlow, Keras, PyTorch, PyTorch Lightning, Hugging Face, XGBoost and JAX frameworks along with built-in support for distributed workloads.

Profiles now include tracing information in Chrome trace format. Process and GPU utilization data has been extended as well.
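
For readers unfamiliar with the Chrome trace format mentioned above, here is a minimal sketch of what such a file looks like. It is not Graphsignal's API or output; the helper names and the step timings are hypothetical. It only illustrates the general Trace Event JSON layout, where each "complete" event has `ph` set to `"X"` and `ts`/`dur` given in microseconds.

```python
import json


def complete_event(name, start_us, dur_us, pid=0, tid=0):
    """Build one 'complete' (ph="X") trace event; times are in microseconds."""
    return {
        "name": name,
        "ph": "X",
        "ts": start_us,
        "dur": dur_us,
        "pid": pid,
        "tid": tid,
    }


def build_trace(steps):
    """steps: iterable of (name, start_us, dur_us) tuples."""
    return {"traceEvents": [complete_event(n, s, d) for n, s, d in steps]}


# Hypothetical timings for two training steps (illustration only).
trace = build_trace([("train_step_0", 0, 1500), ("train_step_1", 1500, 1420)])
trace_json = json.dumps(trace, indent=2)
```

Writing `trace_json` to a file should give something that trace viewers accepting this format can load.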

It is now possible to monitor all run metrics. Useful for long runs.

Profiled workloads are now sharable across teams and publicly (if enabled).

I'm excited to show it here and appreciate any thoughts, comments and feedback!

2 comments

Related:

Show HN: Graphsignal – Machine learning profiler for training and inference - https://news.ycombinator.com/item?id=30628618 - March 2022 (8 comments)

Can it measure internode traffic for distributed training runs? This is something I needed recently and couldn’t achieve using nccl-test utilities like mpirun. ib_write_bw also didn’t work, I suspect because of multiple virtual links.
For now it only tries to extract the NCCL time percentage from the profile, if available, and show it in the profile summary. Some hints could be in the step trace timeline as well. We are planning to record some NCCL-related counters separately as well.
The problem with nccl is it reports combined bandwidth: nvlink (intranode) and network. I want to see the network traffic, for example to identify a network link bottleneck when changing model or pipeline parallelism configuration.

p.s. if you solve this I’ll become a paying customer.

Understood, we'll definitely think about the network part. Just in case it helps: if `nvidia-smi nvlink -gt d` is useful for you in this context, there is a related metric, NVLink Throughput Rate, that you can use to compare and monitor runs. At least you might get an idea of whether and how the internal links are utilized.
Yes, I thought about it. In theory I can measure the total traffic with mpirun, then subtract the nvlink traffic (as measured by nvidia-smi) from it. However, I'm not 100% sure that the nvlink traffic from nvidia-smi is the same as the nvlink component of the mpirun measurement. I'd prefer to measure internode traffic directly (e.g. using Mellanox tools) as a more reliable method.
Yes, exactly this.
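
One possible way to approach the direct measurement discussed above, as a rough sketch rather than anything from Graphsignal or the Mellanox tools: on Linux, InfiniBand ports expose `port_xmit_data` and `port_rcv_data` counters under `/sys/class/infiniband`, counted in 4-byte units. Sampling them before and after an interval gives a per-port network throughput that does not include NVLink. The device name `mlx5_0` and the function names are assumptions, and whether these counters behave as expected with multiple virtual links is untested here.

```python
import time
from pathlib import Path

IB_ROOT = Path("/sys/class/infiniband")
WORD_BYTES = 4  # port_xmit_data / port_rcv_data are counted in 4-byte units


def read_port_bytes(device, port=1, root=IB_ROOT):
    """Return cumulative (tx_bytes, rx_bytes) for one InfiniBand port."""
    counters = Path(root) / device / "ports" / str(port) / "counters"
    tx = int((counters / "port_xmit_data").read_text()) * WORD_BYTES
    rx = int((counters / "port_rcv_data").read_text()) * WORD_BYTES
    return tx, rx


def throughput(before, after, seconds):
    """Bytes per second (tx, rx) between two cumulative samples."""
    return tuple((a - b) / seconds for b, a in zip(before, after))


def sample(device, port=1, interval=1.0, root=IB_ROOT):
    """Measure (tx, rx) bytes per second on a port over `interval` seconds."""
    before = read_port_bytes(device, port, root)
    time.sleep(interval)
    after = read_port_bytes(device, port, root)
    return throughput(before, after, interval)


# Example call on a node with an adapter named "mlx5_0" (hypothetical name):
#   tx_bps, rx_bps = sample("mlx5_0", port=1, interval=5.0)
```

Running `sample` on each node while changing the model or pipeline parallelism configuration might help show which network link saturates, though the counters can stop incrementing at their maximum value on some hardware, so long intervals should be treated with care.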