Hacker News
Show HN: Graphsignal – Machine learning profiler for training and inference (graphsignal.com)
35 points by dmitrim 1557 days ago
Hi HN, I'm the founder of Graphsignal (https://graphsignal.com). Graphsignal is a machine learning profiler. We created it to make ML profiling simple and usable. It provides performance summaries, ML operation and kernel-level statistics, and the detailed resource usage information needed to make training and inference faster and more efficient.

Profilers help fix performance issues, improve user experience and reduce computation costs. Machine learning benefits from such improvements in particular: model training jobs that run for hours or days could be made much shorter, and inference latency could be reduced, resulting in significantly lower costs and a better user experience.

I realized the benefits in one of my previous projects, where the model had to be retrained regularly and used for inference on a huge amount of data. Having spent the last decade developing profiling and monitoring tools, it seemed logical to me to use a profiler for the task. But since the training and inference were running remotely, I had a hard time using existing ML profilers.

TensorFlow and PyTorch provide built-in ML profilers, which utilize NVIDIA's profiling interface (CUPTI) under the hood for GPU profiling. One way to use those profilers is via locally installed TensorBoard or by logging the profiles.

In turn, Graphsignal Profiler (https://github.com/graphsignal/graphsignal) uses the built-in profilers as well as other tools to enable automatic profiling in any environment, including notebooks, training pipelines, periodic batch jobs, model serving and so on, without installing additional servers or software. It also allows teams to share and collaborate online. Basically, the profiles, along with environment and usage information, are automatically recorded and sent to Graphsignal, where they are available for analysis.
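
To make "profiles along with environment information are recorded" a bit more concrete, here is a rough sketch using only the Python standard library. It is not Graphsignal's API or data format: the `profile_step` helper and every field name in the record are hypothetical, and the step that would send the record anywhere is left out on purpose.

```python
import json
import platform
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def profile_step(name, records):
    """Hypothetical helper: time a block and append a stats-only record."""
    # Start tracemalloc only if it is not already running, so that we
    # never stop a trace that some other code started.
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        # Peak Python-level allocation since tracing began.
        _, peak = tracemalloc.get_traced_memory()
        if started_here:
            tracemalloc.stop()
        # The record holds timing, memory and environment info only,
        # never the model or the data that was processed.
        records.append({
            "step": name,
            "wall_time_s": elapsed,
            "peak_python_mem_bytes": peak,
            "python_version": platform.python_version(),
            "platform": platform.system(),
        })


records = []
with profile_step("train_batch", records):
    # Stand-in for a real training or inference step.
    total = sum(i * i for i in range(100_000))

# Serialized form that a client could upload (upload not shown).
payload = json.dumps(records)
print(payload)
```

A real ML profiler collects far more than this (operation and GPU kernel statistics, for example), but the shape is similar: wrap the step, collect statistics, ship a small record.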

Trying it out is easy: 1) sign up for a free account; 2) add the profiler to your ML code and run it; 3) view and analyze the profiles at graphsignal.com. Everything is described in the Quick Start Guide: https://graphsignal.com/docs/profiler/quick-start/.

I'm very excited to show it to you here and would appreciate any thoughts, comments and feedback!

3 comments

Looks cool, but let me offer one point of critique: the website is very sparse. Indeed, too sparse for me to hack in my e-mail and risk getting it into yet another newsletter loop.

I'd probably try to provide info on a few more benefits for the prospective user.

Thanks for the feedback, we'll be working on it for sure. An explanatory screencast is in the works now; other informational material, use cases, etc. are planned.

Why a web service rather than fully client-side software? $$$ business model?

Many training and inference workloads run in the cloud or on remote servers, and profiling them is not straightforward. Having a SaaS makes things much simpler and also enables additional features, such as team access and sharing. As far as data privacy is concerned, profiles do not contain any model or raw data, just resource usage, execution statistics, etc., which most users find acceptable to send to a third party. As for the business model, in my opinion SaaS makes it easier to monetize the offering and to keep the end product better and up to date in this case. But this is open; we may consider a free client-side version as well at some point.

Keen to give this a go. What version of TensorFlow does it work with?

I'd just use 2.8, but older 2.x versions should work too. If you encounter any issues, please let us know via the chat in your account.

Any support for 1.4?

To be more precise, >=2.2 is required for profiler support.
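
For anyone who wants to guard against an unsupported install, a minimal sketch of that version check follows. The `>=2.2` threshold comes from the reply above; the helper names `parse_major_minor` and `supports_profiler` are made up for illustration and are not part of Graphsignal or TensorFlow.

```python
import re

# Minimum TensorFlow (major, minor) for profiler support, per the thread.
MIN_TF_FOR_PROFILER = (2, 2)


def parse_major_minor(version):
    """Extract (major, minor) from a string such as '2.8.0' or '2.2.0rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def supports_profiler(version):
    """Return True if the given TensorFlow version meets the minimum."""
    # Tuple comparison avoids the string-comparison trap where '2.10' < '2.2'.
    return parse_major_minor(version) >= MIN_TF_FOR_PROFILER


for v in ("1.4.0", "2.1.0", "2.2.0", "2.8.0"):
    print(v, supports_profiler(v))
```

In real code you would pass `tf.__version__` instead of a hard-coded string.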