| HN Mirror

by kahnjw 2502 days ago

This is just not true. The Python runtime is not the bottleneck. DL frameworks are DSLs written on top of piles of highly optimized C++ code that is executed as independently of the Python runtime as possible. Optimizing the Python or swapping it out for some other language is not going to buy you anything except a ton of work. We can argue about using Rust to implement the lower-level ops instead of C++. That might be sensible, though not from a performance perspective.

In a "serving environment" where latency actually matters, there is already a plethora of solutions for running models directly from a C++ binary, no Python needed.

This is a solved problem and people trying to re-invent the wheel with "optimized" implementations are going to be disappointed when they realize their solution doesn't improve anything.

2 comments

danielscrubs 2502 days ago

Yes. Let’s say you want certain features of a voice sample. You need to do that feature engineering every time before you send it to the model. Doesn’t it make sense to do it in C++ or Rust? This is currently already done. So if you already are starting to do parts of the feature engineering in Rust why not continue?

Yeah it’s not reasonable right now because Python has the best ecosystem. But that will not always be the case!

kahnjw 2502 days ago

I can’t exactly tell what you mean, but I think you’re confusing two levels of abstraction here. C++ (or Rust) and Python already work in harmony to make training efficient.

1. In TensorFlow and similar frameworks, the Python runtime is used to compose highly optimized operations to create a trainable graph.
2. C++ is used to implement those highly optimized ops.

If you have some novel feature engineering and you need better throughput than a pure Python op can give you, you’d implement the most general viable C++ (or Rust) op and then use that wrapped op in Python.

This is how large companies scale machine learning in general, though this applies to all ops not just feature engineering specific ones.
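A minimal sketch of that division of labor, using only the standard library. The op names (`scale`, `l2_norm`, `pipeline`) are hypothetical and are not TensorFlow's API; C-implemented builtins such as `map` and `math.fsum` stand in here for the framework's compiled ops, so treat it as an illustration of the pattern rather than how any particular framework does it.

```python
import math
import operator
from array import array
from itertools import repeat


def scale(xs, factor):
    """Hypothetical op: elementwise multiply. The loop runs inside the
    C-implemented map(), not in Python bytecode."""
    return array("d", map(operator.mul, xs, repeat(factor)))


def l2_norm(xs):
    """Hypothetical op: Euclidean norm, computed by C-implemented
    map() and math.fsum()."""
    return math.sqrt(math.fsum(map(operator.mul, xs, xs)))


def pipeline(xs, factor):
    """Python only composes the ops; it never touches individual elements."""
    return l2_norm(scale(xs, factor))


samples = array("d", [3.0, 4.0])
print(pipeline(samples, 2.0))  # prints 10.0
```

The Python layer decides which ops run and in what order; the per-element work stays in native code, which is where a custom C++ (or Rust) op would slot in.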

There is no way that Instagram is using a pure Python image processing lib to prep images for their porn detection models. That would cost too much money and take way too much time. Instead they almost certainly wrap some C++ in some Python and move on to more important things.

danielscrubs 2501 days ago

I know. That’s how we do it too. You don’t see any benefit in just using Rust instead of Python wrappers + C++? Especially for handling large data like voice, iff there were a good ecosystem and toolbox in place?

kahnjw 2501 days ago

Maybe, but then we’re no longer making an argument about performance, which is what I was responding to in your initial claim about “everything counts” and NumPy shuffle being slow. That’s a straw man argument that has zero bearing on actual engineering decisions.

EDIT: clarification in first sentence

pohl 2502 days ago

> The Python runtime is not the bottleneck.

This smells like an overgeneralization. Often things that aren’t a bottleneck in the context of the problems you’ve faced might at least be an unacceptable cost in the context of the 16.6 ms budget someone else is working within.

kahnjw 2502 days ago

In what circumstance would one measure an end-to-end time budget in training? What would that metric tell you? You don't care about latency, you care about throughput, which can be scaled nearly completely independently of the "wrapper language" (for lack of a better term). In this case that's Python.

It seems some commenters on this thread have not really thought through the lifecycle of a learned model and the tradeoffs existing frameworks exploit to make things fast _and_ easy to use. In training we care about throughput. That's great, because we can use a high-level DSL to construct a graph that trains in a highly concurrent execution mode or on dedicated hardware. Using the high-level DSL is what allows us to abstract away these details and still get good training throughput. Tradeoffs still bleed out of the abstraction (batch size, network size, architecture, etc. all affect how efficient certain hardware will be), but that is inevitable when you're moving from CPU to GPU to ASIC.
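A back-of-the-envelope sketch of why throughput is mostly independent of the wrapper language: if each training step pays a fixed wrapper overhead once per batch, that cost is amortized as the batch grows. The function name and the numbers (1 ms of Python overhead per step, 10 µs of native compute per item) are illustrative assumptions, not measurements.

```python
def throughput(batch_size, overhead_s, per_item_s):
    """Items per second when each batch pays a fixed wrapper overhead once."""
    return batch_size / (overhead_s + batch_size * per_item_s)


# Assumed, illustrative costs: not measured from any real framework.
PY_OVERHEAD_S = 1e-3   # per-step cost of the Python wrapper
NATIVE_ITEM_S = 1e-5   # per-item cost inside the native op

for batch_size in (1, 32, 1024):
    rate = throughput(batch_size, PY_OVERHEAD_S, NATIVE_ITEM_S)
    print(f"batch={batch_size:5d}  items/s={rate:10.0f}")
```

Under these assumptions the rate climbs toward the native ceiling of 1 / `NATIVE_ITEM_S` as the batch grows, so the wrapper cost matters less and less. A single-item, latency-bound call pays the full overhead every time, which is the serving case.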

When you are done training and you want to use the model in a low-latency environment, you use a C/C++ binary to serve it. Latency matters there, so exploit the fact that you're no longer defining a model (no need for a fancy DSL) and just serve it from a very simple but highly optimized API.
