HN Mirror

by throwawaymath 2944 days ago

Speaking as someone who uses Python with half a terabyte of memory, I think you're underestimating how much memory these labs will use. In my experience most HPC architecture is optimized first by rewriting the code in the same (already fast) language or library, then by increasing hardware resources (especially among distributed nodes), then by seeking a new library in the same ecosystem, and finally by moving to a new language if they have to.

Moving to a new language has more friction than basically anything else unless there's a real language feature missing or the budget doesn't allow for more compute hardware. Hundreds of gigabytes is well below where academic and industry labs will start having to think about these problems. It's going to be really tough to displace Python with anything as general-purpose.

This is all to say that I buy that Julia can shine more than Python for I/O bound HPC, but it really shouldn't be I/O bound until you have terabytes of data (and likely tens of terabytes). And aside from that, the Python numerical computing ecosystem includes a lot more than just Numpy and Pandas. As other commenters have mentioned, you can use Dask if your hot data has grown into the terabyte range. Anaconda includes a lot of libraries which can bail you out of situations once you've left the familiar world of Pandas data frames.

2 comments

Avshalom 2943 days ago

It's not so much hard drive I/O as it is network I/O: both the obvious waiting on non-parallel data and the time it takes to get something physically across a room from one CPU to another.

https://en.m.wikipedia.org/wiki/Gustafson%27s_law

For instance.
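
For reference, Gustafson's law gives the scaled speedup on N processors as S = N - s(N - 1), where s is the fraction of run time spent in the serial part. A minimal sketch follows; the 5% serial fraction is a made-up number, and counting communication waits toward s is a simplifying assumption, not something the law itself prescribes.

```python
def gustafson_speedup(n_procs: int, serial_fraction: float) -> float:
    """Scaled speedup under Gustafson's law: S = N - s * (N - 1)."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    return n_procs - serial_fraction * (n_procs - 1)


# Hypothetical workload: 5% of the time is serial (including waiting on the network).
for n in (1, 16, 256):
    print(f"N = {n:4d}  scaled speedup = {gustafson_speedup(n, 0.05):.2f}")
```

The larger the share of time spent waiting rather than computing in parallel, the further the speedup falls below N.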

VHRanger 2944 days ago

All high performance code is IO bound.

Even highly tuned C++ code spends most of its time on the CPU waiting on cache misses. It's a pretty exceptional use case where your code is not IO bound.

throwawaymath 2944 days ago

Wait what? I'm using "I/O bound" in the sense of CPU operations waiting on reads/writes to e.g. a disk. If you consider an operation waiting on the cache to be I/O bound, what do you consider to be CPU bound or cache bound? And what terminology would you use to refer to operations which are waiting on the disk as opposed to the cache? What about memory instead of the cache?

I think it's useful to differentiate between an operation which is purely CPU bound (i.e. it's just constantly calculating without a reference) and an operation which is cache bound (faster than memory but still bound over the CPU). But calling operations I/O bound when they're sufficiently optimized that they live in the cache and don't even hit memory, let alone disk, is an abuse of terminology. In the context of what I'm talking about, most HPC is absolutely not I/O bound unless it's using SATA/SAS drives instead of cache and memory.

And circling back to my original point, most research labs which can afford it will sufficiently optimize their code and hardware so that they don't hit the disk unless they're working with terabytes of data. Python, C++ and R provide numerous packages between the three of them for numerical computing across each of these bottlenecks, so I don't think Julia can rely on differentiating itself by shining in an I/O bound setting (i.e., waiting on disk). And if it does, "hundreds of gigabytes" isn't really the data size in which people are (in my opinion) going to overcome the friction of a new language and ecosystem just to harvest those benefits.

VHRanger 2942 days ago

CPU bound in HPC circles usually means you are within an order of magnitude of the maximum FLOPS your CPU can deliver (e.g. the CPU's clock rate in GHz times the number of cores, as a number of operations, expressed in terms of GB/s of data). That's almost never the case.
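
As a rough sketch of that arithmetic: peak throughput is approximately clock rate times core count times floating-point operations per cycle per core. The per-cycle factor is an addition here (the comment mentions only clock and cores), and every hardware number below is hypothetical.

```python
def peak_gflops(clock_ghz: float, cores: int, flops_per_cycle: float) -> float:
    """Theoretical peak in GFLOP/s: clock (GHz) * cores * FLOPs per cycle per core."""
    return clock_ghz * cores * flops_per_cycle


def roughly_cpu_bound(measured_gflops: float, peak: float, threshold: float = 0.1) -> bool:
    """True if measured throughput is within an order of magnitude of peak."""
    return measured_gflops / peak >= threshold


# Hypothetical machine: 3 GHz, 8 cores, 16 FLOPs per cycle per core (assumed SIMD + FMA).
peak = peak_gflops(3.0, 8, 16)
measured = 12.0  # hypothetical measured GFLOP/s for some kernel

print(f"peak = {peak:.0f} GFLOP/s, achieved = {measured / peak:.1%} of peak")
print("CPU bound by this definition:", roughly_cpu_bound(measured, peak))
```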

Operations waiting on the disk are "disk bound". "IO bound" is the generic term for data access, rather than CPU processing, being the bottleneck (which is usually the case; it's just a question of where).

I agree that Julia probably doesn't have a huge advantage. That said, having been stuck with slow Python code before, you're often stuck with rewriting large parts of the system in C++ or another low-level language. That's the reality of Python, but at least it's not the most painful thing to do.

dnautics 2944 days ago

The old joke is "HPC is the art of taking a CPU-bound computation and making it I/O bound".

repsilat 2944 days ago

It's not an especially weird use of terminology. "I/O bound" means "waiting on reads and/or writes." Cache or RAM or disk, it's all about communication throughput and latency, efficient access patterns etc.

In a compute-bound workload the CPU spends the bulk of its time actually retiring instructions, not stalled waiting on data.

Think about it from the perspective of what the FPU sees -- once it has done that FMA operation, does it have the data it needs to do the next one, or does it need to sit on its hands for a while?

The cache hierarchy, cache-friendly data structures and algorithms -- they all aim to reduce time spent waiting on IO.
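
One common way to put numbers on this distinction is the roofline model, which nobody in the thread names, so treat it as a swapped-in illustration. Attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity (FLOPs per byte moved). All hardware figures below are hypothetical.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, flops_per_byte: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


def bound_by(peak_gflops: float, bandwidth_gbs: float, flops_per_byte: float) -> str:
    """Report which resource caps throughput under the roofline model."""
    if bandwidth_gbs * flops_per_byte >= peak_gflops:
        return "compute"
    return "memory"


PEAK = 384.0      # hypothetical peak, GFLOP/s
BANDWIDTH = 50.0  # hypothetical memory bandwidth, GB/s

# y[i] = a * x[i] + y[i] on doubles: 2 FLOPs per 24 bytes (load x, load y, store y).
axpy_intensity = 2 / 24
print("axpy:", bound_by(PEAK, BANDWIDTH, axpy_intensity),
      f"{attainable_gflops(PEAK, BANDWIDTH, axpy_intensity):.2f} GFLOP/s")

# A kernel that reuses data heavily from cache (assumed 32 FLOPs per byte).
print("high reuse:", bound_by(PEAK, BANDWIDTH, 32.0),
      f"{attainable_gflops(PEAK, BANDWIDTH, 32.0):.2f} GFLOP/s")
```

In this model the low-intensity kernel leaves the FPU waiting on data no matter how fast the core is, which is the "sit on its hands" case.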

throwawaymath 2944 days ago

I understand that, but I don't think it's useful to use the term "I/O" in the theoretical sense of the concurrency problem. It's also not typical nomenclature - for example, see https://stackoverflow.com/questions/868568/what-do-the-terms.... We have terms like "CPU bound", "cache bound", and "memory bound", but we don't really have "disk bound" in common usage. This is because the common usage is "I/O bound".

Theoretically speaking we can model any process as one which has to wait and one which doesn't have to wait. But in modern usage we have a variety of types and speeds for reads and writes. When I/O simply means reads and writes, you lose all the practical granularity you'd otherwise get by decomposing the reads/writes into different bottlenecks. It's philosophically elegant, but practically unhelpful for optimizing HPC and distributed systems when, as the responder said, it's rare to be CPU bound.

I also think that the context of my original comment is pretty clearly using I/O in the modern sense of disk usage. Responding with a correction that everything is I/O bound is vacuous, not insightful.

ChrisRackauckas 2944 days ago

>it's rare to be CPU bound.

Not when solving (partial) differential equations, which is what I am using Julia for.

repsilat 2944 days ago

Also: CPU-bound problems used to be much more common. We spent a lot of effort making CPUs fast though, and memory access didn't keep up. It's why we have these deep caches, it's why we have out of order execution and speculative branch prediction -- to keep processors fed with data.

Used to be you could count cycles, now that's only really true in the simplest of cases with trivial memory access patterns. Now high-information branches are much more expensive than cycle counting would have us believe, ditto pointer-chasing.

ssfrr 2944 days ago

Better cache access patterns are one of the reasons Julia's dot-broadcasting[1] is super cool. If you have big vectors `a` and `b`, the expression `sin.(a .+ exp.(b))` will do a single pass over your data, calling `sin(a[i]+exp(b[i]))` for each element, rather than creating big temporary arrays for the intermediate expressions and looping through multiple times.

Then because putting all those dots can be unwieldy, there's the `@.` macro which puts a dot on all your function calls.

[1]: https://julialang.org/blog/2017/01/moredots
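
The difference between the two evaluation strategies can be sketched in plain Python. This uses ordinary lists rather than Julia arrays, and the function names `unfused` and `fused` are made up for illustration; it shows the shape of the computation, not Julia's actual implementation.

```python
import math


def unfused(a, b):
    """Evaluate sin(a + exp(b)) one operation at a time, with temporaries."""
    tmp1 = [math.exp(y) for y in b]          # first pass, first temporary
    tmp2 = [x + t for x, t in zip(a, tmp1)]  # second pass, second temporary
    return [math.sin(t) for t in tmp2]       # third pass, final result


def fused(a, b):
    """Evaluate sin(a[i] + exp(b[i])) in a single pass, with no temporaries."""
    return [math.sin(x + math.exp(y)) for x, y in zip(a, b)]


a = [0.0, 0.5, 1.0]
b = [0.0, 1.0, 2.0]
print(fused(a, b) == unfused(a, b))
```

Both versions perform the same floating-point operations in the same order per element; the fused one touches each input element once and allocates only the output.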

CyberDildonics 2944 days ago

I'm not sure how this makes sense. Tuning C++ for speed is mostly weeding out cache misses through memory access patterns. If memory is accessed linearly the prefetcher will get it ahead of time. If cache sizes are taken into account, you can not only cut down on memory latency, but memory bandwidth as well.
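
A toy model of why access order matters: count how often consecutive accesses land on a different cache line. The 64-byte line and 8-byte element sizes are assumptions, and this does not model cache capacity, associativity, or the prefetcher; it only shows how much less line traffic a linear walk generates over the same data.

```python
LINE_BYTES = 64  # assumed cache-line size
ELEM_BYTES = 8   # assumed element size (one double)


def line_switches(indices):
    """Count how often consecutive accesses fall on a different cache line."""
    switches = 0
    prev_line = None
    for i in indices:
        line = (i * ELEM_BYTES) // LINE_BYTES
        if line != prev_line:
            switches += 1
            prev_line = line
    return switches


rows, cols = 64, 64  # matrix stored flat in row-major order

# Linear walk: consecutive flat indices.
row_major = [r * cols + c for r in range(rows) for c in range(cols)]
# Strided walk: jump a full row (cols elements) between accesses.
col_major = [r * cols + c for c in range(cols) for r in range(rows)]

print("linear walk: ", line_switches(row_major), "line switches")
print("strided walk:", line_switches(col_major), "line switches")
```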
