by aschampion 2942 days ago
It's mainly targeting MATLAB, and to a slightly lesser extent scientific and numeric programming in Python and R. It's a well-thought-out language that lets you write MATLAB-like high-level code with an easy gradient for progressive typing and optimization to near-C-level performance. It's a bit harder to sell versus Python, since Python has enormous value in its ecosystem, community, and ubiquity. Also, because it primarily targets MATLAB, a lot of the standard libraries try to have similar ergonomics, which is a bit of a waste: using a great tool to recreate a poor interface.

It's definitely targeting Python and R to the same extent as MATLAB, in the sense that it claims to solve the two-language problem that is so apparent in these languages. MATLAB, Python and R are easy scripting languages, but as soon as you have to do heavy computations, you're forced to call C / Fortran libraries. Julia, on the other hand, is proof that we can have a high-level scripting language that runs as fast as C and Fortran. Combine this with Julia's generic programming and type system, and you can easily run your algorithm with floats, complex numbers, arbitrary precision, etc.
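To make the "same algorithm, many number types" point concrete, here is a rough Python analogue (not Julia, and not from the linked repo). The `horner` function and the sample polynomial are made up for illustration. It only shows the genericity; in Python every operation is still dynamically dispatched, whereas the claim above is that Julia compiles a fast specialized version for each number type.

```python
from decimal import Decimal
from fractions import Fraction


def horner(coeffs, x):
    """Evaluate a polynomial at x; coeffs run from highest to lowest degree."""
    result = coeffs[0]
    for c in coeffs[1:]:
        result = result * x + c
    return result


# x^2 - 3x + 2, evaluated with four different number types.
coeffs = [1, -3, 2]

as_float = horner(coeffs, 0.5)                # ordinary floating point
as_complex = horner(coeffs, 1j)               # complex numbers
as_fraction = horner(coeffs, Fraction(1, 3))  # exact rationals
as_decimal = horner(coeffs, Decimal("0.1"))   # decimal arithmetic

print(as_float, as_complex, as_fraction, as_decimal)
```

The function body never mentions a number type; it only relies on `*` and `+` being defined for whatever is passed in.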

Even when Julia wraps a library like TensorFlow, its API looks really nice compared to Python [1]:

  using TensorFlow

  sess = TensorFlow.Session()

  x = TensorFlow.constant(Float64[1,2])
  y = TensorFlow.Variable(Float64[3,4])
  z = TensorFlow.placeholder(Float64)

  w = exp(x + z + -y)

  run(sess, TensorFlow.global_variables_initializer())
  res = run(sess, w, Dict(z=>Float64[1,2]))
  Base.Test.@test res[1] ≈ exp(-1)
[1] https://github.com/malmaud/TensorFlow.jl
To compete with R it needs something like the tidyverse.
I agree - declarative in memory dataframe manipulation is extremely powerful. And the composability of plotting in the tidyverse is really nice as well.

It looks like there are the beginnings of both of these in Julia:

[0] http://gadflyjl.org/stable/

[1] https://github.com/JuliaStats/DataFramesMeta.jl

I often ask myself the question, how can Julia do things that R cannot do. After all, when something is good at doing something, why replace it?

Part of this is why we did JuliaDB: http://juliadb.org/ and continue to try to push the boundaries on parallelism, missing data, OnlineStats.jl and making data manipulation and modeling that much easier.

In some sense, it doesn't really need to compete with R. Many times it's better just to use the R, Python, Java, or C++ packages via RCall, PyCall, JavaCall, Cxx, or to use the built-in ccall to call libraries written in any number of languages that conform to the C ABI (C, Fortran, Rust, ...).

I've joked before about how there is no such thing as a "One Language To Replace Them All". However, I feel Julia is the best candidate for the "One Language To Rule Them All", since while it solves the "two-language" problem in many cases, you can also use it to bind code written in many languages together (hopefully in a bit nicer fashion than the "One Ring" bound the other rings and their users!)

Agree completely with this, tidyverse is what's keeping me in R when I'd prefer to mostly use Python.
Right now it’s a hard sell vs Python, but I can imagine Python running out of runway soon. A lot of its existing scientific and statistical computing stack is built around the assumption that you’ll be working with data that conveniently fits in memory. Once you’ve sized out of pandas/scipy/scikit, your next major option is Spark, which is certainly powerful, but is also unwieldy.

I could see something like Julia earning a lot of mindshare if it had a really polished solution for the space between, “my data is hundreds of megabytes”, and, “my data is hundreds of gigabytes”.

Speaking as someone who uses Python with half a terabyte of memory, I think you're underestimating how much memory these labs will use. In my experience most HPC architecture is optimized first by rewriting the code in the same (already fast) language or library, then by increasing hardware resources (especially among distributed nodes), then by seeking a new library in the same ecosystem, and finally by moving to a new language if they have to.

Moving to a new language has more friction than basically anything else unless there's a real language feature missing or the budget doesn't allow for more compute hardware. Hundreds of gigabytes is well below where academic and industry labs will start having to think about these problems. It's going to be really tough to displace Python with anything as general-purpose.

This is all to say that I buy that Julia can shine more than Python for I/O bound HPC, but it really shouldn't be I/O bound until you have terabytes of data (and likely tens of terabytes). And aside from that, the Python numerical computing ecosystem includes a lot more than just Numpy and Pandas. As other commenters have mentioned, you can use Dask if your hot data has grown into the terabyte range. Anaconda includes a lot of libraries which can bail you out of situations once you've left the familiar world of Pandas data frames.

It's not so much hard drive I/O as it is network I/O. There's the obvious waiting for non-parallel data, but also the time it takes to get something physically across a room from one CPU to another.

https://en.m.wikipedia.org/wiki/Gustafson%27s_law

For instance.

All high performance code is IO bound.

Even highly tuned C++ code spends most of its time on the CPU waiting on cache misses. It's a pretty exceptional use case where your code is not IO bound.

Wait what? I'm using "I/O bound" in the sense of CPU operations waiting on reads/writes to e.g. a disk. If you consider an operation waiting on the cache to be I/O bound, what do you consider to be CPU bound or cache bound? And what terminology would you use to refer to operations which are waiting on the disk as opposed to the cache? What about memory instead of the cache?

I think it's useful to differentiate between an operation which is purely CPU bound (i.e. it's just constantly calculating without a reference) and an operation which is cache bound (faster than memory but still bound over the CPU). But calling operations I/O bound when they're sufficiently optimized that they live in the cache and don't even hit memory, let alone disk, is an abuse of terminology. In the context of what I'm talking about, most HPC is absolutely not I/O bound unless it's using SATA/SAS drives instead of cache and memory.

And circling back to my original point, most research labs which can afford it will sufficiently optimize their code and hardware so that they don't hit the disk unless they're working with terabytes of data. Python, C++ and R provide numerous packages between the three of them for numerical computing across each of these bottlenecks, so I don't think Julia can rely on differentiating itself by shining in an I/O bound setting (i.e., waiting on disk). And if it does, "hundreds of gigabytes" isn't really the data size in which people are (in my opinion) going to overcome the friction of a new language and ecosystem just to harvest those benefits.

CPU bound in HPC circles usually means you are within an order of magnitude of the maximum FLOPs your CPU can do (e.g. GHz clock of the CPU * number of cores, with the operation count expressed in terms of GB/s of data). That's almost never the case.

Operations waiting on the disk are "disk bound"; "IO bound" is the generic term for data access, instead of CPU processing, being the bottleneck (which is usually the case, it's just a question of where).

I agree that Julia probably doesn't have a huge advantage. That said, having been stuck with slow Python code before, you often end up rewriting large parts of the system in C++ or another low-level language. That's the reality of Python, but at least it's not the most painful thing to do.

The old joke is "HPC is the art of taking a cpu-bound computation and making it I/O bound".
It's not an especially weird use of terminology. "I/O bound" means "waiting on reads and/or writes." Cache or RAM or disk, it's all about communication throughput and latency, efficient access patterns etc.

In a compute-bound workload the CPU spends the bulk of its time actually retiring instructions, not stalled waiting on data.

Think about it from the perspective of what the FPU sees -- once it has done that FMA operation, does it have the data it needs to do the next one, or does it need to sit on its hands for a while?

The cache hierarchy, cache-friendly data structures and algorithms -- they all aim to reduce time spent waiting on IO.

I understand that, but I don't think it's useful to use the term "I/O" in the theoretical sense of the concurrency problem. It's also not typical nomenclature - for example, see https://stackoverflow.com/questions/868568/what-do-the-terms.... We have terms like "CPU bound", "cache bound", and "memory bound", but we don't really have "disk bound" in common usage. This is because the common usage is "I/O bound".

Theoretically speaking we can model any process as one which has to wait and one which doesn't have to wait. But in modern usage we have a variety of types and speeds for reads and writes. When I/O simply means reads and writes, you lose all the practical granularity you'd otherwise get by decomposing the reads/writes into different bottlenecks. It's philosophically elegant, but practically unhelpful for optimizing HPC and distributed systems when, as the responder said, it's rare to be CPU bound.

I also think that the context of my original comment is pretty clearly using I/O in the modern sense of disk usage. Responding with a correction that everything is I/O bound is vacuous, not insightful.

Better cache access patterns are one of the reasons Julia's dot-broadcasting[1] is super cool. If you have big vectors `a` and `b`, the expression `sin.(a .+ exp.(b))` will do a single pass over your data, calling `sin(a[i]+exp(b[i]))` for each element, rather than creating big temporary arrays for the intermediate expressions and looping through multiple times.

Then because putting all those dots can be unwieldy, there's the `@.` macro which puts a dot on all your function calls.

[1]: https://julialang.org/blog/2017/01/moredots
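For anyone who doesn't read Julia, here is a rough plain-Python sketch of the difference being described. The function names `unfused` and `fused` are made up, and plain lists stand in for arrays; this illustrates the access pattern only and says nothing about how Julia implements broadcasting internally.

```python
import math


def unfused(a, b):
    """Three passes over the data and two temporary lists,
    like evaluating each vectorized call on its own."""
    tmp1 = [math.exp(y) for y in b]           # pass 1: exp(b)
    tmp2 = [x + t for x, t in zip(a, tmp1)]   # pass 2: a + exp(b)
    return [math.sin(t) for t in tmp2]        # pass 3: sin(...)


def fused(a, b):
    """One pass, no intermediate lists: sin(a[i] + exp(b[i])) per element."""
    return [math.sin(x + math.exp(y)) for x, y in zip(a, b)]


a = [0.0, 0.5, 1.0]
b = [0.0, 1.0, 2.0]

# Both versions compute the same values; only the memory traffic differs.
print(all(math.isclose(u, f) for u, f in zip(unfused(a, b), fused(a, b))))
```

With big vectors the unfused version has to write and re-read two full-size temporaries, which is exactly the extra memory traffic the fused loop avoids.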

I'm not sure how this makes sense. Tuning C++ for speed is mostly weeding out cache misses through memory access patterns. If memory is accessed linearly the prefetcher will get it ahead of time. If cache sizes are taken into account, you can not only cut down on memory latency, but memory bandwidth as well.
> Once you’ve sized out of pandas/scipy/scikit, your next major option is Spark, which is certainly powerful, but is also unwieldy.

There's also Dask [1], a native Python framework for distributed computations (by Anaconda). Irina Truong gave an excellent talk at PyCon 2018 about it [2]. I had never thought to look into Dask because Spark worked well for my use cases, but it has a lot of advantages over Spark (e.g. speed -- it's faster and more lightweight than PySpark and has no JVM serialization overhead) if you're using Python. Dask also runs on Kubernetes clusters, so scaling is not an issue.

And yeah, a huge amount of important data analysis work will continue to be done on data that fits in memory. Data analysis on distributed datasets is important, but from what I can tell, outside of certain domains it's certainly not the majority of the data analysis work out there.

[1] http://dask.pydata.org/en/latest/spark.html

[2] https://www.youtube.com/watch?v=X4YHGKj3V5M

Julia actually has that, though polishing is still ongoing. Check out JuliaDB [0]. It works well for noodling around in a REPL but can also smoothly deal with huge data and processing distributed across multiple computers, all while leveraging the native Julia ecosystem which is much nicer than numpy and with lower overhead.

[0] https://juliadb.org

Python does not make these assumptions. There are Python tools that exist to solve these problems that are as powerful as other languages' solutions. The two that I believe right now address these problems best are Dask and mpi4py. mpi4py can achieve very low latencies, but given that it's based on MPI it can be complex to use. Dask is the most user friendly and, for a Python user, is clearly easier to use than Spark. Paired with Numba you can get equivalent performance to distributed C programs.
Dask is another choice for distributed out-of-memory data structures but still within the Python ecosystem.
Language implementation-wise, can anyone explain why/how Julia is able to get close to C-level performance? Is it doing some extra steps under the hood (JIT compilation?) that Python and R aren't doing?
Julia's JIT compilation is rather different from what is referred to as JIT compilation in other languages, such as Java or JavaScript, where the language is interpreted (which may mean interpreting instructions for a virtual machine such as the JVM), and the run-time decides if some code is being hit frequently enough to warrant compilation to native code. Julia first compiles to an AST representation (also expanding macros, etc.), performs type inference, etc. When a method is called with types that haven't been used before to call that method, that's when Julia does its magic and compiles a version of that method specialized for those types, using LLVM to generate the final machine code (just like Clang for C and C++, as well as Rust and others). That also means that it's rare for Julia to have to dynamically dispatch methods based on the type of the arguments, which is one of the things that can really slow down other languages with dynamic types.
The easiest way to think about Julia's performance is closely related to the observations that inspire tracing JITs for many languages -- most code in dynamic languages doesn't make use of the features that make efficient compilation impossible. Julia's response to that observation was to build a dynamic language that lacked some of the most extreme features in Python or R that act as barriers to efficient compilation.
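A toy Python sketch of the "compile once per combination of argument types" idea, for illustration only. The `specialize` decorator, the `compiled` cache and `compile_log` are all made up; nothing is actually compiled here, the log entry just marks where type inference and LLVM code generation would happen in a real implementation.

```python
compiled = {}      # (function name, argument types) -> specialized callable
compile_log = []   # one entry per simulated "compilation"


def specialize(fn):
    """Cache one version of fn per tuple of argument types."""
    def dispatcher(*args):
        key = (fn.__name__, tuple(type(a) for a in args))
        if key not in compiled:
            # Stand-in for type inference + native code generation.
            compile_log.append(key)
            compiled[key] = fn
        return compiled[key](*args)
    return dispatcher


@specialize
def add(x, y):
    return x + y


add(1, 2)      # first call with (int, int): "compiles"
add(3, 4)      # same types: reuses the cached version
add(1.0, 2.0)  # new types (float, float): "compiles" again

print(compile_log)
```

The work happens on the first call with a new type signature, not after some call-count threshold, which is the contrast with hotspot-style JITs drawn above.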
It's also worth noting that Julia's JIT isn't tracing: it does all its compilation before the code is run (unless it hits a path which hasn't been run before, or wasn't inlined, in which case it runs the compiler again). I've heard it described as "really an ahead-of-time compiler that just runs really, really late".
BTW, is there a write-up of what those blocking features are? I don't recall ever seeing a blog about that. Could be an interesting "if you want to make a JIT-friendly language, don't do this, do this instead" type of article.
I agree that would be great. The crude answer is: make it easier for a computer to figure out what will happen when you run the code.

The example I usually use is allowing integers to overflow, instead of automatically promoting to arbitrary precision (Python), or converting to a sentinel value (R). Integers are used in a _lot_ of places, so inserting these checks (or worse, access to heap-allocated memory) makes it difficult to optimise. (Throwing an error might be a reasonable alternative in some cases.)

Another is that you make it easier for the compiler to figure out things about an object, such as its size (e.g. you can declare the types of the fields of a Julia struct) and whether or not it can be mutated (immutable objects are easier to optimise).
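To illustrate the integer overflow point from the Python side: Python integers silently grow past 64 bits, while a fixed-width machine integer wraps around. The `wrap_int64` helper below is made up and only emulates two's-complement wraparound in pure Python; it is a sketch of the behaviour described above, not a claim about any particular implementation.

```python
def wrap_int64(n):
    """Reduce n to the signed 64-bit range, two's-complement style."""
    n &= (1 << 64) - 1                            # keep the low 64 bits
    return n - (1 << 64) if n >= (1 << 63) else n


INT64_MAX = 2**63 - 1

promoted = INT64_MAX + 1             # Python: grows to arbitrary precision
wrapped = wrap_int64(INT64_MAX + 1)  # fixed width: wraps to the minimum value

print(promoted)
print(wrapped)
```

Supporting the first behaviour means every integer addition needs an overflow check and a possible heap allocation; the second is a single machine instruction.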

I wouldn't use that as a primary example (allowing integers to overflow), because one of the great things about Julia is that it is incredibly easy to define your own types that will simply work, for example types that do checked arithmetic on integers (SaferIntegers.jl, I think, is one) or that have no limit (BigInt, which is included in Julia). Julia gives the programmer the choice, and not only that, allows the programmer to create their own choices.
> The example I usually use is allowing integers to overflow, instead of automatically promoting to arbitrary precision (Python), or converting to a sentinel value (R).

IIRC Julia used to automatically promote integers, is this the main reason why this was dropped?

No, I don't believe integers ever promoted on overflow (or at least not since 2012).

If an operation involves two different integer types, they do promote to the larger one (e.g. an Int64 + a BigInt will give a BigInt).

One of them is being able to override `setattr` and `getattr` at runtime in Python. It can be pretty tricky to prove it can't happen, so (unless you have optimistic and pessimistic codepaths) you get into a situation where every attribute lookup makes indirect function calls and hash-table lookups.
Yes, it's JIT compiled.

And my (very crude) understanding is that the stronger type system makes this much easier than in Python. The compiled version of any function is specific to the types of its inputs, and thus need not contain any further checks: simple functions often end up with literally the same assembly as C would produce.

It's "extremely lazy ahead of time compiled", is one way I've described the compilation model, since you're basically never executing code in an interpreted fashion (usually JITs let you do either). Also, a typical JIT's choice of when to compile may be non-deterministic, or deterministic but difficult to understand. When Julia chooses to compile is pretty easy to understand.
I believe though that there is some work being done on actually directly interpreting the AST, in cases where going through all the work of generating LLVM IR and compiling that to native code is unnecessary, particularly when it is code that is only run once when a package is compiled the first time.
If anyone wants to do more research on this, the keyword is "monomorphization".
Perhaps the best way to understand what makes Julia fast is to watch these two videos about Python and R and what makes them so hard to optimize:

https://www.youtube.com/watch?v=qCGofLIzX6g

https://www.youtube.com/watch?v=HStF1RJOyxI

Take everything mentioned in these videos that make Python and R really hard to optimize and don't do those things :D

Yes, it's using JIT compilation (last I checked, they are using LLVM as the backend). That is combined with a language design that takes JIT compilation into account from the get-go, which makes the problem much easier than trying to add a JIT later on (see e.g. PyPy).
I often wonder what inspires folks to start from scratch in the face of a gigantic ecosystem like the one that Python brings with it, which will also keep improving.
In the case of Julia, you can check out the original motivation back in 2012 or this recent answer on the message board.

https://julialang.org/blog/2012/02/why-we-created-julia

https://discourse.julialang.org/t/julia-motivation-why-weren...

If you have some Matlab background, working with Python is frustrating. It is hard to explain, but vectors and matrices should be the primary concepts, with absolutely minimum extra glue needed.
That's definitely a thing. I like Pandas, but the syntax is a bit cumbersome compared to R or Julia.
I don't have much experience with Matlab, so forgive me if this is incorrect, but numpy should be able to do everything that matlab can at comparable speeds. No one performing matrix/vector-like operations is doing so with standard Python lists if numpy is available.
I'm not talking about execution speed but the human interface. The syntax of Python just is not nice and using libraries just adds more and more boilerplate.

I don't expect anyone who has not spent a lot of time with Matlab to "get" it.

Here's how I explain it to people who don't get it.

https://cheatsheets.quantecon.org/

Python works, but it's far from elegant in this domain.

I know what you mean. In MATLAB

  [A, B]
concatenates two matrices/vectors, whereas in Python it "wraps" them in an `n+1`-dimensional "matrix".

That said, you get most of what you want with libraries. In NumPy you won't write

  [aRow + bRow for aRow, bRow in zip(aMat, bMat)]
because you'll just call `numpy.concatenate`.
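Spelled out with plain nested lists (no NumPy import, to keep it self-contained; the variable names are made up), the difference looks like this:

```python
a = [[1, 2],
     [3, 4]]
b = [[5, 6],
     [7, 8]]

# Python's [a, b] wraps both matrices in a new outer list: shape 2 x 2 x 2.
nested = [a, b]

# What MATLAB's [A, B] means: glue the rows together side by side (2 x 4).
side_by_side = [a_row + b_row for a_row, b_row in zip(a, b)]

# What MATLAB's [A; B] means: stack the rows (4 x 2).
stacked = a + b

print(nested)
print(side_by_side)
print(stacked)
```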

Also, Python is mostly not used for mathematical work, and programmers tend to assume matrices are scary or only useful for mathematical work, so it has a "boring" syntax more suited for operating on single items at a time, with lots of loops.

> "The syntax of Python just is not nice and using libraries just adds more and more boilerplate."

I think that is a necessary trade-off for Python as a "general-purpose" programming language. I had used MATLAB and IDL quite intensively before I moved on to Python and R. When writing MATLAB, I felt like a scientist and did not have to bother with programming practices, like coding style, unit tests, writing functions instead of scripts, etc. But Python forces me to think like both a scientist and a programmer. (For example, every time you write `np.array([1, 2, 3])` instead of `[1, 2, 3]` it reminds you that array operation is not a free lunch offered by the language; it comes from the NumPy library. Also, it keeps the namespace pure.) I personally like this way better. But I also agree that not everyone likes it. (In my institution, researchers are kinda split half-and-half between Python and MATLAB.)

It's interesting because those were the reasons I heard from MATLAB users for switching to Python: moving to a language with a cleaner, less ad-hoc design and less boilerplate / copy-paste code made a big difference once you had more than a little code.

Has the language improved dramatically in the last few decades?

But numpy is the underlying library used by the rest of the SciPy stack, right? I use Pandas, so taking that as an example, it can be slow when you're doing stuff that isn't mostly leveraging Numpy. If I have to loop over a relatively large dataset to do some complicated row checking/filtering stuff, then it can be very slow, and I might as well take a coffee break. You can rewrite that into using just the numpy values array and it will be performant, but you lose all the nice Pandas features when you do that.

Also, if I'm just restricted to using Pandas on a laptop or small server instance, then loading in a several gigabyte csv file can really tax the memory.

Sometimes languages get stuck by their history, and (without really becoming a rather different incompatible language) the only way forward is to start from scratch. Also, Julia is good at letting you use those old ecosystems, C and Fortran libraries, Python, Java, R, all from the comfort of home (Julia).
Well, there are big areas where the ecosystem is quite underdeveloped, like differential equations, which have a lot of holes that need new algorithms and improvement. It would be extremely difficult to develop all of the necessary algorithms in C/C++/Fortran, and pretty much impossible in Python/MATLAB (I tried at first), but it's a breeze to tackle this in Julia. So for these kinds of scientific computing areas where there's tons of work with few people with the necessary expertise, Julia is a great way to start getting some good implementations out there for people to use.
Hubris. Which is one of the three great virtues of a programmer.
I'm pretty sure the other two were as well.
In part, the fact that there are major problems moving the "giant" to where one would like (speed), e.g., unladen swallow, pypy, pyston, ...