by short_sells_poo 1723 days ago
I run a hedge fund. On any given day I hear a large number of complaints from the technologists that complex python systems are difficult to look after and we should use something else instead. There's some Rust being used, but there's little chance to get a quant to use Rust to do research because research is an exploratory process and the last thing one wants is a language that requires a lot of thought about lifetimes etc.

How is the python-ocaml interop story? To be clear, any language that does not have first-class interop with python is basically dead in the water (at least for our case).

10 comments

Have you looked into Julia, Nim, Clojure, or even Common Lisp? I'm not sure about Python interop with CL, but Nim and Clojure seem to have some kind of beta-grade interop, and there's a solid interop story in Julia. And all of those languages have some of their own "native" data analysis and scientific computing toolkits (Julia having more than "some", of course).

That said, complicated Python systems can be improved a lot by adding type annotations. That's more of a solution for web servers and other "easily type-able" applications. Typing support for scientific computing isn't quite there yet. So it depends on what kinds of systems are the complicated ones.
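A minimal sketch of what annotations buy you in an "easily type-able" case. The function names and the price dictionary are hypothetical, made up for illustration. A checker such as mypy can flag the possibly-missing value before the code ever runs:

```python
from typing import Dict, Optional


def find_price(prices: Dict[str, float], ticker: str) -> Optional[float]:
    """Return the price for `ticker`, or None when the ticker is unknown."""
    return prices.get(ticker)


def position_value(prices: Dict[str, float], ticker: str, quantity: int) -> float:
    """Value of a position; raises KeyError for an unknown ticker."""
    price = find_price(prices, ticker)
    if price is None:
        # Without this check a type checker would reject `price * quantity`,
        # because `price` is Optional[float] and may be None.
        raise KeyError(ticker)
    return price * quantity
```

The annotations cost nothing at runtime; the value comes from running the checker over the codebase as part of the build.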

Thank you, we've dabbled with Julia and indeed it works very well. We are just a bit worried about betting the barn on it so to speak. It's still very niche and we are just not seeing the kind of meteoric rise that Rust is exhibiting for example. We would ideally not want to become the sole caretaker of some niche language. Jane Street can afford it with OCaml, but we can't :(

For that reason, Julia is being closely watched, but so far we are not thinking of pulling the trigger.

Link to the interop lib for Clojure you're referring to for people who don't know it: https://github.com/clj-python/libpython-clj

Really a remarkable feat of engineering. Here's its author giving a talk: https://www.youtube.com/watch?v=vQPW16_jixs

I was thinking of linking this as well. Clj-python is just such a fascinating junction. I don't care so much for Clojure, though I'm continuously impressed by the ecosystem and the productivity of its experts. Very cool stuff.

Core.logic and the like opened a huge door and similar ideas have exhausted my free time for several years now.

Julia's interop with Python is excellent: https://github.com/JuliaPy/PyCall.jl (also with R, see RCall.jl). It's just not statically typed, so the original problem is not solved, although Julia is the better language for scientific purposes.

CL could also be great language-wise (https://digikar99.github.io/py4cl2/, https://github.com/snmsts/burgled-batteries3) but I don't know how good the interop is in reality since I haven't tried it.

There is an actively developed python to ocaml interop library for purposes quite similar to yours. I have seen demos where ocaml and python are used within the same jupyter notebook

https://signalsandthreads.com/python-ocaml-and-machine-learn...

https://github.com/thierry-martinez/pyml

Thank you, I'll pass this on. An important feature is zero copy arrays, which seem to be supported.
I personally haven't used it but Jane Street heavily uses OCaml and has written a blog post on this: https://blog.janestreet.com/using-python-and-ocaml-in-the-sa...
I think Elixir would be interesting for your usecase.

It's a dynamic, garbage collected language. It's easy to pick up and get going with. As a functional programming language there isn't a lot to learn in the way of language constructs, and you don't even have to do the 'wrestling with the type system' thing that you have to do in compiled functional languages like OCaml or Haskell (or in Rust).

Its processing 'horsepower' is probably comparable to Python, but it's much better for building low latency things if you want to run something in a bit more of a production use case. This is also improving due to the recent addition of a JIT.

The addition of NX is making Elixir an increasingly interesting place to do ML - write Elixir, have it run on GPU etc. See https://dashbit.co/blog/nx-numerical-elixir-is-now-publicly-...

Python integration is probably best done using the Erlang 'port' system - running Python as a managed process and communicating with it using messages over stdin/stdout. I use it for C interop and it works well (and fits well with the Elixir/Erlang process model). It's not difficult to roll your own in Python e.g. https://github.com/fujimisakari/erlang-port-with-python/blob... or look at something like http://erlport.org/
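A rough sketch of what the Python side of such a port could look like. It assumes the Elixir side opens the port with the `:binary` and `{:packet, 4}` options, so every message is prefixed with a 4-byte big-endian length. The upper-casing "work" and the function names are placeholders, not anything from the linked projects:

```python
import struct
from typing import BinaryIO, Optional


def read_packet(stream: BinaryIO) -> Optional[bytes]:
    """Read one length-prefixed message; return None when the stream ends."""
    header = stream.read(4)
    if len(header) < 4:
        return None
    (length,) = struct.unpack(">I", header)  # 4-byte big-endian length
    payload = stream.read(length)
    if len(payload) < length:
        return None
    return payload


def write_packet(stream: BinaryIO, payload: bytes) -> None:
    """Write one message with the same 4-byte length prefix and flush it."""
    stream.write(struct.pack(">I", len(payload)) + payload)
    stream.flush()


def serve(inp: BinaryIO, out: BinaryIO) -> None:
    """Answer each request (here: upper-case it) until the port is closed."""
    while True:
        request = read_packet(inp)
        if request is None:
            break
        write_packet(out, request.upper())
```

In a real worker script you would call `serve(sys.stdin.buffer, sys.stdout.buffer)`. Note that every payload is copied through the pipe, which matters if you need to move large arrays.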

Thank you! So this looks interesting but it seems like there's no easy way to share numpy arrays?

The main use case for a language other than python is a more robust codebase but also performance. We need to be able to efficiently ship lots of large arrays between the languages and the Rust-Python interop supports zero copy arrays for example.
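To make "zero copy" concrete, here is a small standard-library illustration of the buffer protocol, the mechanism NumPy arrays also expose. It only shows the idea of two objects sharing one block of memory; it is not the Rust-Python interop itself, and the price values are made up:

```python
from array import array

# One contiguous buffer of doubles, standing in for a large price array.
prices = array("d", [101.5, 102.0, 99.75, 100.25])

# A memoryview exposes the same memory through the buffer protocol: no copy.
view = memoryview(prices)

# Slicing a memoryview is also zero-copy: `tail` aliases the last two items.
tail = view[2:]
tail[0] = 0.0  # write through the view...

# ...and the change is visible in the original array.
assert prices[2] == 0.0

# By contrast, converting to a list copies every element.
copied = list(prices)
copied[0] = -1.0
assert prices[0] == 101.5
```

An extension written in another language can receive such a buffer and read or write it in place, which is what makes shipping large arrays across the boundary cheap.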

Elixir and Rust are very good friends, so to speak. Writing a library in Rust that you can use from Elixir is only slightly worse than trivially easy.

But I agree somebody has to put in the work.

I've made a good career with Elixir but I still don't think it's a good fit for a hedge fund. IMO invest in Rust.

Ah, no. I'm sure that's buildable in Elixir using a NIF (a function built into the VM, in a similar manner to Python modules written in C) but you'd have to implement it; I'm not aware of anything out of the box.
Python type checking (type annotation, mypy) should at least partially solve the problem of maintaining complex Python systems. Though it doesn't help with performance.
The larger problem in my view is that big Python systems tend to follow OOP design since functional programming patterns do not work well in Python. So you start with something minimal and simple inside a script or notebook, but quickly it evolves into something more like a Java code-base.

Typing does help, agreed.

I strongly suggest the Attrs library for cutting down the boilerplate of making small "data classes": https://attrs.org/

With type annotations, you can move away from "inheritance OO" to logicless "data classes" and functions that operate on them.
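A minimal sketch of that style. It uses the standard-library `dataclasses` module rather than attrs so the snippet has no dependencies, and the `Trade` type and helper functions are hypothetical examples:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Trade:
    """Logic-free, immutable record: fields only, no behaviour."""
    ticker: str
    quantity: int
    price: float


def notional(trade: Trade) -> float:
    """Plain function operating on the record instead of a method."""
    return trade.quantity * trade.price


def with_price(trade: Trade, price: float) -> Trade:
    """Return an updated copy; the original record is left untouched."""
    return replace(trade, price=price)
```

Because the record is frozen, functions like `with_price` have to return new values, which keeps the data flow easy to follow and easy for a type checker to verify.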

Java has good to great FP, and making most objects immutable is also trivial. Mypy is good for getting strict typing enforced, but I still prefer the native and slightly wonky type system to a tacked-on one that builds on trust.
> Java has good to great FP

Java does absolutely not have "good to great FP" support. It's an imperative and OO programming language that recently got lambdas, no more, no less.

Yeah 100% - Java lacks many of the features required for serious FP and the ecosystem of libraries is heavily OOP too (although functional wrappers can usually be implemented).
There's sealed classes, records, pattern matching, optionals. It's getting there.
All good improvements, but Java is missing some really key pieces:

- pipe operator (or custom operators in general)

- do-notation

- tail call optimisation

- currying / partial application

- (better) type inference

- expression orientated

- less syntactic noise around function calls (fewer commas and parentheses)

- type-classes or even runtime generic information to work around that

Some of the things I’ve listed can never be added without dramatically changing what Java is. At that point it would be a new language.

Works up until every function has a

  def calc_xxx(df: pd.DataFrame) -> pd.DataFrame

type...
I would write principled Python with strict coding standards. Make type annotations mandatory and turn up pylint or flake8 to maximum warnings. It really helps avoid a bunch of silly mistakes, while still providing a way out for doing crazy stuff that Python is good at.
Some of the Python FFI tools are listed here: https://ocamlverse.github.io/content/ffi.html. But clicking through to GitHub, the repos haven't been updated in a while.
I wasn't counting changes to project metadata like gitignore
If your language doesn't worry about lifetimes, they don't go away. It just means you have to worry about them yourself instead.

Sometimes that is great. Other times, that will be very hard and error-prone.

When you are trying to solve a complex optimal payoff problem, you really don't want to get bogged down with lifetimes. That's a completely orthogonal concern to what you are trying to establish. You are not writing production code, you are doing research. It's the core reason why languages with easy REPL and immediate feedback (like matlab, R, python, julia, etc...) are used for research, because you get immediate and interactive feedback. The keyword is interactive.

Once you have to think of types and lifetimes, a lot of the productivity goes down the drain.

99% of the stuff you do in research ends up on the cutting room floor because it doesn't work. The 1% that ends up being useful is the only part worth productionizing.

Most of my day-to-day coding is in TypeScript and I often find myself wondering if more than a few jobs wouldn't be easier and faster with plain JS and no need to feed the type checker. Many of the same thoughts about you vs the language tracking things apply here too.

In my case, I'd say that yes, I have to track the types myself, but the tradeoff is at least sometimes worth the extra mental overhead in my opinion. I'd say that the same can be true for lifetimes as well.

In your case, you will definitely be tracking lifetimes on some level and to say otherwise is going to be false (even GC'd languages must track lifetimes to ensure garbage is eliminated). The question is about the mental tradeoff vs the time taken. I'd guess that you are correct in your assessment. My only real point is that there is a cost that should be considered.

You are correct, lifetimes are being tracked in some form at all times. So let me rephrase: lifetimes are irrelevant to the primary work of a quant during research. The objective is to establish whether an idea works. In 99% of the cases, it doesn't and the code and the project are a dead end. Under these circumstances you need to reduce the amount of cognitive overhead that goes into program structure to a minimum.

Research is fundamentally different from the usual programming exercise. Research is like prospecting. You want to try as many different locations as possible. You don't want to build nuclear shelter grade construction at every potential site because 99 out of 100 are duds. You'd never find anything.

You want a tool that allows you to get quick results to confirm if a site has potential and then you want to be able to scale your tools for proper mining.

An extremely simple thing like having two objects stored in a struct where one object has a reference to the other is a Herculean task in Rust. This is not a language designed for prototyping...
This is perhaps where I get hate from both sides, but I think such cases are best done in unsafe code blocks.

A tool that makes something much harder without any meaningful gain should be avoided. Rust provides the tools to not have to fight the system and they should be used in these situations.

That's what my solution is. I am not going to get into pins and get cargo crates to solve such a simple problem. I just resort to unsafe. But then at that point...C++ is easier for me.
You can easily avoid such constructs in most cases though.
I've allocated both these things on the heap and Rust simply won't let me store them both together. I don't know about you but this is an extremely common pattern in almost all languages, you just don't think about it in the gc languages and in C++ storing pointers is no issue. The popularity of crates like rental also shows that it's not as easily avoidable as you suggest.
Can you post a simple code example? It is hard to imagine what the difficulty is.
You can read through this SO question and the top response does a good job explaining the possible solutions: https://stackoverflow.com/questions/32300132/why-cant-i-stor...
It's not the language, it's the people. Instagram is almost entirely run on Python; if they can, so can you. https://instagram-engineering.com/tagged/python
This is a terribly misinformed take. If you throw enough resources at Python then sure, you can probably get adequate throughput. The problem is that in finance a lot of problems require you to think about latency, which is a total non-starter for Python.
If it were a total non-starter, then why would their entire company be using it?
It's not clear which company you're referring to, but

- Instagram was at one point a startup. It's common for startups to write in a scripting language to optimize for speed of adding features. Then if you grow or get acquired, the scripting language often eventually gets replaced by a language more optimized for maintenance cost, safety, and/or speed.

- Data scientists often only really know scripting languages, and at any rate scripting languages are useful for prototyping algorithms that need to change daily. Hence a lot of hedge funds use Python. For code that is stable and really matters for performance purposes, it's common for funds to use C/C++ or even FPGAs.

I have no doubts that we are able to handle our requirements with python, but if there's a better way, does it not make sense to investigate?

A skilled carpenter can undoubtedly use a hammer instead of a screwdriver. This doesn't mean that I should insist they use a hammer when a screwdriver would do a better job.

Sounds like they're already using Rust where performance matters and are looking to switch
Instagram has much lower latency requirements…
I believe they’re referring to the hedge fund. Lower latency requirement is a confusing way of phrasing it as lower latency is a higher bar. But generally most hedge funds do not need to operate at the speed Instagram does.
So Instagram runs on Python? That only proves that the Instagram team can build Instagram in Python. How does that help me with my technology choices?
> last thing one wants is a language that requires a lot of thought about lifetimes etc.

I challenge you: A lack of understanding about the data lifetimes in a program means lack of understanding about the data.

Not saying you can't have a lot of short-lived data items that you don't want to manage one-by-one. I'm saying that for the vast majority of data items, one should be able to give a reasonably well defined lifetime upper bound. So a good solution is to make a few boxes that group items by lifetime. And from time to time, throw the outdated boxes away.

And of the few items that don't have such an upper bound at creation time, many can be created in a special box that allows migrating boxes later when required.
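A toy sketch of the "boxes" idea, with hypothetical names throughout. In a garbage-collected language like Python, discarding a box only drops references; in a language with manual memory management the same structure would free the whole box in one step:

```python
from typing import Any, List


class Box:
    """Holds items that share one lifetime and are released together."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._items: List[Any] = []

    def put(self, item: Any) -> Any:
        """Register an item with this box and hand it back to the caller."""
        self._items.append(item)
        return item

    def migrate(self, item: Any, target: "Box") -> None:
        """Move an item whose lifetime turned out to be longer than expected."""
        self._items = [x for x in self._items if x is not item]
        target.put(item)

    def discard(self) -> int:
        """Drop every item at once; returns how many were released."""
        count = len(self._items)
        self._items.clear()
        return count

    def __len__(self) -> int:
        return len(self._items)
```

For example, a `request` box could be discarded after each unit of work, while anything that must outlive it is first migrated into a longer-lived `session` box.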

> A lack of understanding about the data lifetimes in a program means lack of understanding about the data.

But this argument can extend forever.

Is your program precisely dependently typed? If not is that a lack of understanding about the nature of the data as well and should you challenge yourself to fix that?

You have to trade-off how much you specify things with how valuable it is to get the result more quickly.

What you say is true. I only brought up "boxes" because the concept is still not widely known.
You don't have to challenge the person you're responding to. You have to challenge their quants. And they're not going to want to add that into the million other things they're thinking about while doing research in a Jupyter notebook or something.

You're just not going to get this buy-in from people who want to use a tool to get their work done.

Thanks, but I think we may be talking at cross purposes here. 99% of the research code ends up being thrown away (well, archived). Not because it's bad code necessarily, but because the idea that was being prototyped is a dead end. This means it's paramount that the language you use is as low friction and interactive as possible.

Imagine you are trying to establish whether there's a relationship between timeseries X and timeseries Y. You just want a tool that allows you to quickly calculate some summary statistics of these timeseries, clean them, convince yourself that they behave according to your expectations and then run some form of regression.
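A bare-bones sketch of that workflow using only the standard library, so it runs anywhere. In practice you would reach for pandas or NumPy, and the helper names here are made up for illustration:

```python
import math
from statistics import fmean
from typing import List, Sequence, Tuple


def clean(xs: Sequence[float], ys: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Drop observation pairs where either value is missing (NaN)."""
    pairs = [(x, y) for x, y in zip(xs, ys)
             if not (math.isnan(x) or math.isnan(y))]
    return [p[0] for p in pairs], [p[1] for p in pairs]


def summary(values: Sequence[float]) -> Tuple[float, float, float]:
    """Return (min, mean, max) as a quick sanity check on a series."""
    return min(values), fmean(values), max(values)


def ols(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx
```

Typical interactive use would be `x, y = clean(raw_x, raw_y)`, a glance at `summary(x)` and `summary(y)`, then `ols(x, y)`, with nothing in the loop that asks you to think about memory.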

Nowhere in this process do you care about lifetimes. It's literally irrelevant. In fact, as long as all your work fits into memory, you don't even care about memory management. Your objective is to answer the primary question, everything else is a costly distraction.

The 1% of ideas that ends up being worthwhile is what gets productionized and needs to be robust. But obviously rewriting everything from language A to radically different language B adds its own headaches.