by thehappyfellow 529 days ago
How come e.g. Jane Street uses it so much? It’s the second most common type of test I write.

Jane Street uses OCaml, and property-based tests are easiest when dealing with pure functions and are usually taught in FP classes, so I assume it’s that. Easier to set up, and the right target audience.

Edit: also a numerical domain, which is the easiest type to use them for in my experience!
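
To make the "pure functions in a numerical domain" point concrete, here is a minimal sketch of a property-based test. It is hand-rolled with Python's standard `random` module rather than a real PBT library, and the names `mean`, `check_property`, and `gen_ints` are hypothetical, chosen only for illustration. A real library would add shrinking and smarter generators on top of this idea.

```python
import random

def mean(xs):
    """Pure function under test: arithmetic mean of a non-empty list."""
    return sum(xs) / len(xs)

def check_property(prop, gen, trials=1000, seed=0):
    """Run `prop` on `trials` random inputs.

    Returns the first counterexample found, or None if every trial passed.
    """
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        case = gen(rng)
        if not prop(case):
            return case
    return None

def gen_ints(rng):
    """Generator (an assumption of this sketch): short lists of small integers."""
    return [rng.randint(-1000, 1000) for _ in range(rng.randint(1, 20))]

def mean_within_bounds(xs):
    """Property: the mean never falls outside the smallest and largest element."""
    return min(xs) <= mean(xs) <= max(xs)

def sort_is_idempotent(xs):
    """Property: sorting an already sorted list changes nothing."""
    return sorted(sorted(xs)) == sorted(xs)

if __name__ == "__main__":
    for prop in (mean_within_bounds, sort_is_idempotent):
        print(prop.__name__, "counterexample:", check_property(prop, gen_ints))
```

Because `mean` has no state and no I/O, the whole test is one generator plus one boolean function, which is roughly why this style is cheap for numerical, purely functional code.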

The same reason Google burns $50M+ in electricity each year using protobufs instead of a more efficient format. An individual company having specific needs isn't at odds with a general statement being broadly true.
How’s that comparable at all? There are no network effects from writing property-based tests; people use them if they are helpful, i.e. if they test enough of the code with a reasonable amount of effort. Nobody’s forcing people to write tests, unlike Google, which forces usage of protobuf on all projects there.
It's comparable in the way described in sentence #2:

> An individual company having specific needs isn't at odds with a general statement being broadly true.

Google needs certain things more than reduced carbon emissions, and Jane Street needs certain things more than whatever else they could spend that dev time on.

Fine but cutting the thought process at "it depends" is not a great way to understand what's happening here. You can explain anything happening at any company by saying "they need certain things more than whatever else they could spend that time on".

Why is PBT useful at Jane Street, at least more than in other places? Is it the use of functional language? Average Jane Street dev being more familiar with PBT? Is the domain particularly suited to this style of testing?

Explicitly, my claim is that the biggest bottleneck is education on how to use PBT effectively, and that Jane Street is not using it to go the extra mile on safety; they use it because it's the easiest way to write a large chunk of the tests.

>Why is PBT useful at Jane Street, at least more than in other places?

Because trading firms write a lot more algorithmic code than most businesses. Trading strategy code is intensely algorithmic and calculation heavy by its very nature as is a lot of the support code written around it.

At least, that's what it was like when I worked in a trading firm. Relatedly, it was one of the few projects I'd worked on where having 95% unit tests and 5% integration tests made perfect sense. It fitted the nature of the code, which wasn't typical of most businesses.

Somebody else wrote that they wrote a lot of numerical code in another business for which property testing is extremely useful and again, I don't doubt that either. 95% is still != 100% though.

Not to derail but what’s more efficient in your view? We compared messagepack, standard http/json and protobufs for an internal service and protobufs came out tops on every measure we had.
The gold standard is a purpose-built protocol for each message, usually coming in ~20x faster and ~2-8x smaller than a comparable proto (it's perhaps obvious why Google doesn't do this, since the developer workload is increased for every message even in a single language, and it's linear in the number of languages you support, without the ability to shove most of the bugginess questions to a single shared library, and backwards compatibility is complicated with custom protocols -- they really do want you to be able to link against most g3 code without interop concerns). I've had a lot of success in my career with custom protocols in performance-sensitive applications, and I wouldn't hesitate to do it again.

Barring that though, capnproto and flatbuffers (perhaps with compression on slow networks) are usually faster than protos. Other people have observed that performance deficit on many occasions and made smaller moderately general-purpose libraries before too (like SBE). They all have their own flavors of warts, but they're all often much faster for normal use cases than protos.

As a hybrid, each project defining its own (de)serializer library can work well too. I've done that a few times, and it's pretty easy to squeeze out 10x-20x throughput for the serialization features your project actually needs while still only writing the serialization crap once and reusing it for all your data types.

Recapping on a few reasons why protos are slow:

- There's a data dependency built into the wire format which is very hard to work around. It blocks nearly all attempts at CPU pipelining and vectorization.

- Lengths are prefixed (and the data is variable-length), requiring (recursively) you to serialize a submessage before serializing its header -- either requiring copies or undersized syscalls.

- Fields are allowed to appear in any order, preventing any sort of code which might make the branch predictor happy.

- Some non-"zero-copy" protocols are still quite fast since you can get away with a single allocation. Since several decisions make walking the structure slow, that's way more expensive than it should be for protos, requiring either multiple (slow) walks or recursive allocations.

- The complexity of the format opens up protos to user error. Nonsense like using a 10-byte slow-to-decode-varint for the constant -1 instead of either 1, 4, or 8 fast-to-decode bytes (which _are_ supported by the wire format, but in the wild I see a lot of poorly suited proto specs).

- The premise in the protocol that you'll decode the entire type exactly as the proto defines prevents a lot of downstream optimizations. If you want a shared data language (the `.proto` file), you have to modify that language to enforce, e.g., non-nullability constraints (you'd prefer to quickly short-circuit those as parse errors, but instead you need extra runtime logic to parse the parsed proto). You start having to trade off reusability for performance.
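
A few of the points above (the 10-byte varint for -1, the length prefix, and the byte-by-byte dependency when decoding) can be illustrated with a minimal sketch of the varint and length-delimited encodings. This is a hand-written illustration under stated assumptions, not the protobuf library: the helper names are hypothetical, and only the base-128 varint, ZigZag, and "tag, length, payload" rules are taken from the wire format.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF

def encode_varint(n: int) -> bytes:
    """Base-128 varint of an unsigned integer, least-significant group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_int64(n: int) -> bytes:
    """int64-style field: negatives go out as 64-bit two's complement."""
    return encode_varint(n & MASK64)

def encode_sint64(n: int) -> bytes:
    """sint64-style field: ZigZag maps small negatives to small unsigned values."""
    return encode_varint(((n << 1) ^ (n >> 63)) & MASK64)

def decode_varint(buf: bytes, pos: int = 0):
    """Return (value, next_pos).

    Each byte has to be inspected before the position of the next field is
    known, which is the serial data dependency mentioned above.
    """
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def length_delimited(field_number: int, payload: bytes) -> bytes:
    """Tag, then byte length, then payload.

    The payload must already be serialized (or at least measured) before
    its header can be written.
    """
    tag = (field_number << 3) | 2  # wire type 2 = length-delimited
    return encode_varint(tag) + encode_varint(len(payload)) + payload

if __name__ == "__main__":
    print(len(encode_int64(-1)))         # 10 bytes as a plain varint
    print(len(encode_sint64(-1)))        # 1 byte with ZigZag
    print(len(struct.pack("<i", -1)))    # 4 bytes as a fixed 32-bit value
    print(len(struct.pack("<q", -1)))    # 8 bytes as a fixed 64-bit value
```

The sizes line up with the "10 bytes instead of 1, 4, or 8" complaint: the cost comes from the field type chosen in the schema, not from the value itself.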

And so on. It's an elegant format that solves some real problems, but there are precious few cases where it's a top contender for performance (those cases tend to look like bulk data in some primitive type protos handle well, as opposed to arbitrary nesting of 1000 unrelated fields).

Specific languages might have (of course) failed to optimize other options so much that protos still win. It sounds like you're using golang, which I've not done much with (coming from other languages, I'm mildly surprised that messagepack didn't win any of your measurements), and by all means you should choose tools based on the data you have. My complaints are all about what the CPU is capable of for a given protocol, and how optimization looks from a systems language perspective.

What does a 'purpose-built protocol for each message' look like? You avoid type/tagging overhead, but other than that I'd expect a "sufficiently smart" generic protocol to be able to achieve the same level of e.g. data layout optimization. Obviously ProtoBuf in particular is pessimising for the reasons you describe, but I'm thinking of other protocols (e.g. Flatbuffers, Cap'n Proto, etc.)
The problem is that "sufficiently smart" does a lot of heavy lifting.

One way to look at the problem is to go build a sufficiently smart generic protocol and write down everything that's challenging to support in v1. You have tradeoffs between size (slow for slow networks), data dependencies (slow for modern CPUs), lane segmentation (parallel processing vs cache-friendly single-core access vs code complexity), forward/backward compatibility, how much validation the protocol should do, .... Any specific data serialization problem usually has some outside knowledge you can use to remove or simplify a few of those "requirements," and knowledge of the surrounding system can further guide you to have efficient data representations on _both_ sides of the transfer. Code that's less general-purpose tends to have more opportunities for being small, fast, and explainable.

A common source of inefficiencies (protobuf is not unique in this) is the use of a schema language in any capacity as a blunt weapon to bludgeon the m x n problem between producers and consumers. The coding pattern of generating generic producers/consumers doesn't allow for fine-tuning of any producer/consumer pair.

Picking on flatbuffers as an example (I _like_ the project, but I'll ignore that sentiment for the moment), the vtable approach is smart and flexible, but it's poorly suited (compared to a full "parse" step) to data you intend to access frequently, especially when doing narrow operations. It's an overhead (one that reduces the ability for the CPU to pipeline your operations) you incur precisely by trying to define a generic format which many people can produce and consume, especially when the tech that produces that generic format is itself generic (operating on any valid schema file). Fully generic code is hard enough to make correct, much less fast, so in the aim of correctness and maintainability you usually compromise on speed somewhere.

For that (slightly vague) flatbuffers example, the "purpose-built protocol" could be as simple as almost anything else with a proper parse step. That might even be cap'n proto, though that also has problems in certain kinds of nested/repeated structures because of its arena allocation strategy (better than protobuf, but still more allocations and wasted space than you'd like).
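
As a rough picture of a purpose-built format with a proper parse step, here is a minimal sketch using Python's standard `struct` module. The `Tick` message, its three fields, and the non-zero quantity rule are all hypothetical, invented for illustration. The sketch assumes both sides agree on a fixed little-endian layout, so there are no tags, no varints, and no length prefixes, and validation happens once during parsing.

```python
import struct
from typing import List, NamedTuple

class Tick(NamedTuple):
    """Hypothetical message: every field is required and fixed-width."""
    instrument_id: int  # uint32
    quantity: int       # uint32, assumed to be non-zero in valid data
    price: float        # float64

# "<" means little-endian with no padding: 4 + 4 + 8 = 16 bytes per message.
_LAYOUT = struct.Struct("<IId")

def encode(tick: Tick) -> bytes:
    """Serialize one tick into exactly 16 bytes."""
    return _LAYOUT.pack(tick.instrument_id, tick.quantity, tick.price)

def _checked(fields) -> Tick:
    """Parse step: constraints are rejected here rather than downstream."""
    tick = Tick(*fields)
    if tick.quantity == 0:
        raise ValueError("quantity must be non-zero")
    return tick

def decode(buf: bytes) -> Tick:
    """Parse one tick; the size is known up front, so a bad length fails fast."""
    if len(buf) != _LAYOUT.size:
        raise ValueError(f"expected {_LAYOUT.size} bytes, got {len(buf)}")
    return _checked(_LAYOUT.unpack(buf))

def decode_many(buf: bytes) -> List[Tick]:
    """Parse a packed batch; record i always starts at offset 16 * i."""
    return [_checked(fields) for fields in _LAYOUT.iter_unpack(buf)]

if __name__ == "__main__":
    tick = Tick(instrument_id=7, quantity=300, price=101.25)
    wire = encode(tick)
    print(len(wire), decode(wire) == tick)
```

The tradeoff is the one described earlier in the thread: the layout has no room for schema evolution, and every producer/consumer pair has to agree on it out of band.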

Just because a company uses something doesn't mean all companies should. May as well use monorepos in that case
Trading companies are unusual in writing a lot of algo-heavy code. Did you assume every company was like this?

I can assure you they aren't.

Even trading companies have a ton of projects and code which you'll find at any reasonably sized tech company; the algo-heavy code is a small fraction of the total code they write. In this sense, they are not such an outlier just based on the business they are in - I think the use of a functional language, good tooling and education around PBT are much more important factors.
>the algo-heavy code is a small fraction of the total code they write

Wasn't the case in the trading firm I worked at.

Do you work at Jane Street? Have you worked elsewhere?