Hacker News new | ask | show | jobs
by glangdale 2674 days ago
One of the two authors here. Happy to answer questions.

The intent was to open things but not publicize them at this stage but Hacker News seems to find stuff. Wouldn't surprise me if plenty of folks follow Daniel Lemire on Github as his stuff is always interesting.

11 comments

I see that you are using MMX intrinsics directly, like _mm_sub_pi8, but you are never calling _mm_empty (https://software.intel.com/sites/landingpage/IntrinsicsGuide...) as required by the SysV AMD64 ABI (and pretty much all other ABIs out there).

I think the behavior of all the code that touches is undefined (it breaks the calling convention of the ABI), and while this often results in corrupted floating point values in registers, maybe you won't see much if you are not using the FPU. Still, since the function is inline, chances that this gets inlined somewhere where it could cause trouble seem high.

You might want to look into that.

Also, I wish this would all be written in Rust, there is great portable SIMD support over there. Might make your life easier trying to target other platforms.

EDIT: as burntsushi mentions below, that's not available in stable Rust, but if you want to squeeze out the last once of performance out of the Rust compiler, chances are you won't be using that anyways.

I would be extremely surprised if we were somehow accidentally using MMX; it's not our intention. It is my belief that we are using only AVX2, which, like the 19-year old SSE/SSE2 extension, has its own registers that are independent of the x87 floating point set.

If, once you review our codebase and verify that we are not inadvertently using a 22-year-old SIMD extension but still have undefined behavior, please write an issue on github.

I'm admiring Rust from a distance at this stage. I am comfortable enough with writing bare intrinsics and slapping a giant #ifdef around stuff.

> there is great portable SIMD support over there

It's not stable yet. The only stable SIMD stuff Rust supports is access to the raw x86 vendor intrinsics.

If they want to squeeze out the last ounce of performance out of the Rust toolchain it probably wouldn't make sense to use stable Rust anyways, so I don't think that's a big downside.

Also, they are already relying on "unstable" (non-standard conforming) C++ features (e.g. the code uses non-standard attributes behind macros, etc.). Using nightly Rust isn't worse than that per se.

Using Rust does have downsides. For the type of code they are writing, the main downside would probably be losing an alternative GCC backend, which might or might not be better than LLVM for their application.

Still, they would win portable SIMD and being able to target not only x86_64 but also ARM, Power, RISCV, WASM, etc., which is always cool to show in research papers.

I'm not suggesting that Rust is a perfect trade-off, only that it's an interesting one depending on what they want to do.

Sure. I'm just trying to be careful that we aren't going around advertising features that aren't stable yet without specifically saying that they aren't stable. It leads to a disappointing expectations mismatch.

I do think stable Rust is perfectly capable though. I don't generally target nightly Rust and am happy with how much I can squeeze out of it. :-) (Check out the benchmarks for the memchr crate, which use SIMD internally and should be competitive with glibc's x86_64 implementation that's in Assembly.)

Most programming languages don't have a stable / unstable distinction at all, and unless one is "in the Rust loop", stating something like "_unstable_ Rust can do X" won't probably mean what the reader think it means.

Unstable Rust sounds very dangerous, like something that breaks every day. Definitely more dangerous than stable Rust.

Yet if one is in the Rust loop, one knows that this is often not the case. I've been using some unstable features on nightly, like const fn, specialization, function traits, etc. for years (3 years?), and I've never had a CI build job fail due to a change to the implementation of these features.

Yet some features in stable Rust like Rust2018 uniform_paths or stable SIMD have caused many build job breaks and undefined behavior due to bugs in the compiler over the last months.

So whatever stability means, it does not mean "using this feature won't result in your code not breaking". It also doesn't mean "you have to use a nightly toolchain to use the feature".

An unstable Rust feature is more like a "compiler extension" in C / C++. It is just something that hasn't fully gone through the process of standardization.

I don't think it is a fair characterization that code that uses this extensions is not Rust. Pretty much all C++ code uses compiler extensions, and nobody says that this code is not C++ just because it uses one of them.

Explaining all of this when telling someone "Rust is a technology that allows you to solve problem X nicely" isn't helpful.

Many people vocal about Rust seem to think that Rust is the end in of itself. The goal isn't solving a problem, but using Rust to solve it. I see many of these people argue that unstable Rust isn't Rust, and that people should be using stable Rust etc. For most people, using Rust isn't the goal, solving their problem is. Whether one or many compiler extensions have to be enabled for that is pretty much irrelevant to them. Sure it would be nice if one didn't need to do that, but it isn't a big deal either. The big embedded community is living proof of that. Only a small minority of this community cares about the language enough to participate in its evolution. Most people don't care enough about that, they have more interesting problems to solve.

You happened to pick some features where there hasn’t been much development work, since other things were prioritized. And one feature that’s not mostly compiler internal.

This is not the general case for unstable features. And promoting the use of them too heavily can cause a lot of problems. It undermines trust in the language, especially given rust’s pre-1.0 reputation (which was well deserved at the time.)

Stuff that’s unstable isn’t in Rust; that’s why it can be changed or even wholesale removed at any time. The distinction is very important.

I'm writing an IoT library for devices with tiny microprocessors and have been sending data as JSON or BSON (binary JSON). On the backend, I've been storing reports from IoT devices into a database (MariaDB on AWS). How crazy would it be to just store all the data as JSON files on disk (or S3 bucket) and then batch process them when I need to perform data analysis on them? If a million devices sends dozens of status reports per day, that's going to be a crapton on files... but that might be faster to process than querying the database.

If you or anyone else has some opinions on this, please let me know! I'd really like to learn how people do this type of analysis at scale.

reading lots of small files on s3 or local filesystems is tricky. a million devices with one dozen files, so lets say 12 million files.

One thing locally is each file takes up a full block. So even if you only need 500 bytes of data in a file, and a block is 4kb, youve wasted 3.5kb of space and IO. Multiply that by a million and youre wasting gigabytes of space.

In S3, listing 12 million files takes 12 thousand http(max return is 1000 items). So that would take two minutes if you assume its 10ms per round trip. Let's say you wanted to read each file, and again each read takes 10ms.. youre looking at 1.4 days. Obviously this can be parallelized, but when you look at the raw byte size this is a huge overhead, and this is just to read one day of data.

If you concatenate the files together to get a reasonable size and number of files, raw json on s3 is really powerful. Point athena at it, and you just write sql and it handles the rest, and is serverless. But it does make single row lookups more expensive(supplementing with dynamodb could keep it serverless if single row lookups are frequent).

lots of optimizations will get improvements, like parquet that tobilg mentioned(binary format and columnar), but anything with a decent file size will work.

Yeah, this is what Kinesis Firehose is for. Send all of your messages there and it will batch them to S3.
You may enjoy this:

The best way to not lose messages is to minimize the work done by your log receiver. So we did. It receives the uploaded log file chunk and appends it to a file, and that's it. The "file" is actually in a cloud storage system that's more-or-less like S3. When I explained this to someone, they asked why we didn't put it in a Bigtable-like thing or some other database, because isn't a filesystem kinda cheesy? No, it's not cheesy, it's simple. Simple things don't break.

https://apenwarr.ca/log/20190216

We‘re using AWS Kinesis delivery streams to batch incoming JSON messages from IoT devices to Parquet files in S3. Those can directly be read by different AWS services like Redshift, EMR or Athena...
We use Athena for all our robotics data, which we ETL into JSON. It's fantastic for queries that are simple time-slice queries, as most are because sensor data is inherently time-series. When more complicated joins are necessary, the performance is there across terabytes, and the cost is very very low, $5 per terabyte scanned (storage costs are another thing).
What bothers me about Kinesis is that it is prohibitively expensive at scale if you don't compress your data before putting it to Kinesis.

But if you want to use the nice features like parquet conversion your data can't be compressed.

If it could handle compressed data at the same price I would use a lot more of it.

You’ve kinda just described AWS Athena.
This comment needs to be higher up; Amazon has a service for doing just this, dumping 'dumb' files (like json, csv, etc) into S3 buckets and performing SQL queries on them. No need to have to think about how to store things for future querying.
I've used Athena really effectively to solve similar problems. If your data storage is relatively small and/or your queries relatively infrequent, JSON can be a good fit. As one of those dimensions expands, you can decrease costs/increase performance by converting to Parquet and compressing.
I am replying to you as an engineer at an IoT company that provides SaaS in AWS for the data our devices produce. To solve this problem, we transmit our data in a proprietary "raw" binary format that then gets parsed into a protobuf. All data for a given UTC day is appended to this protobuf file and hosted in S3. Retrieving data requires downloading the protobuf file from S3, unmarshalling the protobuf, and finding the entry you care about.
If you are considering using plain files instead of a DB server, you could try a compromise and use an embedded key-value store like RocksDB, LevelDB, BadgerDB etc.

It's local storage only, limited query capabilities depending on the DB, but should be extremely fast.

Why not use a timeseries database, like http://btrdb.io?
well if you need indexed lookups, then use a database

if you're doing "table scan" processing of entire datasets, sure just-a-bunch-of-files would work too.

Databases can be surprisingly fast for things like that, since high performance file i/o is full of tricky/annoying stuff that databases have already optimized for.

Depending on your size / budget / needs Snowflake may interest you. https://www.snowflake.com/product/architecture/.

I haven't used it but have been given a presentation by them on it, and it was very very good.

They store data in S3 and use FoundationDB for indexes. You can feed it JSON and it'll index it and let you query it on a massive scale shockingly fast.

Obviously they are not aimed at small hobby projects but if your project has money / serious product depending on your needs it's well worth looking at.

On the S3 cheaper / smaller end you can batch up data daily / weekly etc. So the landing bucket acts as a queue that gets processed creating daily batch files from the small files aggregated together. You can then take the daily batches to create weekly batches etc etc, essentially partitioning. This will reduce the total number of files needed to query. If you use deterministic names based on how you plan to query this can also reduce the number of files you need to list / parse. When batching / re-partitioning the data you can also use the Apache Parquet format to compress a little better + also import in some of the querying tools out there.

I've written my fare share of performant code over the years, but this is some next level shit. I've been reading it the last few hours. The only question I have is what is the term for that place considered two degrees past black magic? Since you live there, I have to assume you know the name.
It's not magic. The things that enable writing this kind of code are essentially practice and specialization. Most people have to write code that works all all architectures and where performance is probably less critical than having a simple, workable codebase - so the opportunities to practice writing SIMD code are rare under those constraints.

Unfortunately, the fragmentation of SIMD standards and various pitfalls in implementation (the much ballyhoo'ed "running AVX will make your processor clock to half its speed or something" exaggerations, for example) make a lot of people nervous about putting in the time to commit to developing expertise, which is a shame.

Not really a question, but if you ever get to the point of wondering what a good next challenging project would be, consider generalizing some of these techniques into a next generation Yacc / Bison replacement.

Something that can take generic grammer rules and turn it into a high performance parsing engine.

It wouldn't have to support every possible grammar or option. Json isn't that complex of a language, but even a limited set of grammar options in exchange for a performant parser could be of benefit for a very large set of problems.

It's on the list as a research project. It's not obvious to me at this stage that the bottlenecks for more advanced parsers are necessarily going to be in the same place as they are for JSON. It might make more sense to look at a state-of-the-art parser and see if we can contribute a few tricks instead.
That sounds interesting. Where is the best place to follow your future work? Your & Daniel Lemire's Github, or elsewhere?
I might go so far as to post to branchfree.org, and Daniel posts at https://lemire.me/blog/ so either of those, plus github, ought to cover it.
I'm just starting to look at Tree-sitter; that might qualify as a state of the art parser that could use a few tricks.
Oh now that would be an interesting tool
Any chance to have a similar thing for s-expressions? I parse GBs of them and Common Lisp reader is very slow.
Probably not too hard. It would come down to how easy it is to detect quoting conventions so you don't accidentally parse () chars in strings. JSON is medium-easy. I don't know where the canonical definition of s-expressions you're using comes from (is it just Common Lisp?) so I don't know how this works.

We'd like to have some more examples of formats people care about - I'm interested in generalizing this work. So if you want to followup with more detail please do.

As a clojure user, I care about EDN, but its probably too niche to spend your time on.

https://github.com/edn-format/edn

Yes!!! A generalization for other kinds of simple grammars would be awesome.

On another note. As a js programmer who deals with a ton of json, I would love v8 to adopt some of the tricks into their json parser.

Any technical blog articles you have that explain how you were able to ascertain these incredible performance gains?

Kudos on some incredible work! :)

Thank you. More description of the work will be forthcoming but please be patient (for non-sinister reasons).
Jsmn it's already pretty fast and simple. How the hell can be a lot faster than that? I'm very curious.
The big difference between RapidJson and sajson is surprising to me. When I benchmarked them their performance was comparable: https://github.com/project-gemmi/benchmarking-json . Did you use RapidJson in full-precision mode?

By the way, nativejson-benchmark (from RapidJson) has a nice conformance checker that tries various corner cases. But you probably know it.

More performance details beyond what's on the site will follow (in a while).

We use RapidJSON in the high-performance mode not the funky mode that minimizes FP error (which is some astounding work - I had no idea that strtof was so involved!). Number conversion is not our #1 focus - doing it well is nice, but all implementations have access to the same FP tricks, so you don't really learn much by going wild on this aspect.

At least, you don't unless FP conversion is your focus, in which case you should share your FP conversion code with everyone!

You should take a look at std::from_chars IIRC it can completely destroy other parsers within the stdlib because it's not intended to take locale into consideration.

https://en.cppreference.com/w/cpp/utility/from_chars

I recently saw people using gpu to parse csv files. there are also other articles on using gpu to parse json. do you think if gpu can perform well on this type of tasks?
I'm not aware of an article that covers a actual implementation or that has a benchmark of performance. As for GPGPU: it's possible. Our first stage of matching is very parallel. But Amdahl's law would, of course, suggest that the serial parsing step would dominate.

I'm interested in this: some aspects of our very serial 'stage 2' (the parsing step) could be made parallel. This would be very interesting. Unfortunately I personally cannot be made parallel, so working on this needs to go into a big queue with a lot of other work.

How hard would it be to extend the parser to handle arbitrary-precision numbers? Strictly speaking the JSON spec does not require numbers to fit into 64-bit ints / doubles.
Daniel Lemire did most of the work on the number handling, but our general approach was to try to do work that's similar to what the bulk of other libraries do. I believe pretty much everyone throws oversize numbers on the floor.

I don't think it would be hard at all; it would just be extra effort that wasn't needed to run obvious comparisons.

Jackson benchmarks? I've heard it's twice as fast as rapidjson.
Why did you decide to write this? What was the motivation?
Honestly? I was trolled into it. :-) Unemployed people do weird things.

I can't speak for Daniel's motivation.

Any plan for wrapper for android?
This would imply an ARM port, I guess, as x86 android isn't much of a thing anymore AFAIK.

I don't think either of us know much about android - not enough to do that. But an ARM port is very interesting.

Since I'm no longer an Intel employee I don't see why I shouldn't skill up and do a Neon port (I got interested in SVE, but since ARM doesn't seem to want to bother releasing cores that run SVE, I'm not going to go too far down that path right now). Neon, on the other hand, is in tons of places. As far as I know all the required permutes, carryless multiplies and various other SIMD bits and pieces are there on Neon. So it's a simple matter of porting.