by Twinklebear 2254 days ago
SIMD is used a ton in rendering applications and starting to see more use in games too (through ISPC for example).

I'd add to the list:

- Embree: https://www.embree.org/ Open source high-performance ray tracing kernels for CPUs using SIMD.

- OpenVKL: https://www.openvkl.org/ Similar to Embree (high-performance ray tracing kernels), but for volume traversal and sampling.

- ISPC: https://ispc.github.io/ an open source compiler for a SPMD language which compiles it to efficient SIMD code

- OSPRay: http://www.ospray.org/ A large project using SIMD throughout (via ISPC) for real time ray tracing for scientific visualization and physically based rendering.

- Open Image Denoise: https://openimagedenoise.github.io/ An open-source image denoiser using SIMD (via ISPC) for some image processing and denoising.

- (my own project) ChameleonRT: https://github.com/Twinklebear/ChameleonRT has an Embree + ISPC backend, using Embree for SIMD ray traversal and ISPC for vectorizing the rest of the path tracer (shading, texture sampling).


> starting to see more use in games

Starting to see? Back in Ye Olde 586 Days of the late 1990s, MMX was added to the Pentium architecture pretty much exclusively for 3D games and real-time audio/video decoding. (This was back when the act of playing an MP3 was no small chore for the average consumer CPU.) Intel made quite a big deal over MMX including millions of dollars in TV ads aimed at the general population, despite the fact that software had to be built specifically to use MMX and that only certain kinds of software could benefit from it.

MMX had nothing to do with games! It was a part of Intel _marketing scam_. https://news.ycombinator.com/item?id=19468837 :

"MMX was useless for games. MMX is Integer math only, good for DSP, things like audio filters, or making a softmodem out of your sound card. Unsuitable for accelerating 3D games. Whats worse MMX has no dedicated registers, and instead reuses/shares FPU ones, this means you cant use MMX and FPU (all 3D code pre Direct3D 7 Hardware T&L) at the same time. ... Funnily enough AMDs 1998 3DNow! did actually add floating point support to MMX and was useful for 3D acceleration until hardware T&L came along 2 years later.

Intel Paid few dev houses to release make believe MMX enhancements, like POD (1997)

https://www.mobygames.com/images/covers/l/51358-pod-windows-...

1/6 of box covered with Intel MMX advertising while game used it only for some sound effects. Intel repeated this trick in 99 while introducing Pentium 3 with SSE. Intel commissioned Rage Software to build a demo piece showcasing P3 during Comdex Fall. It worked .. by cheating with graphic details ;-) Quoting hardware.fr "But looking closely at the demo, we notice - as you can see on the screenshots - that the SSE version is less detailed than the non-SSE version (see the ground). Intel would you try to roll the journalists in the flour?". Of course Anandtech used this never released publicly cheating demo pretending to be a game in all of their Pentium 3 tests for over a year.

https://www.vogons.org/viewtopic.php?f=46&t=65247&start=20#p... "

MMX was one of Intel's many Native Signal Processing (NSP) initiatives. They had plenty of ideas for making PCs dependent on Intel hardware, something Nvidia is really good at these days (physx, cuda, hairworks, gameworks). Thankfully Microsoft was quick to kill their other fancy plans https://www.theregister.co.uk/1998/11/11/microsoft_said_drop... Microsoft did the same thing to Creative with Vista killing DirectAudio, out of fear that one company was gripping positional audio monopoly on their platform.

Indeed it's been a workhorse since forever. All the consoles from PS2 forward have included some form of SIMD and it is used extensively.

Here's a GDC 2015 article about SIMD at Insomniac Games. https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afr...

> ISPC: https://ispc.github.io/ an open source compiler for a SPMD language which compiles it to efficient SIMD code

I've been learning ISPC lately and it does seem like a wonderful solution: you avoid having to build separate implementations for every instruction set and/or worrying about per-compiler massaging to get it to recognise the vectorisation opportunities. The arguments for having a domain-specific language variant, and the account of why it was written (https://pharr.org/matt/blog/2018/04/30/ispc-all.html is a good read), seem persuasive.

However, outside of the projects in the above list - it doesn't seem to have very wide usage. There are still commits coming in/responding to some issues so it doesn't seem dead, but there are many issues untouched or just untriaged. There isn't much discussion about using it, or people asking for advice. The mailing list has about a message a month.

Is it merely just an extremely highly specialised domain? Is it just that CUDA/OpenCL is a more efficient solution for most cases where one would otherwise consider it? Are there too many ASM/intrinsic experts out there to bother learning?
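For readers who have not run into the "per-compiler massaging" mentioned above, here is a minimal C sketch of the kind of loop one typically hopes a compiler will auto-vectorize. The function name `saxpy` and its signature are illustrative assumptions, not taken from any project in this thread, and whether SIMD instructions are actually emitted depends on the compiler, flags, and target.

```c
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] for i in [0, n).
 * `restrict` promises the compiler that x and y do not overlap, which removes
 * one common obstacle to auto-vectorization. Whether SIMD code is actually
 * emitted still depends on the compiler, its flags, and the target CPU. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Small changes to a loop like this (possible aliasing, an early exit, a function call in the body) can silently stop vectorization, which is the fragility an SPMD language such as ISPC is meant to avoid.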

ISPC is really awesome, but you're right it is much less known than CUDA/OpenCL. Part of that might just be lack of marketing effort and focus (you don't hear much about it compared to e.g. CUDA) and the team working on it is far smaller than that on CUDA. There has been some wider adoption, like Unreal Engine 4 using it now: https://devmesh.intel.com/projects/intel-ispc-in-unreal-engi... which is super cool, so hopefully we'll see more of that.

As far as support from other languages I did write this wrapper for using ISPC from Rust https://github.com/Twinklebear/ispc-rs (but that's just me again), and there has been work on a WebASM+SIMD backend which is really exciting. Intel does also have an ISPC based texture compressor (https://github.com/GameTechDev/ISPCTextureCompressor) which I think does have some popularity.

However, the domain is pretty specialized, and I think the fraction of people who really care about CPU performance and are willing to port or write part of their code in another language is smaller still. It's also possible that a lot of those who would do so have their own hand written intrinsics wrappers already. Migrating to ISPC would reduce a lot of maintenance effort on such projects, but when they already have momentum in the other direction it can be harder to switch. I think that on the CPU ISPC is easier and better than OpenCL for performance and tight integration with the "host" language, since you can directly share pointers and even call back and forth between the "host" and "kernel".

ISPC’s creator Matt Pharr works at NVIDIA; he has a series of blog posts explaining the history of ISPC.
At work, I had a project involving a DSL for Monte Carlo simulations. The DSL was an internal DSL in Scala, our interpreter was in Scala, and we transpiled to ISPC (for servers/VMs that didn't expose a GPU) and OpenCL.

I generally liked ISPC, but I really didn't like that it tried to look as close as possible to C but departed from C in unnecessary ways. With Monte Carlo simulations, we deal with a lot of probabilities represented as doubles in the range [0.0, 1.0]. The biggest pain is that operations between a double and any integral type cast the double to the integral type, whereas in C, the integral type gets implicitly converted to a double. I understand the implicit casting rules were changed to give the fastest speed rather than minimize worst-case rounding error. I could understand getting rid of implicit casts, or maybe I could understand changing rules to improve accuracy and know that the user could easily use a profiler to discover any performance problems this caused. However, in our case, uint32_t * double = (uint32_t) 0, which then would get implicitly cast back to a double if being assigned to a variable. My intern was beating his head against the wall for the better part of an afternoon before I gave him a bit of debugging help. All of his probabilities were coming out 0% and 100% for his component.

I actually emailed the authors with a bug report when I found the implicit casting rules differed so radically from C and were in the direction away from accuracy. (Note there's no rounding error when converting uint32_t to a 64-bit IEEE-754 double.) They were very nice, and pointed us to where this behavior was documented.

If you're going out of your way to make your language look like C and interoperate seamlessly with C, you should have really strong justifications for the places where you radically depart from C's semantics.
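To make the difference concrete, here is a small C sketch. The first function relies on standard C's usual arithmetic conversions; the second only emulates, with an explicit cast, the behaviour reported in the comment above. Both function names are hypothetical, and the second function is an illustration written in C, not ISPC code.

```c
#include <stdint.h>

/* Standard C: the usual arithmetic conversions convert `count` to double
 * before multiplying, so the fractional part of `p` is preserved. */
double scale_c_rules(uint32_t count, double p)
{
    return count * p;
}

/* Hypothetical emulation of the behaviour reported in the comment above:
 * the double is converted to the integer type first, so any probability
 * below 1.0 truncates to 0 before the multiplication happens. */
double scale_truncating(uint32_t count, double p)
{
    return (double)(count * (uint32_t)p);
}
```

With a probability of 0.25 and a count of 10, the first function yields 2.5 and the second yields 0, which matches the "everything is 0% or 100%" symptom described.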

> However, outside of the projects in the above list - it doesn't seem to have very wide usage.

ISPC is pretty popular in the HPC world.

Is it? I haven't heard about it actually being popular anywhere. It definitely works well, but I haven't seen it talked about much except in the case of embree, Intel's ray tracing library. It doesn't seem like there is any funding for it, though it actually works so well already it doesn't seem to need big leaps in progress to be valuable.
> Is it? I haven't heard about it actually being popular anywhere.

I know 3 simulators running on supercomputers in the neurosciences domains that use it + some graph processing over supercomputers tools.

It is true that it is not extremely well known, but it is used.

That's great, but you said it was popular in the HPC world. I would love that to be true, but I don't know of a way to see the big picture.
Adding nnn: https://github.com/jarun/nnn A terminal file manager.

It takes advantage of SIMD at -O3 level of optimization in its custom string copy function: https://github.com/jarun/nnn/blob/bc7a81921ed974a408d4de2cbf...

The function is used extensively in the program.

I don’t see why you’d write one of these yourself when the one in the standard library is probably vectorized too, and better…
Not necessarily. There are implementations which don't even take advantage of 4/8 byte copying. We wanted to have something uniform. But yes, you are right with glibc or macOS.

Also, from the strncpy man page:

   strlcpy()
       Some systems (the BSDs, Solaris, and others) provide the following function:

           size_t strlcpy(char *dest, const char *src, size_t size);

       This function is similar to strncpy(), but it copies at most size-1 bytes to dest, always adds  a
       terminating  null  byte,  and  does  not pad the target with (further) null bytes.  This function
       fixes some of the problems of strcpy() and strncpy(), but the caller must still handle the possi‐
       bility of data loss if size is too small.  The return value of the function is the length of src,
       which allows truncation to be easily detected: if the return value is greater than  or  equal  to
       size,  truncation  occurred.  If loss of data matters, the caller must either check the arguments
       before the call, or test the function return value.  strlcpy() is not present in glibc and is not
       standardized by POSIX, but is available on Linux via the libbsd library.
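The semantics in the excerpt can be sketched in portable C roughly as follows. `sketch_strlcpy` and `copy_was_truncated` are hypothetical names chosen for this illustration; this is not the nnn function or any libc's implementation, just the behaviour the man page describes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper with strlcpy()-like semantics, as the man page excerpt
 * describes them: copy at most size-1 bytes, always NUL-terminate when
 * size > 0, and return strlen(src) so callers can detect truncation. */
size_t sketch_strlcpy(char *dest, const char *src, size_t size)
{
    size_t src_len = strlen(src);

    if (size > 0) {
        size_t n = (src_len < size - 1) ? src_len : size - 1;
        memcpy(dest, src, n);
        dest[n] = '\0';
    }
    return src_len;
}

/* Returns 1 if the copy was truncated, 0 otherwise. */
int copy_was_truncated(char *dest, const char *src, size_t size)
{
    return sketch_strlcpy(dest, src, size) >= size;
}
```

Copying "hello" into a 4-byte buffer leaves "hel" in the buffer and returns 5, so the caller sees that 5 >= 4 and knows data was lost.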
Why not call strncpy or memcpy rather than exhibiting undefined behavior?
> Why not call strncpy

Read the excerpt.

> undefined

Nothing's _undefined_ there.

I'm not sure which specific excerpt you're referring to, but I have a good idea of the many functions that libraries have come up with to sling characters from one buffer to another, plus I read your implementation and the man page snippet you linked above. I'm still not seeing why you can't replace the code between lines 881 and 902 with one of the appropriate copying routines; you quite literally have a source, destination, and length and you can fix up the last NUL byte right after the call. The standard library's function will be vectorized regardless of how your compiler was feeling that day, and it's probably smarter than yours (glibc, for example, does a "small copy" up to alignment before it launches into the vectorized stuff, rather than skipping it entirely if the buffers aren't aligned). And your function does have undefined behavior: you pun a char * to a ulong *.
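On the type-punning point: if a hand-written word-at-a-time copy is wanted anyway, one way to keep it well defined is to move each word through `memcpy()` instead of casting `char *` to `unsigned long *`. The sketch below is an assumption-laden illustration (the name `copy_words` is made up, and it is not the nnn code); in practice a single `memcpy(dest, src, n)` does the same job, as the comment above suggests.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical word-at-a-time copy that avoids casting char * to
 * unsigned long *. Each word goes through memcpy() into a local variable,
 * which is well defined regardless of alignment or aliasing; compilers
 * commonly lower these fixed-size memcpy() calls to plain loads and stores. */
void copy_words(char *dest, const char *src, size_t n)
{
    size_t i = 0;

    for (; i + sizeof(unsigned long) <= n; i += sizeof(unsigned long)) {
        unsigned long word;
        memcpy(&word, src + i, sizeof word);
        memcpy(dest + i, &word, sizeof word);
    }
    for (; i < n; ++i) {   /* remaining tail bytes */
        dest[i] = src[i];
    }
}
```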
Hi, thank you for the pointers!

I try not to include C or C++ projects other than for educational purposes (like the Mandelbrot set) because one of my life's goals is to help the world transition to a C & C++ free world (other than for kernels...).

I believe that my role is to promote projects which are "building the new world", and thus we need to abandon or port all forms of insecure code.

So in an article about high/extreme performance systems, you're ignoring the vast majority of them because you don't agree with the tool used to achieve said performance? What..?
I guess because using other programming languages proves the point that there are other approaches, instead of reinforcing the status quo.
Exactly this
I believe that performance is irrelevant without correctness and security.

My opinion is that C and C++ can't bring enough security and correctness guarantees for mere mortals (lack of tooling, language features...).

Yes some correct and secure programs are written in C and C++ but it's not and will never be the norm.

For God's sake, please stop putting C and C++ in the same basket when talking about security.

It just shows you do not know what you are talking about.

Most security problems affecting C programs DO NOT affect C++ programs.

Stack smashing, VLA abuse, string null-termination problems, goto-based error control, and double-free corruption do NOT affect C++; they are C specific.

Unfortunately they surely do, because a large set of developers writes C++ code full of C idioms.

Which is why Google has thrown in the towel and Android 11 will require hardware memory tagging for native code, and now everything is compiled with FORTIFY enabled.

Also Microsoft research shows otherwise, https://msrc-blog.microsoft.com/2019/07/16/a-proactive-appro...

> ~70% of the vulnerabilities Microsoft assigns a CVE each year continue to be memory safety issues

So yeah, you are correct that C++ does offer the tools not to write C like security holes.

Now you just need to convince a large spectrum of companies to actually stop doing C idioms while writing C++ code.

> Unfortunately they surely do, because a large set of developers writes C++ code full of C idioms.

That's another problem, not technical but educational. A lot of (older) programmers came to C++ by way of C and continue to write C in C++.

That needs time, education, and guidelines to change... a lot of time.

Changing mindset and programmer education is sometimes harder than changing the program itself.

> Now you just need to convince a large spectrum of companies to actually stop doing C idioms while writing C++ code.

That is already ongoing. However, do not forget that C++ has a baggage of 25 years of pre-C++11 code to upgrade before arriving there.

While I mostly agree, plenty of companies aren't going to change their coding and outsourcing practices until it hurts their bottom line.
C++ is too large to avoid shooting yourself in the foot (or your users') in one way or another.
This argument has been debunked 20 times already.
And 20 times more in security reports from Microsoft, Google and Apple.
Contempt for C and an interest in high performance: a unique mix.
Back in the 80's C was anything but high performance.

Only with people willing to challenge the status quo do we move forward.

It was always higher performance than e.g. Pascal or Basic on any relevant platform (the cost was lack of error checking, e.g. array bounds).

And it was slower than FORTRAN on most 32-bit platforms such as DEC, Sun and IBM Unix workstations, VAXen and mainframes - but it was still the speed king on the most prevalent platform of the time, 8086/80286 and friends.

Only as an urban myth scattered around by the C crowd.

As a user of all Borland products until they changed to Inprise, I can say it was definitely not the case. Pascal and Basic compilers provided enough customization points.

When one of them wasn't fast enough versus Assembly, none of them were.

I used to have fun showing C dudes in demoscene parties how to optimize code.

Now, if you are speaking about the dying days of MS-DOS, when everyone was jumping into 32 bit extenders with Watcom C++, then we are already in another chapter of 16 bit compiler history.

I used TP from 3.0 to 7.0 and a little bit of Delphi 1, and contemporary Turbo C; I dropped to assembly often, dropped TP bound checking often, and was well aware of all these controls.

Parsing with a *ptr++ in TC was not matched by TP until IIRC v7; 16 bit watcom often produced way better code than either TP or TC.

And, as you say, indeed when speed was really needed, you dropped to assembly; no compiler at the time would properly generate “lodsb” inside a loop, although watcom did in its late win3 target days IIRC.

This doesn't line up with the reality that almost all games were written in combinations of C and asm.
In my opinion, we should instead focus on hardware and experiment more with different kinds of cpus, memory, co-processors etc. The key to newer software systems are newer kinds of hardware, for which you can write newer experimental systems in the language of your own designs.

The sky is the limit, and there is so much to do! Transactional memory, massively multicore computers, hardware built on predicate logic, neuromorphic computers, and whatnot.

We are still mostly stuck with the cpu and memory designs of old.

Some of the most secure software is written in C.

The language matters less than you’d think once you get past a certain correctness baseline.

I have no doubt that secure software can be written in C, but it's not the norm and it's too easy to introduce vulnerabilities in C for mere mortals.
I mean, have you even _seen_ the trail of CVEs that Java has left in its wake over the past few decades?