These have existed for a long time. They used to be called sound cards, and maybe still are; I've been calling them audio interfaces for the last decade. Getting what you need in this space is an exercise in defining your actual requirements, because there is no standard for what's important and what's not. A good one can do wonders for latency when you're doing a lot of DSP.
That's not what the parent is asking for (especially given the context). Sound cards don't perform "general purpose DSP" for "Max/MSP, etc"; they focus on input, output, and closely related tasks.
I think the answer would essentially have to be a full-fledged computer running something like the AudioGridder server. Except ironically that still wouldn't help you run generic "Max/MSP, etc", it would only run the plugins supported by AudioGridder (VST3/AU). The scope would necessarily be limited by software, because that software usually expects to be running on your CPU.
There are a couple of audio interfaces that have the potential for this. UA interfaces have DSP chips that can run (sort of) general purpose stuff, which could be leveraged by Max/MSP if the SDK were open (it may be; I haven't looked). Also, RME interfaces often have an FPGA in them, which I've often thought would be a useful co-processor for audio, but I'm pretty sure those aren't user-programmable either.
Basically, the hardware has been around for ages, but the software is nonexistent or limited because the vendors haven't fully realised the potential of the hardware, and reverse engineering this stuff is really hard!
> that software usually expects to be running on your CPU
It's implied that software would have to be rewritten to support this new device, like how all graphics software was rewritten to run on GPUs when they first appeared.
Can you describe a little better what this box's purpose would be (the scenarios most relevant to it becoming a commercial success)? A cursory glance at the real-time audio part of Max/MSP (and the Pd (Pure Data) fork/clone) suggests that specialized audio DSP hardware would not be beneficial compared to using a CPU, distributing the work over multiple cores, or potentially even using a GPU.
You should be able to use seL4 as a hypervisor and stuff a GNU/Linux system inside. The actual low-latency work would be done by native seL4 processes. seL4 is proven to have hard latency bounds, which makes it suitable for hard-realtime applications (except that modern x86_64 CPUs have special interrupts that can't be disabled, which can introduce latency spikes of potentially unbounded duration). The HFT community has found ways around those issues, however. It wouldn't be good enough to control a manned aircraft, but for entertainment-related audio it should easily be good enough (those spikes are around a millisecond or so, iirc).
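To put rough numbers on "good enough": an audio callback has to finish before the next buffer is due, so the deadline is just buffer size divided by sample rate. Here's a back-of-the-envelope sketch, not a measurement. The 48 kHz rate, the 50% DSP load, and the function names are all assumptions for illustration, and the 1 ms spike is only the "iirc" figure from above.

```python
def buffer_period_ms(frames: int, sample_rate_hz: int) -> float:
    """Time available to compute one audio buffer, in milliseconds."""
    return 1000.0 * frames / sample_rate_hz


def survives_spike(frames: int, sample_rate_hz: int,
                   dsp_ms: float, spike_ms: float = 1.0) -> bool:
    """True if the DSP work plus one latency spike still meets the deadline.

    spike_ms defaults to 1.0, an assumed worst case, not a measured value.
    """
    return dsp_ms + spike_ms <= buffer_period_ms(frames, sample_rate_hz)


if __name__ == "__main__":
    sample_rate = 48_000  # assumed sample rate
    for frames in (32, 64, 128, 256):
        period = buffer_period_ms(frames, sample_rate)
        # Assume the DSP graph uses half of each buffer period.
        ok = survives_spike(frames, sample_rate, dsp_ms=0.5 * period)
        print(f"{frames:4d} frames: period {period:.2f} ms, "
              f"survives a 1 ms spike: {ok}")
```

Under those assumptions a spike of that size only causes a dropout at very small buffer sizes (32 or 64 frames here); at 128 frames and up there's enough slack to absorb it.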