by seminatl 2399 days ago
Why do all the proposed avenues of future investigation, and all of the current comments on this thread, focus on voodoo instead of the far more likely explanation that the display driver is just stomping on the memory of the network interface? If there's software anywhere in a system, 99% of the time that's the problem.
16 comments

This is not true when radios are involved. In my experience, wireless connectivity issues are rarely caused by software; the problem is much more often caused by interference.

The interference can be internal to the device, or it can come from other wireless devices. In many cases, the culprits are even devices that shouldn't emit RF at all, like power supplies, switches, light bulbs...

Another common issue is poor antenna design (e.g. attenuation when you hold the device, or strong directionality of an antenna that should not be directional).

And, last but not least, physical obstacles. Most people understand that concrete walls with rebar will block signal, but a surprisingly large number of people try to use aluminum stands or cases for devices with wireless radios.

All those factors will cause connection issues, and they are really common because debugging them is so hard (who has a spectrum analyzer at home? How do you find out which one of dozens of electronic devices is emitting RF that it shouldn't?)

In addition, the linked forum thread includes a user describing how high resolutions break 2.4GHz networks for them, but 5GHz networks work fine. The display driver is stomping on memory responsible for 2.4GHz, but not 5GHz? I'm really not seeing that as the more likely problem here.
5GHz WiFi has more bandwidth than 2.4GHz, so typically will involve larger IO buffers in the driver, which could easily be enough to expose a memory scribbler (I imagine there's a bunch of other features that are enabled/disabled by the frequency band switch too). However, I think asdfasgasdgasdg's answer is the correct reason not to suspect a memory scribbler, i.e. a memory scribbler would cause the driver to crash/fail and the kernel would log a message.
Remember the Pi has an odd architecture and all the IO passes through the GPU. The GPU doesn't log human-readable messages anywhere. There's a good chance the GPU did log a crash or failure, but only Broadcom's engineers can see it.
Er, really, wow. I didn't know that. Can you point me at some more info for that? Surely the GPIOs don't go via the GPU.
It's a BCM2711, and the datasheet is NDA only - typical Broadcom!

The VideoCore (Broadcom's GPU) is the main processor on the thing, and the cluster of ARM cores that runs Linux is more of a coprocessor which can only see some of RAM.

But 5 GHz doesn't fail, only 2.4 GHz does.
This is exactly abainbridge's point.
How do you mean?

> 5GHz WiFi has more bandwidth than 2.4GHz, so typically will involve larger IO buffers in the driver, which could easily be enough to expose a memory scribbler

He's saying 5 GHz will expose the scribbler, and the opposite is happening, only 2.4 GHz fails.

@StavrosK Thanks for wading in in my defence, but I had actually misunderstood the situation :-)

Although, if my theory that the IO buffers are different sizes is true, then that could perturb memory layout enough to expose/hide the bug in either direction.

So the display driver is meant to be mutating memory also owned by the network controller, but not in a way that causes a crash, log messages, or a kernel panic? That doesn't seem so likely to me. I mean it's not impossible but it's rare to see memory corruption/interference cause a clean breakage like this. In my experience it usually causes things to become extremely funky for a short while, then a crash.
Every SoC I've dealt with containing a WiFi core has a dedicated coprocessor (RPU is a common name, depending on vendor) running its own firmware. So more likely, _that_ core would go funky, then crash. The kernel might have code to recover that, but I doubt it, and it certainly would complain the whole way as you say.
In the Pi, the coprocessor is the GPU, and it is the first to initialize on boot and runs all the firmware-like stuff and handles all IO and does memory allocations/mappings.
It's disrespectful towards the art of Voodoo to call it "RF".
It's also disrespectful to the black art of RF to call it mere Voodoo :-)
Because what if it's not? My first thought is that the HDMI is radiating and interfering with the wifi antenna.

As an embedded engineer, it was a hard lesson for me to learn that not all issues are software issues and the hardware may need to be investigated. This is especially true where there is different behaviour between units. You can't just assume that your 99% estimation (plucked out of thin air) is correct and discredit other potential explanations.

> Because what if it's not?

Then, after you're done ruling out the most likely and easiest explanation to test, you can start exploring the remaining possibilities. Skipping to the more exotic explanations sounds more interesting but it's a poor use of time if there's still low-hanging fruit out there.

maybe, with high frequency radios and improperly shielded cables and chips, the most likely scenario is RF interference?
Improper shielding is an assumption with no evidence as yet. I also mentioned that the ease of verifying the explanation should be a factor. Changing software is usually very easy.
It's so common that it's not an unlikely starting point. EMC is a major issue in high frequency electronics design and the Raspberry Pi has a history of having to redesign certain parts because of not having enough shielding.
I can find [1] on the subject which is quite interesting.

[1]: https://www.element14.com/community/people/PeteL/blog/2012/0...

wrapping tinfoil around an hdmi plug/cable isn't particularly hard either :) chips are harder but at least you rule out the cable. HDMI cables are ridiculously finicky if you've ever tried to get anything more than the lowest common denominator 1080p going on them.
I don't agree that wrapping foil is a great way to 100% rule that out as there is room for error. Using different cables/dongles would be better and they already tried that.
> If there's software anywhere in a system, 99% of the time that's the problem.

Unless USB is involved, then it's something in the USB stack...

USB isn't up to spec on the pi4
Only the power bit, and only one resistor...
Not just that one resistor, they also lack the circuitry to prevent feeding power to that port when powered through other means like PoE.
There are several small-scale WiFi chips that share a clock source with USB - it would be unsurprising to find that the WiFi and video interface are sharing the same clock, so drawing too much from either could directly affect the other.

These kinds of problems are common in embedded computers, like the Pi. Just as common as software.

Clocks will all be buffered due to physical distance between the GPU and WiFi IP core on the SoC so it's unlikely to be a clock loading issue.
Buffering isn't really the problem I was talking about, it was more the shielding of the clock.
For future reference this "Voodoo" is referred to technically as electrical engineering ;)
I don't know much about the Raspberry Pi, but it looks like they chose an ARM core variant without IOMMU, so this might actually be plausible, even though it's such a computer architecture anti-pattern to share system memory DMA across devices.
Can you list which ARM cores you know of that include an IOMMU? I’m personally unaware of any, as that is typically bundled as a separate IP package that must be integrated separately into the system, and is usually customized based on the number of supported masters that require virtualization.

E.g. the Xilinx ZynqMP includes the same Cortex-A53 complex the Raspberry Pi 3 has. They also included CCI-400 coherent interconnect switch to it, and also included the SMMU-500 IOMMU that partially interfaces with the A53 interconnect, but is effectively independently programmed and also controls access to DDR3/4 from the SATA, Displayport and PCIe controllers.
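
To make the role of that separate IOMMU block concrete, here is a toy sketch in Python. It is not how the SMMU-500 or any real IOMMU is programmed; the class, master names, and addresses are all made up for illustration, and real hardware also translates addresses rather than only checking them. It only shows the permission idea: each bus master gets its own set of pages it may touch, and anything else is rejected instead of silently landing in another device's memory.

```python
PAGE = 4096  # toy page size in bytes


class ToyIommu:
    """Hypothetical per-master permission table (no address translation)."""

    def __init__(self):
        self._allowed = {}  # master name -> set of permitted page numbers

    def map(self, master, start, length):
        """Allow `master` to access every page touched by [start, start+length)."""
        first = start // PAGE
        last = (start + length - 1) // PAGE
        self._allowed.setdefault(master, set()).update(range(first, last + 1))

    def check(self, master, addr):
        """Return True if a DMA access by `master` to `addr` would be allowed."""
        return addr // PAGE in self._allowed.get(master, set())


iommu = ToyIommu()
iommu.map("display", 0x1000_0000, 2 * PAGE)  # made-up framebuffer window
iommu.map("wifi", 0x2000_0000, PAGE)         # made-up packet buffer window

print(iommu.check("display", 0x1000_0000 + 2 * PAGE - 1))  # inside its window
print(iommu.check("display", 0x2000_0000))                 # wifi's page: blocked
```

Without such a block in the path, the second access would simply go through, which is the scenario the parent comment is worried about.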

Per the original topic, have they released a full datasheet/reference manual for the Pi 4 SoC yet? I’ve yet to see one other than a VERY high-level overview of its new pieces.

> have they released a full datasheet/reference manual for the Pi 4 SoC yet?

Ha. It's Broadcom... They're never going to release one.

Huh, so that's why the iPhone 6s's SecureROM memory regions weren't MMU-locked... IOMMU doesn't come in ARM by default! So you have to wire it up yourself (in your own IP blocks), and then hook it up in software everywhere you want it to work.

And all that costs extra developer time, and money.

Heh.

http://ramtin-amin.fr/#nvmedma

Best bet is probably the device tree.
What does "stomping on the memory of the network interface" mean?
Really terrible security vulnerability issues? DRI and networking kernel modules should absolutely not be able to interact with each other at all.
"kernel module" together with "should absolutely not be able to interact with each other" are an impossible requirement with Linux.

I think the other operating systems available for the Pi are roughly in the same boat (Windows & RiscOS). There was a nascent Minix port at some point, I wonder if it was abandoned.

Linux is (currently) a monolithic kernel and I'm not sure that can be accomplished without changing this.
The screen memory is taking up so much RAM that it's overlapping with regions of memory the network interface uses.
Resources are allocated via the kernel - it won't hand out overlapping address ranges.
Maybe the misbehaving driver is writing past the end of its requested space though, inadvertently? (I don't know if this is always called a "heap overflow" or if that's just Clang AddressSanitizer.)
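
A toy illustration of the two points above, in Python rather than real driver code: the allocator never hands out overlapping ranges, yet a writer that ignores the length of its own region still clobbers its neighbour. Everything here (the flat `ram`, `alloc`, `buggy_fill`, the buffer names) is hypothetical and only models the idea; it says nothing about how the Pi's drivers actually lay out memory.

```python
RAM_SIZE = 64
ram = bytearray(RAM_SIZE)  # one flat block standing in for shared memory
next_free = 0


def alloc(size):
    """Bump allocator: returns (start, size) and never overlaps earlier regions."""
    global next_free
    if next_free + size > RAM_SIZE:
        raise MemoryError("out of toy RAM")
    region = (next_free, size)
    next_free += size
    return region


def buggy_fill(region, data):
    """Write `data` at the region start WITHOUT checking the region size (the bug)."""
    start, _size = region
    ram[start:start + len(data)] = data


def read(region):
    start, size = region
    return bytes(ram[start:start + size])


framebuffer = alloc(16)  # owned by the pretend display driver
net_buffer = alloc(16)   # owned by the pretend network driver

# The network side stores a 16-byte packet in its own buffer.
ram[net_buffer[0]:net_buffer[0] + 16] = b"PACKET-PAYLOAD!!"

# The display side writes 20 bytes into a 16-byte region.
buggy_fill(framebuffer, b"\xff" * 20)

print(read(net_buffer))  # first 4 bytes of the packet are now 0xff
```

What gets hit depends entirely on what happens to sit next to the overrun region, which is why a different buffer size or allocation order could hide or expose this kind of bug.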
Or something like https://mjg59.dreamwidth.org/11235.html is happening.
That resulted in a wide variety of different failures, from the kernel oopsing to various userspace components crashing. It would be very unusual to have unexpected DMA trigger such a specific failure.

(for avoidance of doubt, I wrote that blog post)

Out-of-bound memory write.
Why would that only interfere with the network driver, rather than tending to crash random userland or crash the kernel?
I don't know, let's see if anyone has an idea about it.

I was just explaining what the OP was asking for. I personally believe it's an EMI-related hardware issue.

I don't agree with how likely this is given the specificity of the bug, but it should be super simple to test.

Try to reproduce with a different OS/kernel.

https://twitter.com/assortedhackery/status/12000566338980290...

Actual measurement that a Pi with HDMI at the affected resolutions radiates over the bottom end of the wifi band.
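
A quick back-of-the-envelope check of why a display clock could land there, as a sketch only. The thread does not state the pixel clocks involved; 148.5 MHz is the standard 1080p60 pixel clock, and 241.5 MHz is an assumed figure for 2560x1440 at 60 Hz with reduced blanking. The function names are made up, and real emissions depend on the TMDS signalling, board layout, and cabling, not just on integer multiples of one clock.

```python
WIFI_24_BAND_MHZ = (2400.0, 2483.5)  # 2.4 GHz ISM band edges


def harmonics_in_band(clock_mhz, band_mhz=WIFI_24_BAND_MHZ, max_harmonic=40):
    """Return (n, n * clock) for every harmonic that falls inside the band."""
    low, high = band_mhz
    hits = []
    for n in range(1, max_harmonic + 1):
        freq = n * clock_mhz
        if low <= freq <= high:
            hits.append((n, freq))
    return hits


def overlapped_channels(freq_mhz, width_mhz=20.0):
    """2.4 GHz channels 1-13 (centres 2412 + 5*(ch-1) MHz) whose nominal
    20 MHz-wide channel contains freq_mhz."""
    centers = {ch: 2412 + 5 * (ch - 1) for ch in range(1, 14)}
    return [ch for ch, c in centers.items() if abs(c - freq_mhz) <= width_mhz / 2]


# Pixel clocks are assumptions, not values taken from the thread.
for label, clock in [("1080p60", 148.5), ("1440p60 (assumed)", 241.5)]:
    hits = harmonics_in_band(clock)
    if not hits:
        print(f"{label}: no harmonic inside the 2.4 GHz band")
    for n, freq in hits:
        print(f"{label}: harmonic {n} at {freq} MHz, channels {overlapped_channels(freq)}")
```

Under those assumptions the tenth harmonic of the higher-resolution clock sits at 2415 MHz, near the bottom of the band, while the 1080p60 clock has no harmonic inside it.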

Mostly because of a known history over the past couple years of USB, WiFi, and/or HDMI causing direct interference with each other. See lots of other comments upthread about similar RF issues people have had, stretching all the way back to 486 laptop keyboards :)
EMI is a headache I deal with daily, on far more sensitive receivers, so voodoo is likely. Though just moving the unit next to the AP (increasing RX signal strength) is an easy diagnosis.
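
To put a rough number on the "move it next to the AP" test, here is a minimal sketch using the idealized free-space path loss formula, FSPL = 20·log10(4πdf/c). The distances and the 2437 MHz carrier (channel 6) are arbitrary example values, and indoor propagation is usually worse than free space, so treat the output as a ballpark only.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def fspl_db(distance_m, freq_hz):
    """Idealized free-space path loss in dB between isotropic antennas."""
    return 20 * math.log10(4 * math.pi * distance_m * freq_hz / SPEED_OF_LIGHT)


freq = 2437e6  # example carrier: 2.4 GHz channel 6
far = fspl_db(10.0, freq)
near = fspl_db(1.0, freq)
print(f"10 m: {far:.1f} dB, 1 m: {near:.1f} dB, gain from moving: {far - near:.1f} dB")
```

Every tenfold reduction in distance buys about 20 dB of received signal in this model. If the link recovers when the unit sits next to the AP, that is at least consistent with locally generated noise raising the receiver's noise floor rather than with a driver bug, though it does not prove it.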
true
It certainly sounds more like a software issue than some arcane effect from RF interference or the like. Could be memory getting smashed, a bus getting saturated, an interrupt not getting serviced, or any similar thing.
Meh. I've done low-level embedded/mobile for a long time now. This actually sounds like a totally reasonable RF interference issue. 2.4GHz is funky & has desense issues with lots of internal busses (not a HW engineer so not sure why that band specifically). Also radios typically have to accept interference which means the radio would "stop working" rather than causing the display to work weirdly (ironically a much easier failure mode to display/diagnose/notice).
When the late 2016 MacBook Pro came out with only USB-C, I had to buy a USB dongle from Amazon (the one included did not have enough ports). If I booted the MacBook in Windows with the dongle connected, the 2.4GHz wifi would stop working while the 5GHz would keep working.
Duly noted! I've been out of the embedded space for a long time (I think the last board I worked with was i386EX based) but I'm getting back into it now with an ESP32 so this might actually come in handy. Thanks! :)