NanoPi R4S — On booting mainline Linux, fixing SD-card support, and a Heisenbug that sucked me into a rabbit hole

I bought this little ARM64 single-board computer. Getting it to work reliably with a mainline Linux kernel dragged me into the world of device trees, voltage regulators, and one of the worst Heisenbugs I’ve encountered so far.

My NanoPi R4S, connected to a well-regulated arsenal of other devices

There’s quite the market for ARM-powered single board computers with Rockchip, Amlogic, Allwinner, Broadcom, and other system-on-chip (SoC) brands at their heart. They deliver on the promise of an energy-efficient yet well-performing system at an affordable price, suitable for a variety of tasks, ranging from sensor control and home automation to TV media centers, network-attached storage, and machine learning applications. An exciting prospect.

Depending on what device you buy, you may or may not get excellent support from the manufacturer, an online community, or maybe even books and magazines with step-by-step instructions. In many cases, the device isn’t fully supported by the “vanilla” Linux kernel, so you’re either stuck with some old kernel build or, by using a newer mainline kernel, you miss out on some features and performance optimizations that haven’t been upstreamed yet.

Out of interest, I’ve bought a few such devices in the sub-$100 range to play with. One that recently caught my attention is the FriendlyELEC NanoPi R4S, a tiny device powered by a Rockchip RK3399 CPU, supplied with 4 GB RAM, a Micro-SD card slot, two USB 3.0 ports, and, most important to my needs, two Gigabit Ethernet ports: I want to build a Linux router that can hide some of my development servers behind a firewall, accessible via VPN, and with enough headroom to run a few other services like a web server or file server. I guess if limited RAM wasn’t an issue, an old Wifi router would also be sufficient for that task. OpenWRT is a great software solution for this use case.

When I got the device, OpenWRT was actually the first thing I tried running. OpenWRT provides a lot of good documentation around a variety of boards, including the R4S. There’s also an OpenWRT forum thread for the R4S, which provides further tips and tricks, such as improving Ethernet throughput. I can confirm that the R4S reaches and even surpasses the 934 Mbps claimed by FriendlyELEC.

I know, FriendlyELEC has their own fork called FriendlyWRT, and they seem to provide some good guidance and documentation on how to run that or their Ubuntu fork FriendlyCore on the machine. Personally, however, I prefer to keep critical systems as close as possible to upstream, meaning the mainline Linux kernel and a major distribution, so their forks just aren’t an option for me. I can only hope that any features missing upstream will eventually be ported there.

Getting things to work once is one thing; getting them to work reliably in a maintainable way is much harder than one may think, especially when it comes to ARM-based single board computers. Why? Because, very much unlike older x86 systems with BIOS or EFI, the entire boot process is pretty much different for each of them (read: wildly incompatible among SoC manufacturers), from detecting RAM and storage, setting CPU clock and voltage, and loading the kernel, to detecting built-in and external devices and actually being able to reboot the system.

Therefore, when you download OpenWRT, you have to find the exact system image suitable for your ARM board, or your system won’t boot. The R4S isn’t fully supported yet by a release-quality version of OpenWRT, but at least a so-called “snapshot” build (read: unstable) is available for tinkering. A lot of hard work went into supporting these systems, and OpenWRT does a great job documenting these devices.

Thanks to these efforts, setting up the device was relatively easy, despite it being of “snapshot” quality: Download an image file and write that onto a MicroSD card. The R4S boots up OpenWRT quickly, and, once connected to Ethernet, can be configured via ssh and a web browser. Unfortunately, I wasn’t able to get the second Ethernet port to work; it just wouldn’t let me send or receive any data. Since that was a hard requirement for my use case, I started exploring the other option: running a “proper” Linux environment.

This is the point where things got interesting, because it got me off the beaten path. I wanted to boot an existing Linux distribution like my favorite, Alpine Linux. It does exist for ARM64 but doesn’t yet support booting from my NanoPi R4S, so all the heavy lifting a package like OpenWRT already does needs to be replicated. Thankfully, this provided me with an exciting opportunity to learn more about the entire boot process than I ever wanted to know.

In a nutshell, to boot Linux off an ARM board, you need a series of “boot loaders” that initialize the board and load the kernel from disk. As I said earlier, this is very much non-standard, and every chipset’s and manufacturer’s approach is a little different here. I’m going to skip over a lot of details around that because this post really isn’t about the boot loader, it’s about what happens when you think you got it working to the point where the Linux kernel starts up.

A common boot loader for ARM boards is “Das U-Boot”. Usually, vendors ship an outdated build with a certain configuration, tweaked to make it “just work”. Using U-Boot from mainline source involves a few more steps, and all I can say is I’m happy I got that part working as well.

When I finally managed to boot the Linux 5.19 kernel from U-Boot, I was greeted with a cryptic message about problems in the Linux kernel’s MMC subsystem, which is responsible for accessing the SD card:

    mmc1: problem reading SD Status register
    mmc1: error -110 whilst initialising SD card

Being unable to boot the kernel off the SD card, I looked into booting over the network using iPXE, a boot loader I had experimented with for an earlier project.

So my plan was to get U-Boot to boot iPXE, which would download a script from a local server that would then boot Linux. Piece of boot, err, cake! Whatever MMC bug there was, I would not be blocked by it because now there was no need to access the SD card from Linux; everything would be loaded from the network directly into RAM. As an additional benefit, debugging the MMC bug was simplified, now that even the Linux kernel gets loaded off the network: I wouldn’t have to juggle the SD card (or use a USB device that can simulate one).

Ejecting the card from the NanoPi, inserting it into a working computer, writing a new kernel to it, then again ejecting and reinserting the SD card: thanks to network booting, all that was no longer necessary. All I needed was another (faster) machine to build the Linux kernel, then store the kernel binaries on the server that iPXE picks up from. I would then “simply” have to poke around the kernel code, rebuild it on the other computer, and then reboot the NanoPi over ssh until the MMC bug was gone. Or so I thought, because rebooting the NanoPi just didn’t work either!

The NanoPi R4S would alternately just hang upon reboot (after printing reboot: Restarting system) or, if I was “lucky”, reboot back to U-Boot but then get stuck with another MMC-related error:

    Trying to boot from MMC2
    mmc_load_image_raw_sector: mmc block read error
    SPL: failed to boot from all boot devices
    ### ERROR ### Please RESET the board ###

After analyzing several patches that distributions typically add for Rockchip boards, I found this little gem from 2019. It hasn’t been merged to mainline, but it’s commonly included by distributions to this day. The change mentions that newer SD cards (“UHS Ultra-High-Speed”) would use a different signal voltage (1.8V) than older SD cards (3.0V), that U-Boot expects the MMC/SD card system to be in 3.0V mode, and that it would just hang if that unexpected condition was hit. So one workaround would be to get a non-UHS SD card, but that’s no real solution.

The patch works around the U-Boot bug by setting the signal voltage back to 3.0V at an opportune moment in the Linux kernel upon reboot, before control is relinquished back to U-Boot. I learned that the signal voltage is controlled by a “voltage regulator” component that is controlled via a specific GPIO (“general purpose I/O”) pin. Note that, at this point, I was in waters very much unknown to me. I didn’t really know much about this hardware stuff at all, but it was very fascinating to learn that I was not alone.

The Linux kernel typically learns about all these configurations, regulators, I/O memory addresses, etc., from “device trees”: configuration files that are SoC- and board-specific.

It turns out the Linux kernel doesn’t really want to bother at all with the detailed decisions an ARM board manufacturer makes when putting together a single board computer, like which GPIO pin maps to which function, be it the SD card voltage, the power and activity LEDs, and so on. The device tree configuration takes care of the heavy lifting. This in turn reduces the need for a custom kernel fork patched by the SoC vendor; long-running forks are really quite hard to maintain.

Device trees (or: devicetrees) are compiled into a binary using the “device tree compiler”, using source files that are declarative in nature, looking like C and JSON had an extramarital affair. The device tree files are included with the Linux kernel (and also U-Boot) but also need to be specified upon boot, which means the kernel binary can be reused among different boards but won’t function correctly unless the correct “dtb” (device tree blob) is specified as a kernel boot parameter. Using a custom dtb can overclock, undervolt or simply brick your device. So much power in a little file.

Understanding what configuration actually gets used is a bit tricky when studying the source code: the device tree compiler supports preprocessor includes, meaning that there are several sources shared among a family of boards and chipsets, with subtle differences in the final configuration: defaults can be overridden for a specific board, naming conventions vary even for boards with the same CPU but different vendors, and so on.

        
      
    /dts-v1/;
    #include <dt-bindings/input/linux-event-codes.h>
    #include "rk3399.dtsi"
    #include "rk3399-opp.dtsi"

    / {
        //...
        vcc3v0_sd: vcc3v0-sd {
            compatible = "regulator-fixed";
            enable-active-high;
            gpio = <&gpio0 RK_PA1 GPIO_ACTIVE_HIGH>;
            pinctrl-names = "default";
            pinctrl-0 = <&sdmmc0_pwr_h>;
            regulator-always-on;
            regulator-min-microvolt = <3000000>;
            regulator-max-microvolt = <3000000>;
            regulator-name = "vcc3v0_sd";
            vin-supply = <&vcc3v3_sys>;
        };
        //...
    };

A snippet from the devicetree configuration for RK3399 NanoPi devices, showing the SD-card voltage regulator
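
The layering of shared include files and board-specific overrides can be pictured with a toy model in Python. This is not how the device tree compiler is implemented; it merely treats nodes as nested dictionaries in which later definitions win, and the node and property values shown are illustrative rather than copied from a real board file.

```python
def merge_nodes(base, override):
    """Return a copy of base with override applied; nested nodes merge recursively."""
    merged = dict(base)
    for name, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(name), dict):
            # Both sides describe the same node: merge their properties.
            merged[name] = merge_nodes(merged[name], value)
        else:
            # A property (or a brand-new node): the later definition wins.
            merged[name] = value
    return merged


# Defaults as a shared SoC-level include might declare them (hypothetical values).
soc_defaults = {"sdmmc": {"status": "disabled", "max-frequency": 150000000}}

# A board file enables the node and adds a property of its own.
board_overrides = {"sdmmc": {"status": "okay", "bus-width": 4}}

effective = merge_nodes(soc_defaults, board_overrides)
```

In this model, `effective` keeps the shared default for max-frequency while the board's status and bus-width take effect, which is why reading a single file rarely tells you the final configuration.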

When I compared the devicetrees of certain RK3399 boards, I noticed that the GPIO pins used for the SD card voltage regulator varied, certain parameters were missing in some configs, and some configs had extra parameters that I hadn’t seen elsewhere.

After a bit of trial and error, which included deliberately using the wrong device tree (something I wouldn’t recommend since it could damage the board), the system would boot up without problems. With the one kernel patch mentioned above, it would also no longer hang upon warm-booting into U-Boot. I was on to something!

I thought that I had found the reason why the device wasn’t working properly: I assumed that the devicetree config for R4S just had the wrong GPIO pin configuration, and since changing it to the one I had found in the other devicetree fixed it, I thought it was the end of the story.

So I followed my civic duty and submitted a patch to the Linux kernel mailing list: “arm64: dts: rockchip: Fix SD card init on rk3399-nanopi4”.

This in turn sucked me deeper into the rabbit hole of SD cards, GPIO pins and voltage regulators. Remember, all I wanted was to get my NanoPi R4S to boot up Linux without errors.

My patch changed the GPIO pin for the SD-card voltage regulator vcc3v0-sd from RK_PA1 to RK_PD6, which I had seen in the other RK3399 DTS file (rk3399-roc-pc.dtsi for the Firefly ROC-RK3399-PC), and it worked.

But it didn’t work for the right reason, as I learned from the helpful replies on the kernel mailing list. My patch was actually just pointing the regulator at a nonexistent GPIO, and thus the kernel would just do nothing with that voltage regulator. In other words, U-Boot had set up the voltage correctly, and the kernel wouldn’t even try to do anything else with it. So the patch wasn’t a fix, it was just a happy accident. I then understood that “RK_PD6” alone doesn’t refer to a specific GPIO pin; the GPIO bank preceding it (&gpio0) is also important, since there are multiple GPIO banks, and I just happened to change it to something that worked.
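
To make the bank-plus-pin naming concrete, here is a small Python sketch that translates a schematic-style name such as GPIO0_A1 into the two parts a devicetree GPIO reference needs. It assumes the usual Rockchip layout of four ports (A to D) with eight pins each per bank, which is how the RK_PA0 through RK_PD7 constants count from 0 to 31. The function names are made up for this illustration, and the sketch does not check whether a given bank or pin actually exists on a particular SoC.

```python
import re

PORTS = "ABCD"        # four ports per GPIO bank
PINS_PER_PORT = 8     # eight pins per port, so 32 pin slots per bank


def parse_pin(name):
    """Split a schematic name like 'GPIO4_D6' into (bank, offset within the bank)."""
    match = re.fullmatch(r"GPIO([0-9])_([A-D])([0-7])", name)
    if match is None:
        raise ValueError(f"not a Rockchip-style pin name: {name!r}")
    bank = int(match.group(1))
    port = PORTS.index(match.group(2))
    pin = int(match.group(3))
    return bank, port * PINS_PER_PORT + pin


def dt_reference(name, flags="GPIO_ACTIVE_HIGH"):
    """Format the pin the way it appears in a devicetree 'gpio' property."""
    bank, offset = parse_pin(name)
    port = PORTS[offset // PINS_PER_PORT]
    pin = offset % PINS_PER_PORT
    return f"<&gpio{bank} RK_P{port}{pin} {flags}>"
```

Calling `dt_reference("GPIO0_A1")` yields the reference used in the NanoPi devicetree, while `parse_pin("GPIO4_D6")` shows that RK_PD6 is merely offset 30 inside whichever bank is named next to it.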

Removing the line declaring the GPIO pin would also “work”, but definitely be no viable option for the kernel maintainers: the Linux kernel would just lose the ability to control the voltage regulator entirely, even if that unbroke some aspect of its main purpose.

So how does one know what the right GPIO pin is and how this all works together? Well, you can see the devicetree configuration as a derivative or simplified form of the board/SoC schematics. Thankfully, most schematics are available online, even though you may have to put a few puzzle pieces together.

FriendlyELEC thankfully provides the schematics for the NanoPi R4S as a searchable PDF, and searching for VCC3V0_SD yields two matches:

NanoPi R4S schematics: GPIO0_A1 is an RK3399 GPIO pin, controlling the SDMMC0 power

There’s also a match for the pinctrl reference sdmmc0_pwr_h:

With NanoPi R4S, the SDMMC0 power is fed from a 3.3V power source (`VCC3V3_SYS`) and controlled using an RT9193 "Ultra-Fast CMOS LDO Regulator"

I’m not totally sure how this works in detail. I can’t quite understand the schematics, but I think it’s fair to say that the 3.0V rail (VCC3V0_SD) can be toggled on/off, whereas the 1.8V rail (I have no idea which power rail it is) is always connected to an SD-card pin that is only present on UHS cards.

In any case, I find it rewarding that I now somewhat better understand how these things are connected. For a generally better explanation from a generally better person, be sure to check out Louis Rossmann’s 15 minutes about power rails, and watch till the end for a relevant rabbit metaphor.

Equipped with this understanding about how RK3399 boards control the power for SD-cards, I think I can claim that the &gpio4 RK_PD6 GPIO_ACTIVE_HIGH reference for vcc3v0_sd in rk3399-roc-pc.dtsi is wrong. In the corresponding, and sadly quite incomplete, technical document for that board, I couldn’t even find a reference to GPIO4_D6. Since I don’t own such a Firefly board, I can’t speak more to it, but my guess is that any RK3399 configuration that uses any GPIO other than GPIO0_A1 is wrong; GPIO4_D6 is the pin used by RK3308 SoCs (where it is called SDMMC_PWREN), so its appearance here may really be just a copy-paste mistake. And it works just the same way my original patch “worked”: by luck.

(EDIT: Markus Reichl pointed me at another schematic PDF for the Firefly ROC-RK3399-PC which indicates that they indeed use GPIO4_D6 for their voltage regulator – so another lesson learned: get any available documentation for your board and get comfortable dealing with incomplete or even conflicting specs).

Back on my Linux kernel mailing list thread, Robin Murphy from ARM helped put the problem in context. My issue reminded him of the “Tinkerboard problem” (which certainly sounds better than my “R4S situation”), where the signal voltages for the SD card were indeed at the wrong level (1.8V vs 3.0V). In fact, the patch I had found earlier was exactly the remedy for that: it fixed the reboot-to-U-Boot case but obviously not everything.

Robin further explained that the issue I’m seeing could be a slow voltage regulator at fault on my particular board. Yes, it turns out there’s such a thing as “slowness” in the digital-analog realm of modern computing!

Setting a GPIO pin from 0 to 1 doesn’t mean the effect is immediate. The connected regulator may need hundreds of microseconds or even milliseconds to reach and stabilize at the desired voltage, an eternity in relative terms when CPU operations usually take nanoseconds (i.e., billionths of a second). Be sure to see this demonstration by Admiral Grace Hopper for an amazing visualization of nanoseconds.

Robin was kind enough to hook up an oscilloscope to measure the regulator voltage change on his RK3399-powered NanoPC-T4 (not quite my R4S, but close enough). It turns out it takes around 160 microseconds to truly reach the target voltage (which is not nothing, and definitely more than the 50 microseconds we see in the regulator specs, although the voltage is “roughly” right after that time), but he didn’t see any problems on the MMC driver side even when cycling (unbinding and rebinding to re-trigger the regulator).

Oscilloscope plot showing a 161 us delay until the regulator voltage settles; courtesy of Robin Murphy
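
For a sense of scale, the arithmetic can be sketched in a few lines of Python. The 1800 MHz clock is an assumption for illustration (roughly what the faster RK3399 cores are clocked at), and counting raw clock cycles rather than real instructions is a deliberate simplification.

```python
def cycles_elapsed(delay_us, clock_mhz):
    """Clock cycles that pass during delay_us microseconds.

    One MHz is one cycle per microsecond, so this is a plain product.
    """
    return delay_us * clock_mhz


ASSUMED_CLOCK_MHZ = 1800  # illustrative value, not measured on the board

spec_cycles = cycles_elapsed(50, ASSUMED_CLOCK_MHZ)       # figure from the regulator specs
measured_cycles = cycles_elapsed(160, ASSUMED_CLOCK_MHZ)  # roughly what the scope showed
```

Under that assumption, 288,000 clock cycles go by while the regulator output is still settling, which is plenty of time for driver code to start talking to an SD card that isn't properly powered yet.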

Captivated by this insight, the neither-owning-nor-knowing-how-to-use-an-oscilloscope-me tried to look for devicetree configuration options that may “properly” fix the issue. After all, looking through the documentation, there is a variety of parameters to play with.

I stumbled upon the regulator-uv-protection-microvolt devicetree option, which is supposed to guard against undervoltage situations, something that was close enough to merit an experiment. When I set the undervoltage value to 3.0V and rebooted the machine, Linux would suddenly no longer fail to detect the SD-card!

Well, except when I tried a couple more times, it actually did fail again. I then assumed that — sigh — this is deep in analog territory, and 3.0V might be just too high of a limit, so let’s try the minimum voltage SD cards are supposed to handle, which is 2.7V, and, indeed, a couple more reboots continued to show my intuition was right. At least until I submitted the revised patch…

When Robin thankfully chimed in again, it was clear that something else was at play here.

Well this has to be in the running for “weirdest placebo ever”… :/

It turns out that the R4S’s voltage regulator is not very complex, and in fact not capable of being controlled against undervoltage. So the new devicetree setting was again somehow changing the parameters that triggered the problem, but it was no solution. Robin quipped this had to be in the running for the “weirdest placebo ever”.

Robin clarified that all this setting did was to write a warning to the kernel log (“IC does not support requested under voltage limits”). Maybe, he assumed, the regulator was being turned off and on again by the regulator code, and writing that line took long enough to act as a proper delay for the regulator to reach its target voltage.

Equipped with an oscilloscope, Robin was actually able to verify his hypothesis!

…and apparently the answer is yes, it seems to be doing exactly that (see attached). But seemingly my SD cards don’t mind, or maybe my T4 board happens to have more capacitance than Christian’s R4S so my voltage dip isn’t as bad, or both.

Oscilloscope plot showing an intermittent voltage drop caused by double-toggling; courtesy of Robin Murphy

That brought us closer to understanding what was going on. I felt both very unlucky (R4S board with low capacitance, maybe the wrong SD card) and very excited at the same time! The “proper” solution was close.

Robin suggested removing the regulator-always-on statement from the devicetree setup, which means Linux would not try to toggle the regulator until the MMC driver actually needed it. Removing a line instead of adding anything new, I like that!

Unfortunately, even with the “tinkerboard” patch this change broke rebooting the machine… Without the regulator-always-on, the kernel tries to deactivate the voltage regulator before rebooting, and that somehow causes the system to lock up.

On the other hand, we had a fix. Robin’s initial hunch, hardcoding a delay in the regulator setup code (set_machine_constraints), turned out to be correct and working, and is practically identical to specifying a devicetree setting, off-on-delay-us.

Of course, fixing the devicetree instead of the kernel means we could unbreak existing kernels by supplying them with a new blob, which could be preferable to requiring an upgrade to the bleeding edge. off-on-delay-us adds a specific, constant delay, so a quick toggling of the regulator would not cause the observed glitch. This sounds like the right approach, because the delay is handled by existing kernel code, so even the most basic voltage regulator will support it:

        
    vcc3v0_sd: vcc3v0-sd {
        off-on-delay-us = <160000>;
        /* remaining properties unchanged */
    };

The proposed change to the NanoPi configuration

Except, I found, that delay is not honored in our case!

Deep in the kernel’s regulator code, whenever a regulator is turned off, a “last off” timestamp is updated (and then later checked against the current time) except when the regulator is marked as “always on” or “boot-on” (= “bootloader has enabled it already, but we can turn it off”). Omitting the last-off assignment in that case seems like a reasonable micro-optimization, but it actually fails to capture the quick toggling that Robin observed with the oscilloscope.
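
The effect is easier to see in a toy model. The Python sketch below is not the kernel’s code: the class, its fields, and the `track_always_on` switch are invented for illustration, and time is passed in as plain microsecond integers. It only mimics the behavior described above, where the timestamp is skipped for always-on regulators so that a quick off/on cycle never has to wait.

```python
class ToyRegulator:
    """Simplified stand-in for a regulator with an off-on delay (hypothetical)."""

    def __init__(self, off_on_delay_us, always_on, track_always_on):
        self.off_on_delay_us = off_on_delay_us
        self.always_on = always_on
        # False mimics the behavior described in the post; True mimics a fix.
        self.track_always_on = track_always_on
        self.last_off_us = None

    def disable(self, now_us):
        """Record when the regulator was switched off, unless that is skipped."""
        if self.always_on and not self.track_always_on:
            return  # the "micro-optimization": no timestamp is stored
        self.last_off_us = now_us

    def wait_before_enable(self, now_us):
        """Microseconds still to wait before switching on again."""
        if self.last_off_us is None:
            return 0  # nothing recorded, so no delay is enforced
        elapsed = now_us - self.last_off_us
        return max(0, self.off_on_delay_us - elapsed)


# Switch off at t=1000 us, then try to switch on again only 10 us later.
skipping = ToyRegulator(160000, always_on=True, track_always_on=False)
skipping.disable(1000)
wait_when_skipped = skipping.wait_before_enable(1010)

tracking = ToyRegulator(160000, always_on=True, track_always_on=True)
tracking.disable(1000)
wait_when_tracked = tracking.wait_before_enable(1010)
```

In the first case the model asks for no wait at all, so the configured delay is silently ignored; in the second it asks for nearly the full delay, which is the behavior one would expect from the devicetree setting.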

So even though we have a bouquet of configuration options for the devicetree, presently it looks like we have to patch the kernel.

I submitted a patch to remedy the off-on-delay bug, and it thankfully got merged into linux-next in time for the Linux 6.0 release.

Looking deeper into why there was a double-init, toggle or something like that around the set_machine_constraints code in regulator core, I added several debug log statements to see what regulator was initialized when and how. Yes, debugging by printing is the way to go here.

A couple of hard reboots later I had figured it out. By then, I had connected the R4S to a switchable USB hub: the R4S doesn’t have a power button, and yanking the cable out every time I needed a restart lost its appeal about 5 hours into this chase. I had already thought about automating this toggle as well. This probably would’ve led me deeper into the rabbit hole, making me hook up an ESP32 USB relay for the power supply so I could debug this situation without having to walk up to the device at all. Alas, that’s for another project.

Back to the regulator code, drivers/regulator/core.c to be specific. There are several mentions of issues resolving supply names early (since a voltage regulator is supplied by voltage coming from somewhere else — probably another regulator — the kernel needs to do this in an efficient way). My debugging has shown that, with the existing regulator initialization code, we indeed toggle vcc3v0-sd twice in rapid succession! Furthermore, if we initialized supply names and constraints a little earlier in the registration code, a double-initialization can be avoided, and we can stop worrying about the off-on-delay and its implications on devicetree configurations!

That patch, “regulator: core: Resolve supply name earlier to prevent double-init” (early discussion here, and here) has been merged for 6.1 as well.

My understanding is that my fixes are addressing issues that are not specific to my NanoPi R4S at all. There’s a good chance that other hard-to-reproduce issues, maybe around certain MMC/SD cards on other boards, or entirely different areas where voltage regulators are cycled too quickly, are now suddenly fixed. We may never know.

This excursion took way longer than anticipated, yet it was truly rewarding for me in several ways. Not only did I learn a lot of new tricks at the very low level intersection of ARM hardware and the kernel, I also really enjoyed working with the kernel community to come up with a solution that not only “works for me” but would be the proper fix, resulting in several patches that are likely to be included in the upcoming Linux 6.1 release.

Lastly, it made me doubt my ability as a software engineer, in a good way. Bugs like these, Heisenbugs, the ones that change their behavior at the slightest observation or attempt to fix them (such as adding a debug log line or specifying a new parameter), are rare, but they are a humbling reminder that all your time spent on them is especially worth it. Somewhere, someone (who would have just given up or bought a slower SD card instead) can now enjoy their new ARM computer a little more, and this satisfying thought will let me sleep a little better tonight.
