ESP32: Critical, fatal source of errors in some WROVER-E modules

Note: For hörberts sold between October 2021 and February 7, 2022, we urgently recommend an immediate firmware update. See: https://www.hoerbert.com/firmware

All hörberts sold after February 7, 2022 are no longer affected.

A sleeping bug

hörbert’s new electronics, which we have been manufacturing since October 2021 and which add many new functions to hörbert, are based on a very popular, feature-rich processor module: the WROVER-E from Espressif, which is widely deployed worldwide. As you might expect, hörbert uses the variant with the largest available built-in flash memory, so that there is enough space for our firmware now and in the future.

Just like you, we cannot tell by looking at this board that a bug is waiting inside the processor module to sabotage the module itself.

hörbert pcb v2.0

Unfortunately, there is a source of error lurking under the hood of these modules.

Flash is the same as Flash

On the WROVER-E modules that we purchase ready-made, the processor and two memory chips sit beneath the shielding metal hood. One of them is the flash memory, where the firmware remains stored even when the processor module has no power.
As is common in the market, the manufacturer uses different flash memory chips with different sizes and functions from different manufacturers.

So far, so normal.

Similar yet different

All Flash chips used by Espressif have very similar functions and specifications.

One of these functions, which many Flash memories have, is a write protection that can be enabled through special commands. The commands for enabling the write protection differ between products.

And the write protection is part of the problem, though only for one of probably five interchangeable types of flash chip.

Some of our WROVER-E modules contain a flash chip from the company XMC. Unfortunately, we cannot know exactly which modules, or how many hörberts, contain this chip, because it is part of the processor module, which we do not manufacture ourselves. In our case, the chip is the XM25QH128C (see its datasheet).

Due to an unknown cause, which even Espressif has not yet found, the flash chip can receive random, perhaps even nonsensical commands. These commands completely corrupt the flash chip. Worse, as a side effect of the error, the write protection of this flash chip is also activated, making it impossible to erase or recover the chip.
The core problem is that the corresponding status bits for the write protection are set irreversibly.

As a result, the chip becomes useless for all further operations and must be physically replaced by us with a new chip.

The solution: As you do to me, so I do to you!

The support team from Espressif has sent us a solution in the form of a code snippet, which we promptly integrated into a new hörbert firmware. The downside: the solution only helps if it is applied before the error strikes. The upside: once applied, it prevents the error from striking at all.

And this is what the solution looks like:

Since we do not know the cause of the error (garbage data?), but only its effect (write protection!), we prevent the write protection from being set, and we do so permanently! We do not need this write protection for hörbert’s functionality, so we simply shut the door in the face of the source of the error.

Therefore, even if the flash chip receives erratic commands in the future, it can no longer self-lock.

For programmers

Inside the WROVER-E module, the XM25QH128C chip is connected via SPI. In addition to the write and read commands, there are commands to set the three 8-bit status registers SR1, SR2, and SR3.
The bits SRP1 (“Status Register Protect 1”) in Status Register 2 and SRP0 (“Status Register Protect 0”) in Status Register 1 are responsible for write protection.

Setting these two bits to “1” locks the modification of all three status registers permanently, because these Status Register Protect bits are “OTP” (one time programmable) bits. Once set, they cannot be cleared. Therefore, in our new firmware, we set all three status registers to appropriate values needed for our hörbert, and then set both Status Register Protect bits. This can be done only once.

In our case, we can safely set the registers to the combined value 0x600380 (S1=0x80, S2=0x03, S3=0x60). These are the correct values for our firmware and functions, and once the SRP0 and SRP1 bits are set, they will stay that way.

How do I find out more about the flash chip?

The chip ID can be read using esptool.py.
In our case, the manufacturer is “20” (XMC) and the chip ID is 4018 (128 Mbit flash).

> esptool.py flash_id
esptool.py v3.1-dev
Found 2 serial ports
Serial port /dev/ttyUSB0
Connecting........_
Detecting chip type... ESP32
Chip is ESP32-D0WD-V3 (revision 3)
Features: WiFi, BT, Dual Core, 240MHz, VRef calibration in efuse, Coding Scheme None
Crystal is 40MHz
MAC: 94:3c:c6:c1:55:e4
Uploading stub...
Running stub...
Stub running...
Manufacturer: 20
Device: 4018
Detected flash size: 16MB
Hard resetting via RTS pin...

The status registers can also be read using esptool.py. Here is an example of a misconfigured chip whose status registers hold the value 0xe37bfc (S1=0xfc, S2=0x7b, S3=0xe3):

> esptool.py read_flash_status --bytes 3
esptool.py v3.1-dev
Found 2 serial ports
Serial port /dev/ttyUSB0
Connecting...
Detecting chip type... ESP32
Chip is ESP32-D0WD-V3 (revision 3)
Features: WiFi, BT, Dual Core, 240MHz, VRef calibration in efuse, Coding Scheme None
Crystal is 40MHz
MAC: 94:3c:c6:c1:55:e4
Stub is already running. No upload is necessary.
Status value: 0xe37bfc
Hard resetting via RTS pin...

What does the error look like on the console?

This is the output you see once the error has occurred and the flash chip has become unusable. The WROVER-E module is stuck in an endless boot loop and does nothing except repeatedly try to load the second-stage bootloader.
ets Jul 29 2019 12:21:46

rst:0x1 (POWERON_RESET),boot:0x33 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0030,len:380
ho 0 tail 12 room 4
load:0x07800000,len:3378177
ets Jul 29 2019 12:21:46

... and so on ...

Talk is cheap, show me the code!

Disclaimer: Do not use this code for your project if you do not understand which status bits you are allowed to set and which not!

Fortunately, Espressif sent us this code as a basis for our own fix. It requires Espressif’s esp-idf. It runs from RAM and first checks for the flash ID 0x204018. It then reads the status registers, sets the essential Status Register Protect bit SRP1 in S2, and sets bit SRP0 in S1. Together, these two bits make up the actual fix.

#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "soc/spi_reg.h"
#include "esp32/rom/spi_flash.h"
#include "esp_spi_flash.h"
#include "esp_task_wdt.h"
#include "soc/spi_struct.h"

... and so on ...

The fix is easiest to understand by looking at the status registers of the flash chip.

The status registers of the flash chip in use
