Crash dump after upgrading to 2.20

I have several hundred field devices that have been running 2.18 for the last couple of years. I recently did some maintenance updates and also migrated to 2.20 for this particular hardware version. Most of them seem to be OK. However, I’ve now had a handful that go offline a few days after the upgrade and don’t come back online.

I managed to get my hands on one that was fairly local and realised that it was core dumping immediately when a lock closed (so probably a power issue, but only after the firmware update).

Here’s the core dump:

Loaded core dump from last snippet in  /core
0x4008fdfa in multi_heap_malloc_impl (heap=0x3f800000, size=20) at /opt/Espressif/esp-idf/components/heap/multi_heap.c:432
432	        MULTI_HEAP_ASSERT(is_free(b), b); // block should be free
#0  0x4008fdfa in multi_heap_malloc_impl (heap=0x3f800000, size=20) at /opt/Espressif/esp-idf/components/heap/multi_heap.c:432
#1  0x400838e6 in heap_caps_malloc (size=20, caps=5120) at /opt/Espressif/esp-idf/components/heap/heap_caps.c:111
#2  0x40083b15 in heap_caps_calloc (n=<optimized out>, size=20, caps=5120)
    at /opt/Espressif/esp-idf/components/heap/heap_caps.c:329
#3  0x40083b7a in heap_caps_calloc_prefer (n=1, size=20, num=1) at /opt/Espressif/esp-idf/components/heap/heap_caps.c:231
#4  0x400861fd in wifi_calloc (n=1, size=20) at /opt/Espressif/esp-idf/components/esp32/esp_adapter.c:92
#5  0x4008623c in wifi_zalloc_wrapper (size=20) at /opt/Espressif/esp-idf/components/esp32/esp_adapter.c:100
#6  0x40115cbb in esp_wifi_sta_get_ap_info ()
#7  0x4010ceec in mgos_wifi_sta_get_rssi () at /Users/gadget-man/Documents/iParcelBox/3-Firmware/iParcelBox_firmware/deps/wifi/src/esp32/esp32_wifi.c:599
#8  0x401020fd in ffi_call (func=0x4010cee4 <mgos_wifi_sta_get_rssi>, nargs=0, res=0x3ffb82f8 <mgos_task_stack+15376>, args=0x3ffb8220 <mgos_task_stack+15160>)
    at /Users/gadget-man/Documents/iParcelBox/3-Firmware/iParcelBox_firmware/deps/modules/mjs/mjs.c:7434
#9  0x40106859 in mjs_ffi_call2 (mjs=0x3ffdd870) at /Users/gadget-man/Documents/iParcelBox/3-Firmware/iParcelBox_firmware/deps/modules/mjs/mjs.c:10892
#10 0x40108d49 in mjs_execute (mjs=0x3ffdd870, off=<optimized out>, res=0x3ffb8448 <mgos_task_stack+15712>)
    at /Users/gadget-man/Documents/iParcelBox/3-Firmware/iParcelBox_firmware/deps/modules/mjs/mjs.c:9639
#11 0x40109c54 in mjs_apply (mjs=0x3ffdd870, res=0x3ffb84a8 <mgos_task_stack+15808>, func=4611389493885322285, this_val=<optimized out>, nargs=nargs@entry=1, args=args@entry=0x3ffde7b4)
    at /Users/gadget-man/Documents/iParcelBox/3-Firmware/iParcelBox_firmware/deps/modules/mjs/mjs.c:9955
#12 0x40109e89 in ffi_cb_impl_generic (param=<optimized out>, data=<optimized out>) at /Users/gadget-man/Documents/iParcelBox/3-Firmware/iParcelBox_firmware/deps/modules/mjs/mjs.c:10365
#13 0x4010a099 in ffi_cb_impl_wpwwwww (w0=1073604396, w1=1074281972, w2=0, w3=1073600032, w4=1073447888, w5=1)
    at /Users/gadget-man/Documents/iParcelBox/3-Firmware/iParcelBox_firmware/deps/modules/mjs/mjs.c:10430
#14 0x400e66d9 in mgos_timer_ev (nc=<optimized out>, ev=<optimized out>, ev_data=<optimized out>, user_data=0x3ffcd7e4) at /mongoose-os/src/mgos_timers.c:101
#15 0x40187de5 in mg_call (nc=0x3ffcd7f4, ev_handler=0x400e6600 <mgos_timer_ev>, user_data=0x3ffcd7e4, ev=6, ev_data=0x3ffb85a0 <mgos_task_stack+16056>) at src/mg_net.c:78
#16 0x401893d5 in mg_timer (now=1716575984.1110661, c=0x3ffcd7f4) at src/mg_net.c:100
#17 mg_if_poll (nc=0x3ffcd7f4, now=1716575984.1110661) at src/mg_net.c:139
#18 0x4018c4b3 in mg_lwip_if_poll (iface=<optimized out>, timeout_ms=<optimized out>) at src/common/platforms/lwip/mg_lwip_ev_mgr.c:119
#19 0x4019ee68 in mg_mgr_poll (m=0x3ffbdd0c <s_mgr>, timeout_ms=0) at src/mg_net.c:283
#20 0x40184440 in mongoose_poll (ms=0) at /data/tmp/mos_prebuild/tmp/cesanta/mos-libs/mongoose/src/mgos_mongoose.c:61
#21 0x400859ee in mgos_mg_poll_cb (arg=<optimized out>) at /Users/gadget-man/Documents/iParcelBox/3-Firmware/iParcelBox_firmware/deps/freertos/src/mgos_freertos.c:103
#22 0x40085bb0 in mgos_task (arg=<optimized out>) at /Users/gadget-man/Documents/iParcelBox/3-Firmware/iParcelBox_firmware/deps/freertos/src/mgos_freertos.c:222

Can any of you lot much smarter than me help me work out from this:
a) what’s causing the core dump (other than just that it’s a memory allocation error)?
b) why it’s only doing it on 2.20, when it was absolutely rock solid on 2.18?

Whatever that glitch was causing before, it went unnoticed because 2.18 was hiding it. Now it is exposed.
Maybe 2.20 consumes more power and some juiced-out caps can’t handle it.
Maybe some pin fired a rogue irq that went to a safe place that now is not safe, or some rogue pointer wrote where it did no harm and now it does…
If the logs are consistent, I’d bet on the latter (a rogue write); if they’re not, then the former (power supply drops).
Try adding a big cap as close to the micro as you can; that is the easy thing to rule out first.
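
To chase the rogue-write theory: the assert at multi_heap.c:432 ("block should be free") fires when the allocator picks a block off its free list and finds it is not marked free, so the heap metadata was most likely damaged some time before this malloc, and the RSSI call is probably just the first thing to trip over it. ESP-IDF has heap_caps_check_integrity_all(), which walks the heaps on demand. Below is a rough sketch, not a tested fix: check_heap() and the "where" labels are names I made up, the host stub exists only so the file builds off-target, and I am assuming your build defines ESP_PLATFORM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#ifdef ESP_PLATFORM
/* On target: use the real ESP-IDF heap checker. */
#include "esp_heap_caps.h"
#else
/* Host-only stand-in (assumption) so the sketch compiles off-target.
 * It always reports a healthy heap. */
static bool heap_caps_check_integrity_all(bool print_errors) {
  (void) print_errors;
  return true;
}
#endif

/* Hypothetical helper: walk every heap and log where damage was first seen.
 * Returns true when the heaps look intact. */
static bool check_heap(const char *where) {
  assert(where != NULL);
  bool ok = heap_caps_check_integrity_all(true /* print_errors */);
  if (!ok) {
    fprintf(stderr, "heap corrupted, first seen at: %s\n", where);
  }
  return ok;
}
```

Calling check_heap("before lock") and check_heap("after lock") around the code that drives the lock, and again at the top of the timer callback, should narrow down when the damage happens. If it always shows up right after the lock closes, look at whatever runs in that window (ISR, driver, buffer sizes); if it shows up at random places, the power supply is the more likely suspect.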