Mgos_si_7021 causing core-dumps

klimbot · August 22, 2020, 3:05am

I’ve been spending some time recently trying to track down the source of core-dumps I’ve been getting on my project, and as I resolve some of the issues I’m starting to notice a lot of these:

Dump contains FreeRTOS task info
Loaded core dump from last snippet in  /core
0x4010093e in mgos_si7021_read (sensor=0xf2)
    at /home/andrew/Code/mongoose/deps/si7021-i2c/src/mgos_si7021.c:82
82        if (!sensor || !sensor->i2c) {
#0  0x4010093e in mgos_si7021_read (sensor=0xf2)
    at /home/andrew/Code/mongoose/deps/si7021-i2c/src/mgos_si7021.c:82
---Type <return> to continue, or q <return> to quit---
#1  0x40100b5c in mgos_si7021_getTemperature (sensor=0xf2)
    at /home/andrew/Code/mongoose/deps/si7021-i2c/src/mgos_si7021.c:149

Link to si7021 lib line 82

I’ve broken down the if statement a little and found the source of the error seems to be:
if (!sensor->i2c)

Any thoughts?

klimbot · August 22, 2020, 10:35am

I’ve changed this:

if (!sensor || !sensor->i2c) {
  return false;
}

to this:

if (!sensor)
{
  return false;
}
if (sensor->i2c == NULL)
{
  return false;
}
if (!(sensor->i2c))
{
  return false;
}

So far so good, will see how long it takes to reboot

klimbot · August 23, 2020, 12:37am

Hmm few more core dumps…

Got 10x dumps in succession with this error:

0x400ff38e in mgos_si7021_read (sensor=0xf2) at /home/andrew/Code/goldilocks-mongoose/deps/si7021-i2c/src/mgos_si7021.c:90
90        double start = mg_time();
#0  0x400ff38e in mgos_si7021_read (sensor=0xf2) at /home/andrew/Code/goldilocks-mongoose/deps/si7021-i2c/src/mgos_si7021.c:90

And then one that failed at:

if (sensor->i2c == NULL)
{
  return false;
}

scaprile · August 23, 2020, 9:43pm

That doesn’t seem to make sense. The way the original is written, the compiled code will not dereference sensor if it is NULL. That code will return false if either sensor is NULL or sensor->i2c is.
What do you mean by fail? Core dump/reboot? The only way for that to happen at or after that check is if sensor is non-NULL but points somewhere that does not actually hold what it is supposed to hold. Any piece of code that writes to memory (for example the next line, sensor->stats.read++) will then trash memory somewhere, and sooner or later that will have unintended effects somewhere (else).
Some code of yours seems to be trashing memory, writing where it shouldn’t. Are you using 2.17 or anything newer? There’s someone else having similar issues, though with wifi. I always blame the user, but…
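
To illustrate the point about the original check, here is a minimal, self-contained sketch. The struct and function names are hypothetical stand-ins, not the library's real definitions. It shows that `||` never reads sensor->i2c when sensor is NULL, and that a garbage value such as the 0xf2 from the core dump still gets past a NULL test, so the check cannot protect against a corrupted pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the library's sensor struct. */
struct fake_sensor {
  void *i2c;
};

/* Same shape as the check on line 82 of mgos_si7021.c. */
bool sensor_usable(const struct fake_sensor *sensor) {
  /* `||` short-circuits: sensor->i2c is only read when sensor != NULL. */
  if (!sensor || !sensor->i2c) {
    return false;
  }
  return true;
}

/* Reports whether a raw address would get past the NULL test.
 * The address is only compared, never dereferenced. */
bool passes_null_test(uintptr_t raw_address) {
  return (const void *)raw_address != NULL;
}

void short_circuit_demo(void) {
  int bus = 0; /* any valid object works as a fake bus handle */
  struct fake_sensor good = { .i2c = &bus };
  struct fake_sensor no_bus = { .i2c = NULL };

  assert(!sensor_usable(NULL));    /* NULL sensor: i2c is never read */
  assert(!sensor_usable(&no_bus)); /* valid sensor, NULL bus */
  assert(sensor_usable(&good));    /* valid sensor, valid bus */
  assert(passes_null_test(0xf2));  /* garbage pointer is not NULL */
}
```

Splitting the condition into several if statements, as tried above, gives the same result: with sensor=0xf2 the first test passes and the crash happens on the first real dereference.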

klimbot · August 24, 2020, 12:28am

Based on my past experience resolving errors I’m quite certain it’s user error

I’m seeing this same error in 2.17.0 as well as 2.18.0.

I get this, which is why I declare the parameters at a global level as static. Am I wrong to assume this means they won’t ever be trashed?

My program is getting pretty long, so I’ll try to summarise the main bits:

#include "mgos_i2c.h"
#include "mgos_si7021.h"

static struct mgos_i2c *i2c;
static struct mgos_si7021 *s_si7021;

static void read_temp(void *user_data)
{
  (void)user_data;

  double temp_offset = 1.55;
  current_inside_temp = temp_offset + mgos_si7021_getTemperature(s_si7021);
  current_inside_humidity = mgos_si7021_getHumidity(s_si7021);
}

enum mgos_app_init_result mgos_app_init(void)
{
  // Setup i2c bus and temp sensor
  i2c = mgos_i2c_get_global();
  s_si7021 = mgos_si7021_create(i2c, 0x40); // Default I2C address

  mgos_set_timer(15000, MGOS_TIMER_REPEAT, read_temp, NULL);                 // every fifteen seconds

  return MGOS_APP_INIT_SUCCESS;
}

The program will run for hours getting a new temp every 15 seconds then all of a sudden core-dump with the following error:

Dump contains FreeRTOS task info
Loaded core dump from last snippet in  /core
0x4010093e in mgos_si7021_read (sensor=0xf2)
    at /home/andrew/Code/mongoose/deps/si7021-i2c/src/mgos_si7021.c:82
82        if (!sensor || !sensor->i2c) {
#0  0x4010093e in mgos_si7021_read (sensor=0xf2)
    at /home/andrew/Code/mongoose/deps/si7021-i2c/src/mgos_si7021.c:82
---Type <return> to continue, or q <return> to quit---
#1  0x40100b5c in mgos_si7021_getTemperature (sensor=0xf2)
    at /home/andrew/Code/mongoose/deps/si7021-i2c/src/mgos_si7021.c:149

No idea why the var would be trashed after a number of hours. Overnight I had 2 core dumps on this, and now it’s been up for 10+ hours.

I was thinking that my program eventually runs out of memory and it’s just pointing there? Not sure how this would happen, I’m logging mgos_get_free_heap_size and mgos_get_min_free_heap_size regularly and don’t see them meet.

I would have expected that if I had an overflow somewhere in my program writing to memory it shouldn’t then I would have found out sooner than 10h+, and then would have seen a guru meditation error or something.

I’m thinking now of a way to hard code the struct values in the lib and see if that points to something else going on.

klimbot · August 24, 2020, 3:27am

Scratch that - I’ve just made a couple of changes and actually seen the error in my terminal. It’s a guru meditation error

scaprile · August 24, 2020, 2:17pm

Unless there is a PMU setup or some runtime pointer checking, an incorrectly initialized pointer or incorrect dimensions for a buffer can trash someone else’s memory. Usually, a C environment does not have runtime checking.

One possible reason.
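
As a rough illustration of how a wrongly sized copy can trash a neighbouring global, and one common way to notice it, here is a minimal sketch using a guard word. Everything in it (the struct layout, the names, the guard value) is hypothetical and not taken from the project or from Mongoose OS. Note that declaring the variable static only limits its visibility to the file; it still lives in ordinary RAM that any stray write can reach.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUARD_VALUE 0xC0FFEE11u /* arbitrary known pattern */

/* Hypothetical layout: a small buffer that sits right before a pointer. */
struct app_state {
  char name[8];
  uint32_t guard; /* changes if a write runs past name[] */
  void *sensor;   /* stands in for the global sensor handle */
};

static struct app_state g_state; /* static: file-local, not write-protected */

void state_init(void *sensor) {
  memset(&g_state, 0, sizeof g_state);
  g_state.guard = GUARD_VALUE;
  g_state.sensor = sensor;
}

bool state_intact(void) {
  return g_state.guard == GUARD_VALUE;
}

/* Simulates a copy whose length was computed for a bigger buffer than
 * name[]. The bytes stay inside g_state, so the call itself is well
 * defined, but anything beyond 8 bytes spills into guard (and sensor). */
void buggy_copy(const char *src, size_t len) {
  if (len > sizeof g_state) {
    len = sizeof g_state;
  }
  memcpy(&g_state, src, len);
}

void guard_demo(void) {
  int fake_sensor = 0;
  state_init(&fake_sensor);
  assert(state_intact());

  buggy_copy("ABCDEFG", 8); /* fits in name[] exactly */
  assert(state_intact());

  buggy_copy("AAAAAAAAAAAA", 12); /* 4 bytes too many */
  assert(!state_intact());        /* guard word was overwritten */
}
```

Checking a guard like this before each sensor read would not fix the bug, but it could turn a crash hours later into an early log message that narrows down when the overwrite happens.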

klimbot · August 24, 2020, 11:50pm

Thanks for the response @scaprile

I’ve been looking deeper in to the values I’ve been logging and am now even more confused.

I’ve been logging mgos_get_free_heap_size() (blue) and mgos_get_min_free_heap_size() (green). I assumed I had solved my memory leak issue when I started seeing the blue line look mostly flat, since in my head I expected to see the current free heap size tick down as periodic functions ran and then back up when the functions finished and memory was reclaimed. I would see the green line gradually tick down over time, then eventually spike signalling to me the device had rebooted. I never really looked deeply in to why that was.

Looking in to the actual functions mgos_get_free_heap_size() and mgos_get_min_free_heap_size() I see my understanding of what min free heap meant was incorrect - mgos_get_min_free_heap_size() returns xPortGetMinimumEverFreeHeapSize(), which reports the “lowest amount of free heap space that has existed system the FreeRTOS application booted”.

So I’m confused because I don’t understand how my free heap can seem to return back to the same number over time, but the minimum ever keeps ticking down till I run out of space. Only thing I can think of is around how the remaining memory is fragmented, but the FreeRTOS doc says that no function will tell that, so then why is the minimum ever heap size reducing over time?

scaprile · August 25, 2020, 2:36pm

The amount of available heap memory might change between sampling points and you wouldn’t notice. Since you run every x time or at point x in a loop, you only see what persists between two sample points.
A decrease in the minimum ever is an indicator that tasks are collectively requesting more memory.
An increase I can only explain as a reboot, provided I read “system” as “since” in your quoted sentence.
I may be missing something, though, since I’m not fond of dynamic allocation in embedded systems.
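
The sampling effect can be shown with a toy model. This is only a sketch with made-up names and numbers; it does not use the FreeRTOS or Mongoose OS heap functions, it just keeps a free-bytes counter and a low-water mark the way a "minimum ever free" statistic would.

```c
#include <assert.h>
#include <stddef.h>

/* Toy heap accounting (not FreeRTOS): free bytes plus a low-water mark. */
struct toy_heap {
  size_t free_now;
  size_t min_ever;
};

void toy_init(struct toy_heap *h, size_t total) {
  h->free_now = total;
  h->min_ever = total;
}

/* Assumes n <= h->free_now; a real allocator would fail instead. */
void toy_alloc(struct toy_heap *h, size_t n) {
  h->free_now -= n;
  if (h->free_now < h->min_ever) {
    h->min_ever = h->free_now; /* low-water mark only ever goes down */
  }
}

void toy_free(struct toy_heap *h, size_t n) {
  h->free_now += n;
}

/* One logging interval: a burst of memory is taken and released before
 * the logger runs. Returns the free heap the periodic logger would see. */
size_t toy_tick(struct toy_heap *h, size_t burst) {
  toy_alloc(h, burst);
  toy_free(h, burst);
  return h->free_now;
}

void sampling_demo(void) {
  struct toy_heap h;
  toy_init(&h, 100000);

  size_t first = toy_tick(&h, 1000);
  size_t second = toy_tick(&h, 5000); /* a bigger transient burst */

  assert(first == 100000);      /* sampled value looks flat */
  assert(second == 100000);
  assert(h.min_ever == 95000);  /* low-water mark still dropped */
}
```

In this model the logged free heap never moves, yet the minimum drops whenever a burst is larger than any seen before, which matches a flat blue line with a slowly falling green one.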

klimbot · August 26, 2020, 12:02am

Of course… makes perfect sense! I’m logging to the terminal every second, but that’s probably not even enough resolution to catch the big dips.

I think I’ve got a workaround, though it doesn’t actually solve the underlying problem - I’ve changed from a global mgos_si7021 instance to creating and destroying one on each read.

static void read_temp(void *user_data)
{
  (void)user_data;

  double temp_offset = mgos_sys_config_get_goldilocks_temperature_offset();
  
  struct mgos_si7021 *s_si7021_local = mgos_si7021_create(mgos_i2c_get_global(), 0x40);

  current_inside_temp = temp_offset + mgos_si7021_getTemperature(s_si7021_local);
  current_inside_humidity = mgos_si7021_getHumidity(s_si7021_local);
  
  mgos_si7021_destroy(&s_si7021_local);
}

Been running for 15h now rock solid.

So strange why the global instance was getting corrupted since I’m seeing the same free_ram numbers after 15h that I was seeing when the program started. No other changes.

scaprile · August 26, 2020, 2:55pm

Perhaps the problem is in some other function writing out of bounds and trashing the globals this function relied upon…