WDT: Who calls 'mgos_wdt_feed' (ESP32)

Rolf · June 24, 2021, 1:53pm

Obviously, I do not understand the WDT functionality correctly:

First, in my application I call ‘mgos_wdt_set_timeout()’ followed by ‘mgos_wdt_enable()’.
Then, I call ‘mgos_wdt_feed()’ repeatedly.

Then - just for test - I stop calling ‘mgos_wdt_feed()’ after some seconds.
And: Nothing happens! No reset, nothing!

Who calls ‘mgos_wdt_feed()’ except my own application code?

Rolf · June 24, 2021, 2:03pm

By the way: I set ‘mgos_wdt_set_timeout(6)’. Is 6 seconds outside some permitted range?

scaprile · June 24, 2021, 3:10pm

(long intro for newcomers: at the end)
The ‘System’ module is documented here, as you probably already know.
Your understanding seems correct to me, however, the IDF docs state:

If a task does not reset within the TWDT timeout period, a warning will be printed with information about which tasks failed to reset the TWDT in time and which tasks are currently running

The implementation for the ESP32 is here and as we can see, it calls the IDF functions, though I fail to see how and where it calls the referred esp_task_wdt_init() function.
There is this log entry:

[Jun 24 11:57:13.967] mgos_sys_config.c:374   WDT: 30 seconds

which suggests mOS is using the WDT for its own purposes.
Unfortunately I don’t have further time to play along with this now; I wrote a small piece and confirmed what you are experiencing. Perhaps @nliviu can shed some light here.
In the meantime, perhaps you’d like to play with the IDF functions and comment. Regards.

#include "mgos.h"

enum mgos_app_init_result mgos_app_init(void) {

    mgos_wdt_set_timeout(2);  // request a 2-second timeout
    mgos_wdt_enable();
    // mgos_wdt_feed();       // optional first kick
    return MGOS_APP_INIT_SUCCESS;
}

… and now the intro:
Generically speaking, if you run on bare metal of course you “kick the dog” as often as needed (unless it is a windowed watchdog, in which case you kick it within some interval). If you have an OS, which is the case here, usually the OS handles the hardware watchdog and provides an API for you. Your app initializes a “virtual watchdog” and complies with the API to signal the OS it is alive (and kicking). In the case of the ESP32, the IDF provides its own interface here.
In mOS, there is an extra layer providing a common interface for the various supported microcontrollers.
Some implementations actually require a first kick for the WDT to become active, so simply setting it up and letting it expire won’t work; you need to kick it first and then stop doing so. (The developers here seem to care about animals and ‘feed’ the dog instead of ‘kicking’ it, but you get the picture.) That doesn’t seem to be the case here.
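
To make the "windowed watchdog" from the intro concrete, here is a minimal host-side sketch in plain C. It is a model only, not the ESP32 or mOS implementation; the type `wwdt_t`, the functions `wwdt_init`, `wwdt_kick` and `wwdt_expired`, and the millisecond values are all hypothetical names chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Host-side model of a windowed watchdog; all names are hypothetical. */
typedef struct {
  uint32_t window_open_ms;  /* earliest allowed kick, counted from the last kick */
  uint32_t window_close_ms; /* latest allowed kick, counted from the last kick */
  uint32_t last_kick_ms;    /* time of the last accepted kick */
  bool tripped;             /* latched once the dog has bitten */
} wwdt_t;

void wwdt_init(wwdt_t *w, uint32_t open_ms, uint32_t close_ms,
               uint32_t now_ms) {
  w->window_open_ms = open_ms;
  w->window_close_ms = close_ms;
  w->last_kick_ms = now_ms;
  w->tripped = false;
}

/* Polled periodically: trips once the window has closed without a kick. */
bool wwdt_expired(wwdt_t *w, uint32_t now_ms) {
  if (now_ms - w->last_kick_ms > w->window_close_ms) w->tripped = true;
  return w->tripped;
}

/* A kick is accepted only inside the window; too early or too late trips. */
bool wwdt_kick(wwdt_t *w, uint32_t now_ms) {
  uint32_t elapsed = now_ms - w->last_kick_ms;
  if (w->tripped || elapsed < w->window_open_ms ||
      elapsed > w->window_close_ms) {
    w->tripped = true;
    return false;
  }
  w->last_kick_ms = now_ms;
  return true;
}
```

A plain (non-windowed) watchdog is the special case where `window_open_ms` is 0, so only a late kick trips it.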

Rolf · June 24, 2021, 7:34pm

Thanks a lot for your verification, Scaprile!
I just loaded the same project onto an ESP8266 (I made some provisions so that it compiles for the ESP8266, too). The result is the same: the watchdog never bites, with or without a first feed/kick (or several)!
I hope, @nliviu can help …

Rolf

klimbot · June 25, 2021, 1:01am

Not sure if it helps, but the only way I’ve accidentally tripped the WDT was by executing time-consuming calculations in a function block without a delay. Adding a delay fed the WDT.

IIRC there was something about mOS feeding the WDT when a user function ends, but I can’t easily find the docs I remember reading.

scaprile · June 25, 2021, 1:56am

I’ve seen calls to mgos_wdt_feed() in the core dump loop, so mOS may be handling the WDT on its own; it would then not trip because no user function is running (this is an event-driven framework). Sounds possible.

nliviu · June 25, 2021, 11:12am

The user code does not need to call mgos_wdt_enable or mgos_wdt_set_timeout. The Mongoose OS’s init code takes care of that (it sets the timeout to the value specified by sys.wdt_timeout).

AFAIK, the dog is fed by mongoose_poll in the main task.

When doing some lengthy processing or running a tight loop that takes more than sys.wdt_timeout seconds, the WDT exception will be triggered.
To avoid that, one can add a call to mgos_wdt_feed or mongoose_poll.

Simple example:

#include "mgos.h"

static void timer_cb(void *arg) {
  static bool s_tick_tock = false;
  LOG(LL_INFO, ("%s uptime: %.2lf, heap free/min_free: %zu/%zu", __FUNCTION__,
                mgos_uptime(), mgos_get_free_heap_size(),
                mgos_get_min_free_heap_size()));
  s_tick_tock = !s_tick_tock;
  (void) arg;
}

void tight_loop(void) {
  while (true) {
    LOG(LL_INFO, ("%s uptime: %.2lf, heap free/min_free: %zu/%zu", __FUNCTION__,
                  mgos_uptime(), mgos_get_free_heap_size(),
                  mgos_get_min_free_heap_size()));
    mgos_usleep(100000);  // sleep for 100ms, blocking
    // mongoose_poll(0);     // other events will be processed
    // mgos_wdt_feed(); // other events will not be processed
  }
}

void start_tight_loop(void *arg) {
  (void) arg;    // unused
  tight_loop();  // never returns; runs inside the timer callback
}

enum mgos_app_init_result mgos_app_init(void) {
  mgos_set_timer(1000, MGOS_TIMER_REPEAT, timer_cb, NULL);
  mgos_set_timer(10000, 0, start_tight_loop, NULL);
  return MGOS_APP_INIT_SUCCESS;
}

Uncomment one of the lines (mongoose_poll or mgos_wdt_feed) and see how it works. Tested with ESP8266 and ESP32.

scaprile · June 25, 2021, 2:28pm

Good point, thanks!
I guess that in order to check some specific task periodicity or some actions being done within a specific time frame, one has to register a new task watchdog with the IDF.

nliviu · June 25, 2021, 3:17pm

Yes, I think so, because the watchdog seems to be enabled per task in the case of ESP32.

Rolf · June 25, 2021, 5:42pm

Hi all,
Thanks for your interesting inputs and the discussion.
It is clear to me that one needs to prevent timeouts during time-consuming processing by feeding the watchdog. This is a normal programming skill.
But in an event-driven system like mine, I want the watchdog to check for missing events. So I want to hook in a piece of code that lets the watchdog bite in case it is NOT called! I want to monitor calls that go missing unpredictably.
If each task now feeds the watchdog by itself, I will never find my missing events. THIS IS NOT MY UNDERSTANDING OF A WATCHDOG AT ALL!
Scaprile: What do you mean by ‘Register a new task watchdog in the IDF’? How could I do that?

scaprile · June 25, 2021, 7:11pm

The main idea is that in the case a system goes nuts, there should be a way to detect it and reset.
In a single-loop system you usually kick the dog in only one place; if you need to check that interrupts are running periodically, you set flags in them and check those flags before kicking the dog. The idea is that the loop has to complete within some time; if it doesn’t, the CPU might be stuck in a loop somewhere due to a software bug, or noise altering a state variable, or … Modern micros usually jump to a hard fault or similar when something wrong is detected. The first 8-bitters could do all kinds of strange stuff and go all over the place, incrementing the address counter and “executing” whatever was taken as bus contents, as they mostly had no way of knowing something wrong had happened, so this scheme made a lot of sense.
In a system with an OS where you have several tasks, generally each task is a loop, so it makes sense to assign a virtual WD or task WD to each task, though you can still run your own scheme.
On an event-driven system, there are mostly event handlers; they run in response to an event and register another event to be triggered to complete whatever remains to be done. In this case, it makes little sense to think of “loops”. However, we can still apply the same WDT concept here if we choose a suitable event that checks that the other events were triggered before kicking the WD. Since mOS is itself using the main WDT (and probably checking that everything looks OK for us), we need our own WDT.

I didn’t read the whole doc (link in my first response), I guess we can ask the IDF for a virtual watchdog to be kicked by us, don’t know if that has some implications with FreeRTOS and all other tasks, since we are running inside the mOS context, which runs over FreeRTOS, courtesy of the IDF. My guess is we can get a simple handle for a virtual WD (TWD) and use it without a hassle, but haven’t tried.
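
As a rough illustration of the "virtual watchdog per task" idea, the following plain C model keeps one deadline per registered client and reports the first client that failed to reset in time. It is a sketch under stated assumptions, not the IDF's implementation: `twd_t`, `twd_add`, `twd_reset` and `twd_check` are hypothetical names. The real counterparts in the IDF would be its task watchdog calls (for example `esp_task_wdt_add()` and `esp_task_wdt_reset()`), whose interaction with mOS is untested here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TWD_MAX_CLIENTS 4

/* One subscriber of the (modelled) task watchdog; names are hypothetical. */
typedef struct {
  const char *name;       /* label used when reporting a late client */
  uint32_t last_reset_ms; /* time of this client's last reset */
  bool in_use;
} twd_client_t;

typedef struct {
  uint32_t timeout_ms;
  twd_client_t clients[TWD_MAX_CLIENTS];
} twd_t;

void twd_init(twd_t *t, uint32_t timeout_ms) {
  t->timeout_ms = timeout_ms;
  for (size_t i = 0; i < TWD_MAX_CLIENTS; i++) t->clients[i].in_use = false;
}

/* Registers a client; returns its handle, or -1 if the table is full. */
int twd_add(twd_t *t, const char *name, uint32_t now_ms) {
  for (int i = 0; i < TWD_MAX_CLIENTS; i++) {
    if (!t->clients[i].in_use) {
      t->clients[i].name = name;
      t->clients[i].last_reset_ms = now_ms;
      t->clients[i].in_use = true;
      return i;
    }
  }
  return -1;
}

/* Each client resets only its own deadline. */
void twd_reset(twd_t *t, int handle, uint32_t now_ms) {
  if (handle >= 0 && handle < TWD_MAX_CLIENTS && t->clients[handle].in_use) {
    t->clients[handle].last_reset_ms = now_ms;
  }
}

/* Returns the handle of the first client past its deadline, or -1. */
int twd_check(const twd_t *t, uint32_t now_ms) {
  for (int i = 0; i < TWD_MAX_CLIENTS; i++) {
    const twd_client_t *c = &t->clients[i];
    if (c->in_use && now_ms - c->last_reset_ms > t->timeout_ms) return i;
  }
  return -1;
}
```

The point of the model is that one client resetting its own entry does not hide another client that has gone silent.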

Rolf · June 26, 2021, 10:22am

Hi all,

So, up to now, I’ve learned that mOS hijacks the task watchdog for its own use, unfortunately without an additional hook for the application. So far, so good!

Until now, my application is just hooked on the different events/callbacks and one single timer.

Therefore, to solve my watchdog problem, I create a new task in which I call ‘mgos_wdt_set_timeout(…)’ and ‘mgos_wdt_enable()’ and then, inside the task loop, ‘mgos_wdt_feed()’ as long as the system still works properly. Otherwise, I do not ‘feed’ anymore.

But now I encounter the next problem, which leads me to a perhaps very stupid question: Does the ESP8266 not use FreeRTOS? I cannot compile because FreeRtos.h is not found …
Is there no OS underneath, or a different one?

Rolf

nliviu · June 26, 2021, 11:58am

ESP8266 is using the nonos-sdk.

Rolf · June 26, 2021, 1:54pm

Is it the same in the nonos-sdk (as in FreeRTOS for the ESP32), that mOS hijacks the watchdog for its own purpose and without a hook for the application?
If so, this must be classified as a mistake in mOS, because there would not be any chance for the application to detect missing events/calls …!
(Because with the nonos-sdk there is no possibility to create a new task as there is in FreeRTOS.)

scaprile · June 26, 2021, 5:18pm

My short adventure with the ESP8266 is more than 5 years old perhaps. At that time Espressif had two SDK options available: nonos and os. The os version I didn’t even see, it used an RTOS (I’m not fond of them) and I guess it is FreeRTOS but I may just be totally wrong. The nonos version is a yes-OS version, though the OS is some event-driven Espressif’s proprietary scheme that rules everything. In order for the chip to work as a “software WiFi chip” and also execute user application code, they crammed modified versions of lwIP and other interesting stuff and gave us a recipe to use them; those versions usually call proprietary versions of the C system library (e.g.: memory allocation) in order to coexist with how the chip was conceived.
mOS is currently based on the non-os SDK.

I don’t see it as a mistake, you just have to think in a different way.
You may trust mOS, the SDK, or none of them. If you don’t trust anybody then fire a hardware timer or wire an external windowed WDT. If you don’t trust mOS, start a timer at the SDK level. If you trust both, then there is no reason for missing events, so check your app through a periodic mOS timer event.
Unless there is a real hardware memory protection unit in place, there is not much of a difference between asking for a virtual WD and starting a hw timer, both events can be masked by soft errors.
For something missing to be detected, there has to be someone expecting, and in an event-driven paradigm there is no one but the event loop as such. If you need to be vigilant, you need a watcher task or a watcher periodic event. At the time you start your mOS tasks, start this watcher with a repeatable timer; your actions may then set flags for the watcher to check.
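
The watcher described above can be sketched in plain C as a set of "seen" flags that event handlers set and a periodic check that consumes them. This is a minimal model under assumptions: the event IDs `EV_SENSOR` and `EV_NETWORK` and the functions `watcher_mark` and `watcher_all_seen` are hypothetical. In an mOS app, the check would presumably run from a repeating `mgos_set_timer()` callback like the one shown earlier in the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical events the application expects in every watch period. */
enum { EV_SENSOR = 0, EV_NETWORK = 1, EV_COUNT = 2 };

/* One bit per event, set by handlers, consumed by the watcher. */
static uint32_t s_seen_mask = 0;

/* Call from each event handler that the watcher should monitor. */
void watcher_mark(int event_id) {
  if (event_id >= 0 && event_id < EV_COUNT) s_seen_mask |= (1u << event_id);
}

/*
 * Call once per watch period (e.g. from a repeating timer callback).
 * Returns true only if every expected event was marked since the last
 * call, then clears the flags so the next period starts fresh.
 */
bool watcher_all_seen(void) {
  const uint32_t expected = (1u << EV_COUNT) - 1u;
  bool ok = (s_seen_mask & expected) == expected;
  s_seen_mask = 0;
  return ok;
}
```

When `watcher_all_seen()` returns false, the callback can react however the application sees fit, for example by logging the failure and restarting the device. The watch period has to be longer than the slowest expected event interval, otherwise the watcher raises false alarms.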

Rolf · June 27, 2021, 6:16am

Hi Scaprile,

Most of all, I do not trust my own code! Both mOS and SDK are tested thousands of times, but my own code is not!

If you don’t trust mOS, start a timer at the SDK level. If you trust both, then there is no reason for missing events, so check your app through a periodic mOS timer event.

Here, I totally disagree with you! There is a fundamental difference between a real watchdog and a software or even hardware timer: the watchdog is a block of dedicated hardware, directly wired to the reset line of the processor. It is incorruptible and will do its job (namely, to bite) whatever you do, whenever you do not feed it correctly. By contrast, a software/hardware timer needs a piece of software to actively reset the processor. And if you are unlucky (and you ARE unlucky in this case), exactly this piece of software is corrupted and does not work anymore. So you will never find out that the software is not running anymore!

So, my understanding (and my experience, too) is: It is totally impossible to replace a real watchdog by any piece of software!

The only thing that would help is the possibility to hook an application function into the piece of software that feeds the watchdog. This application function must return, let’s say, ‘true’ if everything it checks (and these can be very widespread checks in the application) is OK, and ‘false’ in all other cases. The watchdog system must then make sure to stop feeding the hardware watchdog once my function returns ‘false’. With this functionality, the watchdog bites incorruptibly, either if my function is called and returns false OR (and this is the important thing) if my function is not called anymore at all.
This is how a watchdog must work!
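
A minimal sketch of the hook described above, in plain C. mOS does not offer such a hook according to this thread, so everything here is hypothetical: `feeder_t`, `feeder_tick`, `app_health_check` and `g_app_ok` are illustration names, and the "feed" is only counted rather than sent to real hardware.

```c
#include <stdbool.h>
#include <stddef.h>

/* Application-supplied check: true means "everything I monitor is OK". */
typedef bool (*health_hook_t)(void);

typedef struct {
  health_hook_t hook; /* NULL means no hook is registered */
  bool latched_fail;  /* set once the hook has returned false */
  unsigned feeds;     /* how often the hardware watchdog would have been fed */
} feeder_t;

void feeder_init(feeder_t *f, health_hook_t hook) {
  f->hook = hook;
  f->latched_fail = false;
  f->feeds = 0;
}

/*
 * One pass of the code that feeds the hardware watchdog. Returns true if
 * the dog was fed. After the hook has failed once, feeding stops for good,
 * so the hardware watchdog will eventually bite.
 */
bool feeder_tick(feeder_t *f) {
  if (f->latched_fail) return false;
  if (f->hook != NULL && !f->hook()) {
    f->latched_fail = true;
    return false;
  }
  f->feeds++; /* a real system would feed the hardware watchdog here */
  return true;
}

/* Example hook: the application clears g_app_ok when a check fails. */
bool g_app_ok = true;
bool app_health_check(void) {
  return g_app_ok;
}
```

If `feeder_tick()` itself is never reached, nothing feeds the dog either, which covers the "function is not called anymore at all" case.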

So, I think that hijacking the hardware watchdog in the mOS for just its own purpose (without a hook for the application) is not a good idea!

Rolf

scaprile · June 28, 2021, 2:06pm

First of all, you are not getting a hardware watchdog in this environment; what you will get is a software watchdog (virtual, task). The main responsibility here lies with the OS, not with you. The OS is the one in charge of the hardware and gives you enough control not to break anything. This is not bare metal.
AFAIK you won’t find an OS that lets you do what you want to do, unless it leaves the WDT alone and only manages memory and task switching.
Besides that, the one that took control of the hardware WDT is the IDF, not mOS. mOS is just kicking the virtual WDT that it asked the underlying layer (the IDF, the nonos-SDK, FreeRTOS…) for, in order to check its event-driven loop, which it is responsible to manage.
If you want a hw wdt, you can wire an external one. Otherwise, you can use a hw timer, and if you trust mOS and the SDK you can use a software timer; there is no functional difference here. I personally don’t trust the ESP8266 SDK.
And BTW, hardware watchdogs can and do fail; the WDT that gives the highest probability of achieving its goal is an external windowed watchdog. The clock may be shared with the CPU clock and fail; if it is a different clock it may also fail; runaway code may write to its register; if it is password-protected then the sequence may be mimicked; your own code may fail and still be kicking the dog; buggy code might disable it (as there is no MPU here, runaway code can do whatever it wants to the hardware and no exceptions will be triggered), etc.
This is a chain of trust, you have to choose a layer and below that you will trust whatever is provided to you; otherwise you build everything from scratch. That is why I asked you if you trusted mOS. If you don’t trust your code, then check your code with the tools mOS and the IDF provide you; unfortunately it is still your code, as in a bare metal system it is still your code that enables and kicks the WDT.

Rolf · June 29, 2021, 7:54am

Yes, you are absolutely right: The watchdog is a very delicate piece of firmware and must be handled carefully. I agree with most of your arguments. Just one exception: A good hardware watchdog cannot be stopped by software, which makes it much more reliable than the same functionality in software.
However, I have a solution for the ESP32 (using an additional task just for the watchdog), and for the ESP8266 I will use a timer. It’s second best, but OK!
Thanks for your inputs, Scaprile!

scaprile · June 29, 2021, 2:31pm

We are talking probability = 0 here.
If you can’t stop it by software then you can’t kick it… you can make the kick action as complicated as you like, in order to minimize the probability that an unintended action works on it, but it is still there.
Remember that the best watchdog in the world will (in most cases) be taken over by the OS; if you can’t live with that, roll your own WDT (or work on bare metal).
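
To illustrate what "making the kick action complicated" can look like, here is a small plain C model of a watchdog that only restarts after two specific values are written back to back. It is a sketch, not any particular chip's register interface: the key values `0x55` and `0xAA` and the names `keyed_wdt_t` and `keyed_wdt_write` are assumptions picked for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical two-step key sequence required for a valid kick. */
#define KICK_KEY_1 0x55u
#define KICK_KEY_2 0xAAu

typedef struct {
  uint8_t stage;  /* 0: waiting for KICK_KEY_1, 1: waiting for KICK_KEY_2 */
  unsigned kicks; /* number of completed kick sequences */
} keyed_wdt_t;

void keyed_wdt_init(keyed_wdt_t *w) {
  w->stage = 0;
  w->kicks = 0;
}

/*
 * Models a write to the watchdog's key register. Returns true only when
 * the write completes the sequence KICK_KEY_1, KICK_KEY_2.
 */
bool keyed_wdt_write(keyed_wdt_t *w, uint8_t value) {
  if (w->stage == 0) {
    if (value == KICK_KEY_1) w->stage = 1;
    return false;
  }
  if (value == KICK_KEY_2) {
    w->stage = 0;
    w->kicks++;
    return true;
  }
  /* Wrong second value: a repeated first key keeps the sequence armed. */
  w->stage = (value == KICK_KEY_1) ? 1 : 0;
  return false;
}
```

A stray single write no longer kicks the dog, but runaway code that happens to issue both writes in order still does, which is the residual probability mentioned above.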