How to identify internet issues

Jorge_Jeferson_Ferre · September 11, 2020, 1:25am

Hello all

1. My goal is: Handle internet issues like wifi disconnected, internet not available, and MQTT queue overflow

2. My actions are:
Based on the demo-js sample, there are these events:

load('api_events.js');
Event.on(Event.CLOUD_CONNECTED, function() {
  print("Connected");
}, null);
Event.on(Event.CLOUD_DISCONNECTED, function() {
  print("Disconnected");
}, null);

3. The result I see is:
When the internet is not available (wifi is ok), Event.CLOUD_DISCONNECTED is not triggered; only Event.CLOUD_CONNECTED is triggered, when the internet returns.

4. My expectation & question is:

If there is a way to identify when the internet is down or wifi is not available, I can blink the led as a warning.

How to capture cloud issues like the internet is down, wifi is not available, queue overflow, and other issues?

scaprile · September 11, 2020, 2:20pm

Well, “the Internet” is not an entity you can ask if it is available.
The cloud related events take advantage of actual connection to an MQTT broker or equivalent. If you are using one or more of the available clouds, most (all ?) provide a .isConnected() method you can use to check availability before sending a message, and so avoid MQTT queue overflows which happen because of the way the MQTT library works (it stores your message in the heap and returns, it will be sent later).
You can check your WiFi status at the WiFi library level, either register for an event or call mgos_wifi_get_status(), check here.
Since you know your gateway address at configuration level, you can ping your gateway. Having a known server in the Internet, like Google DNSs, you can ping it. I don’t think there are mOS mechanisms to ping… I guess you’ll have to do it at the lwIP level. Be aware that some environments block pings through their firewalls. In such a case, you can try a connection to a well known server.
However, since a reputable network like Google or Amazon is not expected to be down for a long time, I guess just checking WiFi and cloud states is OK for normal applications.
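A minimal sketch of how the WiFi and cloud checks described above could drive a warning LED. The helper names linkStatus and blinkPeriodMs and the blink periods are hypothetical, and the Mongoose OS wiring in the trailing comment is modeled on the demo-js example, so treat it as an assumption to verify against your library versions. The code avoids closures and var so it should port to mJS with little change.

```javascript
// Hypothetical helper: fold the two states discussed above into one status.
// wifiHasIp: true once the network layer reports an IP address.
// cloudConnected: result of MQTT.isConnected() or the cloud library's equivalent.
function linkStatus(wifiHasIp, cloudConnected) {
  if (!wifiHasIp) {
    return 'wifi_down';
  }
  if (!cloudConnected) {
    return 'cloud_down';
  }
  return 'ok';
}

// Hypothetical mapping from status to LED blink period in ms (0 = no blinking).
function blinkPeriodMs(status) {
  if (status === 'wifi_down') {
    return 200;   // fast blink
  }
  if (status === 'cloud_down') {
    return 1000;  // slow blink
  }
  return 0;
}

// On the device (not executed here) the inputs could come from something like:
//   load('api_net.js'); load('api_mqtt.js');
//   Event.addGroupHandler(Net.EVENT_GRP, function(ev, evdata, arg) {
//     if (ev === Net.STATUS_GOT_IP) wifiHasIp = true;
//     if (ev === Net.STATUS_DISCONNECTED) wifiHasIp = false;
//   }, null);
//   let period = blinkPeriodMs(linkStatus(wifiHasIp, MQTT.isConnected()));
```

Neither input changes while a link silently stops passing traffic, so this only covers a lost WiFi association or a closed broker connection.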

Jorge_Jeferson_Ferre · September 22, 2020, 12:52am

Thank you @scaprile for your suggestion here.

Internet issues are more frequent if you are using mobile internet from a SIM card, for instance.

My device is connected through a wifi router (very near and good signal), using optical fiber to the Internet. But the device may be offline for a second and miss a command. My device is commanded to turn a water pump on and off. There was one time when the device did not receive the command (see image) to turn off and my mother-in-law’s kitchen was flooded.
THAT IS WHAT I WANT TO AVOID

I am working to capture this error from the user interface. When the command doesn’t reach the device, the interface is going to show a user-friendly message, but I need somehow to re-send the command asap.

That said, I tried GCP.isConnected() and MQTT.isConnected(). You can check for yourself. Turn on the wifi hotspot on your cellphone. Connect your device to your cellphone. After some time, GCP.isConnected() and MQTT.isConnected() will return “true”. That is fine.

Then disable your mobile internet (keep the wifi, but without internet access). GCP.isConnected(), MQTT.isConnected() and MQTT.pub() still return “true”. Event.on(Event.CLOUD_DISCONNECTED, function() {}) also doesn’t fire here. After some time, you are going to see “MQTT0 queue overflow”.

Here is what I discovered:

When the device sends data using MQTT.pub() with QoS > 0, after some milliseconds it is possible to capture the “ack event” for that published message like this:

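// lastEventHandler, upTime and led are defined elsewhere in the application.
MQTT.setEventHandler(function(conn, ev, edata) {
  if (ev !== 0) {
    // 204 is the PUBACK event: the broker acknowledged a publish with QoS > 0.
    if (ev === 204) {
      print('MQTT event handler: got', ev, '. Ack for publishing of a message with QoS > 0.');
      lastEventHandler.puback = upTime;  // remember when the last ACK arrived
      GPIO.write(led, 0);                // ACK received, so clear the warning LED
    }
  }
}, null);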

I can treat it as an internet issue when the ACK event does not arrive within 5 seconds. In that case, the device is going to stop sending data until it receives an ACK event again.
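The 5-second rule above can be sketched as a small watchdog that is fed from the publish call and from the ACK handler. All names here (newAckWatch, notePublish, noteAck, isStalled) are hypothetical, times are plain seconds such as the value of Sys.uptime(), and the device wiring in the trailing comment is an assumption, not tested on hardware.

```javascript
// Hypothetical ACK watchdog. "pending" counts publishes with QoS > 0 that have
// not been acknowledged yet; "since" is when the current wait started.
function newAckWatch(timeoutSec) {
  return { pending: 0, since: -1, timeout: timeoutSec };
}

// Call right after a QoS 1 publish was accepted by the MQTT library.
function notePublish(w, now) {
  if (w.pending === 0) {
    w.since = now;  // start timing only when nothing was outstanding
  }
  w.pending = w.pending + 1;
}

// Call from the MQTT event handler when the ACK event (204) arrives.
function noteAck(w, now) {
  if (w.pending > 0) {
    w.pending = w.pending - 1;
  }
  if (w.pending > 0) {
    w.since = now;  // progress was made, restart the wait for the next ACK
  } else {
    w.since = -1;   // nothing outstanding
  }
}

// True when an ACK has been outstanding for longer than the timeout.
function isStalled(w, now) {
  return w.pending > 0 && (now - w.since) > w.timeout;
}

// On the device (not executed here), assuming a global "watch":
//   if (!isStalled(watch, Sys.uptime()) && MQTT.pub(topic, msg, 1)) {
//     notePublish(watch, Sys.uptime());
//   }
//   MQTT.setEventHandler(function(conn, ev, edata) {
//     if (ev === 204) noteAck(watch, Sys.uptime());
//   }, null);
```

While isStalled() returns true the device can skip publishing and blink the LED. This only detects the problem after the timeout has passed, so it limits queue overflow rather than preventing lost commands.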

I really would like a better solution here. I can also consider a workaround from the GCP side.

scaprile · September 22, 2020, 2:15pm

Well, the isConnected() functions cannot do magic nor guess beyond what the underlying technology provides. They have to rely on sending a ping over MQTT or a keepalive over a TCP connection, and that means waiting for the expiration of a timer or something like that. That means they will not detect brief interruptions unless there is a real disconnection; I mean, an MQTT disconnect msg or a TCP RST. If your connection just goes silent for a while, both TCP and MQTT will still work, that is what they were designed for. Both TCP and MQTT with QoS > 0 will try to deliver your message. QoS = 0 will just rely on TCP. MQTT pings in Google are on the order of a minute.
For your problem, QoS=2 has been invented, but neither Google nor Amazon support it. However, you should be able to work with QoS=1, because QoS = 0 just sends and forgets while QoS=1 waits for an ACK.
If the problem you are trying to solve is tightly coupled, perhaps a highly decoupled architecture is not a proper solution. If you do need to be in touch with your equipment at a specific moment in time, you are using an architecture that is not bound by response times and is not intended for realtime interactivity without taking proper contention measures.
With QoS=1 your message may be sent more than once, be ready for that.
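One way to be ready for repeated deliveries is to make the command handler idempotent. The sketch below assumes a hypothetical JSON command format with an increasing "id" and an absolute "pump" state ("on" or "off"); neither comes from GCP or Mongoose OS, and applyCommand is an illustrative name.

```javascript
// Hypothetical device-side state: current pump state and the last command id applied.
function newPumpState() {
  return { on: false, lastId: -1 };
}

// Applies a command such as '{"id": 7, "pump": "off"}'.
// Returns true if the command was applied, false if it was a duplicate or stale.
function applyCommand(state, jsonText) {
  let cmd = JSON.parse(jsonText);
  if (cmd.id <= state.lastId) {
    return false;  // already seen (QoS 1 redelivery) or older than the last one
  }
  state.lastId = cmd.id;
  // The command carries the desired state rather than "toggle",
  // so applying it twice would still be harmless.
  state.on = (cmd.pump === 'on');
  return true;
}
```

Sending the desired state instead of a toggle is the important part; the id check just avoids acting twice and ignores stale commands that arrive late.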

Jorge_Jeferson_Ferre · September 22, 2020, 4:52pm

@scaprile
I have to consider this is not a critical application and the device may lose connection. There are innumerable scenarios that may fail…
Perhaps in the future we have improvements that I can consider. Sending a ping over MQTT would be a good idea, but may increase cost and data consumption.

Thank you anyway for your suggestion here. If you or someone has another suggestion, you are welcome to share.

nliviu · September 22, 2020, 5:52pm

The mqtt library sends pings every mqtt.keep_alive seconds.
MQTT.pub returns a positive integer packet id.

scaprile · September 22, 2020, 8:21pm

IIRC, Amazon may not answer a PINGREQ if polled before 30 seconds have elapsed and Google will definitely charge you for every PINGREQ (and might not answer too frequently, don’t have that number available)

Jorge_Jeferson_Ferre · September 23, 2020, 12:20pm

Is there an event handler or anything else for when the keep_alive sends a ping or the MQTT server replies?

Jorge_Jeferson_Ferre · September 23, 2020, 12:29pm

Here is a great workaround:

StackOverflow: Google IOT: Identify the device is back online after sendCommandtoDevice failure

In case of failing to send the command to a device, the application can catch that error and update the device configuration. Device configuration is persisted in storage by Cloud IoT Core.

This will have other implications that I have to consider, but it is going to reduce the chance for the kitchen to flood again

scaprile · September 23, 2020, 2:42pm

I don’t want to dive into all the details and implications of your project, just tell you that in my book, if you do need to act based on local considerations, you need a local brain.
You seem to have a closed loop (again, I didn’t check it in detail and I won’t) and you are introducing a variable unbounded delay in it. That’s a hell of a control system to stabilize.
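As one possible reading of "a local brain", the device itself can enforce a maximum run time so that a missed "off" command cannot keep the pump running forever. This is only a sketch: MAX_RUN_SEC, newPump, pumpOn, pumpOff and enforceMaxRun are hypothetical names, the 600-second limit is an arbitrary placeholder, and the timer wiring in the trailing comment is an assumption.

```javascript
// Hypothetical limit: longest time the pump may run after an "on" command.
let MAX_RUN_SEC = 600;

function newPump() {
  return { on: false, startedAt: -1 };
}

function pumpOn(p, now) {
  p.on = true;
  p.startedAt = now;
}

function pumpOff(p) {
  p.on = false;
  p.startedAt = -1;
}

// Call periodically. Returns true if it forced the pump off because the
// run time reached the limit, false otherwise.
function enforceMaxRun(p, now, maxRunSec) {
  if (p.on && (now - p.startedAt) >= maxRunSec) {
    pumpOff(p);
    return true;
  }
  return false;
}

// On the device (not executed here), assuming a global "pump" and a "pumpPin":
//   Timer.set(1000, Timer.REPEAT, function() {
//     if (enforceMaxRun(pump, Sys.uptime(), MAX_RUN_SEC)) GPIO.write(pumpPin, 0);
//   }, null);
```

The cloud command then only shortens the run; the worst case is bounded locally, whatever the network does.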

GCP sends the configuration file on every connect/reconnect and on every change. You pay for it.
Configurations: Latest config is retried until delivered (MQTT)
Commands: Retried for QoS 1, but not guaranteed to be delivered
(from this doc)
Whatever you do with GCP, if you process with an app through Pub/Sub, you also have a delay there; Pub/Sub seems to store a bunch of messages and copy batches of them, so you get nothing and suddenly you get tons of messages together.
I wrote a stripped-down analysis of GCP; if you can read Spanish, it is here

Jorge_Jeferson_Ferre · September 23, 2020, 4:03pm

It is a great document about GCP. It needs to be translated into English and shared here!