ESP32 BLE gateway dying every X days

Same, still holding strong

I think I maybe have found something…

So the story is as follows: I have been running the ESP32 for testing via the serial monitor in PlatformIO. In our home network I have a scheduler that restarts my wireless AP once a week (power off, wait, power on). It seems this is “good for the AP behavior”.

Anyway, this happens at 4:30 am on Mondays (today). When it happened I could see that the ESP32 BLE gateway stopped working and never recovered. From the log it looks like it is somehow looping, retrying to reconnect to the Wi-Fi but never succeeding. Closing the serial monitor and restarting it helped; everything restarted from scratch and works OK now.

See the following from the serial monitor log:

client not connected can't pub
client not connected can't pub
Creating BLE buffer
device detected
75B2426C1801
BLErssi
-95
txPower
-59
BLE DISTANCE :
35.51
client not connected can't pub
client not connected can't pub
MQTT connection...
[E][WiFiClient.cpp:232] connect(): connect on fd 58, errno: 118, "Host is unreachable"
failure_number
1151
failed, rc=
-2
Scan end, deinit controller
BT Task running on core 0
MQTT connection...
[E][WiFiClient.cpp:232] connect(): connect on fd 59, errno: 118, "Host is unreachable"
failure_number
1152
failed, rc=
-2
Scan begin
MQTT connection...
[E][WiFiClient.cpp:232] connect(): connect on fd 60, errno: 118, "Host is unreachable"
failure_number
1153
failed, rc=
-2
Creating BLE buffer
device detected
47DEF4224401
BLErssi
-87
txPower
-59
BLE DISTANCE :
18.08
client not connected can't pub
client not connected can't pub

Hello,

Thanks for the extract. This is with v0.9.3beta, isn’t it?

If yes, are you using WiFiManager to enter your Wi-Fi credentials?

Yes, 0.9.3beta and yes, always using wifimanager for the first setup after a new flashing/upload

OK, I will try to reproduce it, but maybe we should open another topic as it is a different issue.

If you prefer, ok, but the symptoms are the same, you experience that it just stops working

Anyway, I tried to reproduce it, in the following way:

  1. Start platformio, and run from serial monitor
  2. Everything is working fine, reporting to mqtt is working, seen in the terminal log
  3. Break the power to the AP for a while (a minute or two)
  4. The terminal log shows that the ESP has lost its Wi-Fi connection and continuously reports that it cannot find the AP:

client not connected can't pub
client not connected can't pub
*WM: [2] [EVENT] WIFI_REASON: 201
*WM: [2] [EVENT] WIFI_REASON: NO_AP_FOUND
*WM: [2] [EVENT] WIFI_REASON: 201
*WM: [2] [EVENT] WIFI_REASON: NO_AP_FOUND
MQTT connection...
[E][WiFiClient.cpp:232] connect(): connect on fd 57, errno: 118, "Host is unreachable"

  5. Turn on power to the AP again. The NO_AP_FOUND messages are no longer shown, so it seems the AP is found again, since only this message is shown repeatedly in the log:

[E][WiFiClient.cpp:232] connect(): connect on fd 57, errno: 118, "Host is unreachable"

  6. No reconnection to the AP happens; you can wait forever.
  7. To recover, restart the serial monitor task; everything is initialised from scratch and it starts working again.

Hope this helps,
Kind regards, Walter
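
For anyone who wants to check a saved serial log for the same pattern, here is a minimal sketch in Python. It only looks for the two message strings quoted in the steps above; the function name `classify`, the state labels and the threshold of 3 are made up for illustration and are not part of OpenMQTTGateway.

```python
def classify(lines, threshold=3):
    """Classify the tail of a serial log using the messages quoted in this thread.

    Returns:
      "ap_missing" - NO_AP_FOUND events are still the most recent Wi-Fi news
      "stuck"      - at least `threshold` "Host is unreachable" errors have been
                     logged since the last NO_AP_FOUND event (or with none at all)
      "ok"         - neither pattern applies
    """
    unreachable_since_no_ap = 0
    seen_no_ap = False
    for line in lines:
        if "NO_AP_FOUND" in line:
            seen_no_ap = True
            unreachable_since_no_ap = 0  # the AP is still being searched for
        elif "Host is unreachable" in line:
            unreachable_since_no_ap += 1
    if unreachable_since_no_ap >= threshold:
        return "stuck"
    if seen_no_ap:
        return "ap_missing"
    return "ok"
```

Fed the lines from step 4 it reports "ap_missing"; once only the "Host is unreachable" line keeps repeating, as in step 5, it reports "stuck". The threshold is arbitrary and would need tuning against a real log.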

Hello,

Thanks for this detailed report.
I suspect that the BLE scan monopolizes the antenna.
Here is a modification to test, to see if it corrects your issue.
https://github.com/1technophile/OpenMQTTGateway/tree/wifi-reconnect-when-scan

This modification prevents the BLE scan from starting while MQTT is disconnected.

Hello,
It did not help. I can see that “MQTT client disconnected no BLE scan” is written to the log, but the Wi-Fi client does not try to look for the AP and reconnect, even when the AP is back.

I’m also having the problem that it freezes after a couple of hours or days, depending on the Scan_duration used.
I’m using an M5Stack.

I have now tried v0.9.3beta with the default 10s Scan_duration. It worked fine for 5 days, then froze again; after manually re-powering the hardware it worked for 5 hours and froze again, so I see no logical pattern.

When I run:
mosquitto_sub -t home/# -v
I get:
home/OpenMQTTGateway/LWT offline
home/OpenMQTTGateway/version 0.9.3beta

I tried:
mosquitto_pub -t home/OpenMQTTGateway/commands/MQTTtoBT/set -m '{"interval":0}'
and then:
mosquitto_pub -t "home/OpenMQTTGateway/commands/MQTTtoSYS/set" -m '{"cmd":"restart"}'

but nothing.
What else can/should I check?
Leaving it connected to a computer over serial for a few days, hoping it will log something useful for diagnosis, is a bit complicated.

I’m using the mijia temp sensors to control the heating in the house, so when this freezes in the middle of the night it means we start freezing literally!!! :slight_smile:

There must be something I can change to make it work, otherwise my family will get quite angry with me :slight_smile:

I will test a different ESP32 board (devkit v1) to check if this is m5-stack related or not…

In the meantime I’m launching the stability tests of the current dev branch. I will keep you updated with the results.

May I suggest you plan a daily restart for such a “freezing” use case :wink:

I’m thinking that maybe my freezes have something to do with the fact that I have around 6 pcs of eq3 eqiva bluetooth heating valves, and OMG seems to be discovering them and somehow getting some data from them…because if I filter out my mijia temp sensors from mqtt messages I still get a bunch of stuff from the eq3:

home/home_presence/OpenMQTTGateway/id cc:b1:1a:1a:41:59
home/home_presence/OpenMQTTGateway/manufacturerdata u
home/home_presence/OpenMQTTGateway/rssi -89
home/home_presence/OpenMQTTGateway/distance 21.5
home/OpenMQTTGateway/BTtoMQTT/CCB11A1A4159/id cc:b1:1a:1a:41:59
home/OpenMQTTGateway/BTtoMQTT/CCB11A1A4159/manufacturerdata u
home/OpenMQTTGateway/BTtoMQTT/CCB11A1A4159/rssi -89
home/OpenMQTTGateway/BTtoMQTT/CCB11A1A4159/distance 21.5
home/home_presence/OpenMQTTGateway/id 4c:65:a8:d9:c6:3d
home/home_presence/OpenMQTTGateway/rssi -81
home/home_presence/OpenMQTTGateway/distance 10.5
home/home_presence/OpenMQTTGateway/id 00:1a:22:0c:74:f1
home/home_presence/OpenMQTTGateway/rssi -85
home/home_presence/OpenMQTTGateway/distance 15.1
home/OpenMQTTGateway/BTtoMQTT/001A220C74F1/id 00:1a:22:0c:74:f1
home/OpenMQTTGateway/BTtoMQTT/001A220C74F1/rssi -85
home/OpenMQTTGateway/BTtoMQTT/001A220C74F1/distance 15.1
home/home_presence/OpenMQTTGateway/id 00:1a:22:0e:0d:d9
home/home_presence/OpenMQTTGateway/rssi -92
home/home_presence/OpenMQTTGateway/distance 27.8
home/OpenMQTTGateway/BTtoMQTT/001A220E0DD9/id 00:1a:22:0e:0d:d9
home/OpenMQTTGateway/BTtoMQTT/001A220E0DD9/rssi -92
home/OpenMQTTGateway/BTtoMQTT/001A220E0DD9/distance 27.8
home/home_presence/OpenMQTTGateway/id 74:63:0b:c5:b9:df
home/home_presence/OpenMQTTGateway/manufacturerdata L
home/home_presence/OpenMQTTGateway/rssi -92
home/home_presence/OpenMQTTGateway/distance 27.8
home/OpenMQTTGateway/BTtoMQTT/74630BC5B9DF/id 74:63:0b:c5:b9:df
home/OpenMQTTGateway/BTtoMQTT/74630BC5B9DF/manufacturerdata L
home/OpenMQTTGateway/BTtoMQTT/74630BC5B9DF/rssi -92
home/OpenMQTTGateway/BTtoMQTT/74630BC5B9DF/distance 27.8
home/home_presence/OpenMQTTGateway/id 42:ee:70:fc:86:0f
home/home_presence/OpenMQTTGateway/manufacturerdata L
home/home_presence/OpenMQTTGateway/rssi -74
home/home_presence/OpenMQTTGateway/txpower 12
home/home_presence/OpenMQTTGateway/distance 5.3
home/OpenMQTTGateway/BTtoMQTT/42EE70FC860F/id 42:ee:70:fc:86:0f
home/OpenMQTTGateway/BTtoMQTT/42EE70FC860F/manufacturerdata L
home/OpenMQTTGateway/BTtoMQTT/42EE70FC860F/rssi -74
home/OpenMQTTGateway/BTtoMQTT/42EE70FC860F/txpower 12
home/OpenMQTTGateway/BTtoMQTT/42EE70FC860F/distance 5.3
home/home_presence/OpenMQTTGateway/id 00:1a:22:0c:76:41
home/home_presence/OpenMQTTGateway/rssi -82
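
To see at a glance which devices the gateway is reporting, the `mosquitto_sub -v` output can be grouped by device id. This is a rough sketch that assumes the `home/OpenMQTTGateway/BTtoMQTT/<id>/<field> <value>` line layout shown in the dump above; the function name `group_by_device` is hypothetical.

```python
from collections import defaultdict

# Topic prefix taken from the dump above; adjust if your base topic differs.
PREFIX = "home/OpenMQTTGateway/BTtoMQTT/"


def group_by_device(lines):
    """Map each device id (MAC without colons) to its last reported fields."""
    devices = defaultdict(dict)
    for line in lines:
        topic, _, value = line.strip().partition(" ")
        if not topic.startswith(PREFIX):
            continue  # skip home_presence and SYStoMQTT topics
        parts = topic[len(PREFIX):].split("/")
        if len(parts) != 2:
            continue  # not a <id>/<field> topic
        device, field = parts
        devices[device][field] = value
    return dict(devices)
```

Sorting the resulting keys makes shared address prefixes easy to spot. For example, three addresses in the dump start with 00:1a:22; whether those are the eq3 valves is a guess and is not confirmed anywhere in this thread.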

Do you think this might contribute to the dying every X days?
If I add the eq3 addresses to a blacklist (is there something like that?), or just add the mijia sensors to the whitelist, maybe it will be better?

One more thing: previously I was using pBLEScan->setActiveScan(false) in order to save power, but then I changed it back to rule out that this was causing the freezes. I see that even with ActiveScan set to true it still freezes.
I would really prefer to save power so that I do not have to change batteries so often (I have 6 sensors + 6 valves around the house).

I just added my mijia sensors to the whitelist to filter out everything else, let’s see if this helps with the freezes.

(after a few hours)

Ok, now it just logs this, without any Bluetooth sensor stuff:

home/OpenMQTTGateway/LWT online
home/OpenMQTTGateway/version 0.9.3beta

home/OpenMQTTGateway/SYStoMQTT/uptime 16800
home/OpenMQTTGateway/SYStoMQTT/freeMem 39140
home/OpenMQTTGateway/SYStoMQTT/rssi -68
home/OpenMQTTGateway/SYStoMQTT/SSID fmar
home/OpenMQTTGateway/SYStoMQTT/ip 192.168.1.18
home/OpenMQTTGateway/SYStoMQTT/mac 80:7D:3A:C8:28:4C
home/OpenMQTTGateway/SYStoMQTT/modules BT
home/OpenMQTTGateway/SYStoMQTT/uptime 16920
home/OpenMQTTGateway/SYStoMQTT/freeMem 39140
home/OpenMQTTGateway/SYStoMQTT/rssi -68
home/OpenMQTTGateway/SYStoMQTT/SSID fmar
home/OpenMQTTGateway/SYStoMQTT/ip 192.168.1.18
home/OpenMQTTGateway/SYStoMQTT/mac 80:7D:3A:C8:28:4C
home/OpenMQTTGateway/SYStoMQTT/modules BT

I noticed that freeMem is much lower than when it started (a few hours before):

home/OpenMQTTGateway/SYStoMQTT/uptime 12240
home/OpenMQTTGateway/SYStoMQTT/freeMem 59680
home/OpenMQTTGateway/SYStoMQTT/rssi -66
home/OpenMQTTGateway/SYStoMQTT/SSID fmar
home/OpenMQTTGateway/SYStoMQTT/ip 192.168.1.18
home/OpenMQTTGateway/SYStoMQTT/mac 80:7D:3A:C8:28:4C
home/OpenMQTTGateway/SYStoMQTT/modules BT
home/OpenMQTTGateway/SYStoMQTT/uptime 12360
home/OpenMQTTGateway/SYStoMQTT/freeMem 58840
home/OpenMQTTGateway/SYStoMQTT/rssi -66
home/OpenMQTTGateway/SYStoMQTT/SSID fmar
home/OpenMQTTGateway/SYStoMQTT/ip 192.168.1.18
home/OpenMQTTGateway/SYStoMQTT/mac 80:7D:3A:C8:28:4C
home/OpenMQTTGateway/SYStoMQTT/modules BT
home/OpenMQTTGateway/SYStoMQTT/uptime 12480
home/OpenMQTTGateway/SYStoMQTT/freeMem 58588
home/OpenMQTTGateway/SYStoMQTT/rssi -66
home/OpenMQTTGateway/SYStoMQTT/SSID fmar
home/OpenMQTTGateway/SYStoMQTT/ip 192.168.1.18
home/OpenMQTTGateway/SYStoMQTT/mac 80:7D:3A:C8:28:4C
home/OpenMQTTGateway/SYStoMQTT/modules BT
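
To put a number on the drop, the uptime/freeMem pairs can be pulled out of the `mosquitto_sub -v` output and turned into an average rate. A minimal sketch, assuming uptime is reported in seconds (the thread does not state the unit) and that each uptime line is followed by its freeMem line, as in the dumps above. The helper names are made up.

```python
def parse_samples(lines):
    """Pair each SYStoMQTT/uptime value with the freeMem value that follows it."""
    samples = []
    uptime = None
    for line in lines:
        topic, _, value = line.strip().partition(" ")
        if topic.endswith("/SYStoMQTT/uptime"):
            uptime = int(value)
        elif topic.endswith("/SYStoMQTT/freeMem") and uptime is not None:
            samples.append((uptime, int(value)))
            uptime = None
    return samples


def leak_rate(samples):
    """Average change in freeMem per unit of uptime (negative means shrinking)."""
    ordered = sorted(samples)  # order by uptime, whatever order the logs were pasted in
    if len(ordered) < 2 or ordered[0][0] == ordered[-1][0]:
        raise ValueError("need at least two samples with different uptimes")
    (t0, m0), (t1, m1) = ordered[0], ordered[-1]
    return (m1 - m0) / (t1 - t0)
```

With the values above, from uptime 12240 (freeMem 59680) to uptime 16800 (freeMem 39140), this gives (39140 - 59680) / (16800 - 12240), roughly -4.5 bytes per uptime unit. The decline is not necessarily linear: freeMem reads 39140 at both 16800 and 16920.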

when I tried:

home/OpenMQTTGateway/commands/MQTTtoBT/set {"interval":0}

it just went:

home/OpenMQTTGateway/LWT offline

and then nothing

What the hell could be wrong with it that it stopped logging BLE data?
At least it has not frozen like it did before I added the sensor addresses to the whitelist…

Any ideas what to try next? I’m running out of ideas, and out of hope that this can be a reliable option to extend the range of the Bluetooth mijia sensors… :frowning:

That’s an interesting lead, thanks for pointing it out.

Instead of that, could you try a restart to see if the scan becomes available again and the memory returns to the same level as at start?

On my side I’m monitoring two ESP32s to see if I can reproduce the same behaviour.

I could connect the charger through a sonoff or similar and automate a restart if the last timestamp on received temp/humi data is older than a few minutes… i.e.: reboot on freeze.
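
The "reboot on freeze" check could look something like this on the controller side. It is only a sketch: the class name `FreezeWatchdog` and the 300 second limit are invented for illustration, and the actual power cycling through the sonoff is left out.

```python
import time


class FreezeWatchdog:
    """Tracks when sensor data was last received and flags a stale gateway."""

    def __init__(self, max_age_s=300):
        self.max_age_s = max_age_s  # "a few minutes"; the value is an assumption
        self.last_seen = None

    def on_message(self, now=None):
        """Call whenever a temp/humi message arrives from the gateway."""
        self.last_seen = time.monotonic() if now is None else now

    def is_frozen(self, now=None):
        """True if the last message is older than max_age_s."""
        now = time.monotonic() if now is None else now
        if self.last_seen is None:
            return False  # nothing received yet; do not power-cycle at startup
        return now - self.last_seen > self.max_age_s
```

The controller would call `on_message()` from its MQTT callback and poll `is_frozen()` every minute or so, toggling the relay when it returns True. The `now` parameter exists only so the logic can be tried without waiting.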

I would rather advise a preventive restart every day, by sending a restart command over MQTT from your controller (a new function in v0.9.3beta).
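
As a rough sketch of what the controller could run once a day: the topic and payload below are the ones quoted earlier in this thread, while the function name and the use of the `mosquitto_pub` command-line client are just one possible way to send them. Building the payload with `json.dumps` avoids the curly-quote problem you get when copying commands from a forum post.

```python
import json

# Topic as quoted earlier in this thread; adjust if your base topic differs.
RESTART_TOPIC = "home/OpenMQTTGateway/commands/MQTTtoSYS/set"


def restart_command(topic=RESTART_TOPIC):
    """Return the mosquitto_pub argument list that asks the gateway to restart."""
    payload = json.dumps({"cmd": "restart"})
    return ["mosquitto_pub", "-t", topic, "-m", payload]
```

The list can be passed to `subprocess.run(restart_command(), check=True)` from a daily scheduled job. Passing the arguments as a list means no shell quoting is involved.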

Does the restart clear the whitelist?

If the whitelist command is not published with a retain flag, yes. If you published it with a retain flag, you should keep the whitelist. This should be added to the docs ;-)

I tried the restart when it froze again, and it seems to work. BT is logging sensor values again and freeMem goes back up to ~50k (from 39k when it freezes).
Any idea what might be causing the BT to stop responding? Are there any known ESP32 BLE lib issues?