Automatic erase of flash when wrong credentials entered first time

Just to continue this subject here:

@ggggh

Currently, if the MQTT credentials entered through wifimanager are wrong at startup, the gateway erases its memory so that the user can re-enter them correctly.
The downside is that sometimes, after a wifi or power outage, the ESP erases its memory and never reconnects.

Sometimes simpler is better. I think we could remove this function, as it causes more trouble than it is worth.
My reasoning is that a user who wants to reflash the ESP can do so quite easily with the ESP flash download tool, PlatformIO or the Arduino IDE.

I started this topic to gather users' opinions on that.

Personally I would suggest this, but I understand it may just fit my use cases better:

  • For wifi connectivity issues: Cycle between attempting to connect to primary wifi settings, secondary wifi settings (a second set of credentials), and ad-hoc wifi - try each of the three every x minutes.

  • For mqtt ip/credential issues: if they're wrong, do nothing. You either need to fix the mqtt broker (in which case the ad-hoc wifi doesn't help), or configure the mqtt settings in the portal. The device is reachable on the internal network at its IP anyway, so why create the ad-hoc wifi when it's easier to just log into the device through the working wifi connection to configure the mqtt settings? (This also allows working on a remote device, where you've VPN'ed into the wifi and cannot reach an ad-hoc wifi.)

I'd still allow resetting values by holding the reset button for x seconds.
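
To make the wifi cycling idea above concrete, here is a minimal sketch of the selection logic only. The names (NetMode, nextMode, selectMode) and the period handling are assumptions for illustration, not existing OpenMQTTGateway code, and the actual connection calls are left out.

```cpp
#include <stdint.h>

// Hypothetical connection modes, in the order they would be tried.
enum class NetMode : uint8_t { PrimaryWifi, SecondaryWifi, AdHocPortal };

// Mode to try after the current one: primary -> secondary -> ad-hoc -> primary.
constexpr NetMode nextMode(NetMode current) {
  return current == NetMode::PrimaryWifi     ? NetMode::SecondaryWifi
         : current == NetMode::SecondaryWifi ? NetMode::AdHocPortal
                                             : NetMode::PrimaryWifi;
}

// Keeps the current mode while it is connected or still within its period
// ("x minutes", passed in as periodMs); otherwise moves on to the next mode.
constexpr NetMode selectMode(NetMode current, bool connected,
                             uint32_t elapsedMs, uint32_t periodMs) {
  return (connected || elapsedMs < periodMs) ? current : nextMode(current);
}
```

The main loop would call selectMode() with the time spent in the current mode and switch connection method whenever the returned mode differs.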

The issue is that, with the approach I described, entering wrong mqtt settings makes the gateway fall into an infinite loop.
The ways to re-enter the mqtt settings would then be the following:

  • Erase the flash with an ESP tool
  • Erase the flash with PIO or the Arduino IDE
  • Hold the TRIGGER_PIN button for a long time, if your board has one

Maybe it is enough, don’t you think?
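
For the button option, a long-press check could look roughly like the sketch below. The 5 second threshold and the function name are assumptions. On the device, nowMs would come from millis() and pressedSinceMs would be the millis() value recorded when the button first read as pressed.

```cpp
#include <stdint.h>

// Assumed hold time before a reset is triggered (the "x seconds" above).
constexpr uint32_t kLongPressMs = 5000u;

// True once the button has been held continuously for kLongPressMs.
// The unsigned subtraction stays correct when the millisecond counter wraps.
constexpr bool isLongPress(bool pressedNow, uint32_t pressedSinceMs,
                           uint32_t nowMs) {
  return pressedNow && (nowMs - pressedSinceMs) >= kLongPressMs;
}
```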

I might be missing something - why does it enter an infinite loop with wrong mqtt settings? Would that stop the device from serving the config portal? If so, can we not make it incrementally back off the mqtt connection attempts for periods of time, allowing you time to reconfig through the portal? For example: if after 10 connection attempts it did not connect, back off for 1 minute and then retry. If 10 more, avoid mqtt connections for 2 minutes. If 10 more, avoid mqtt connections for 3 minutes…
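
As a rough sketch of that incremental back-off (10 failed attempts, then 1 minute, then 2 minutes, and so on): the constant and function names are made up for illustration, and the 10 minute cap is an assumption that is not part of the proposal above.

```cpp
#include <stdint.h>

constexpr uint32_t kAttemptsPerRound = 10u;        // attempts before each pause
constexpr uint32_t kBackoffStepMs = 60u * 1000u;   // pause grows by 1 minute per round
constexpr uint32_t kMaxRounds = 10u;               // assumed cap: never pause longer than 10 minutes

// Number of full rounds of failed attempts so far.
constexpr uint32_t roundsCompleted(uint32_t failedAttempts) {
  return failedAttempts / kAttemptsPerRound;
}

// Pause to apply before the next mqtt attempt, given the number of
// consecutive failures. Non-zero only right after a round completes.
constexpr uint32_t backoffDelayMs(uint32_t failedAttempts) {
  return (failedAttempts == 0u || failedAttempts % kAttemptsPerRound != 0u)
             ? 0u
             : (roundsCompleted(failedAttempts) > kMaxRounds
                    ? kMaxRounds
                    : roundsCompleted(failedAttempts)) * kBackoffStepMs;
}
```

During the pause the gateway would skip mqtt connection attempts, which is the window in which the portal could be used to reconfigure. The failure counter would be reset to 0 on a successful connection.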

We could serve the config portal, but we must keep in mind the case where the broker stops.
Serving the wifi portal may be a security issue if the broker is stopped.

We could serve it only if the gateway has never connected to the broker.
This case corresponds to serving the wifi portal instead of erasing the flash (current parameters).
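
Expressed as code, that rule would be something like the sketch below. The enum, the function and the everConnected flag are hypothetical names. The flag is assumed to be set on the first successful broker connection and kept across reboots.

```cpp
// What to do when an mqtt connection attempt fails.
enum class MqttFailureAction { KeepRetrying, ServePortal };

// Serve the portal only if the broker was never reached with the stored
// settings; after at least one success, a failure is treated as a broker
// outage and the gateway simply keeps retrying.
constexpr MqttFailureAction onMqttFailure(bool everConnected) {
  return everConnected ? MqttFailureAction::KeepRetrying
                       : MqttFailureAction::ServePortal;
}
```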

Unfortunately, serving the config portal after wrong mqtt credentials generates a core dump :frowning:.
It seems to be related to the use of preferences.

For the moment I will remove the automatic reset by default in V0.9.4.

Later on I’m going to study if

may be a better solution for handling the network credentials input. Note that this solution supports having several wifi networks configured.

Why would it become a security issue if the config portal has a password and can only be entered with it? My thinking is, if you bring down the broker, a device trying to connect to it is not in a better state than one cycling through attempting to connect or offering you a password-protected ad-hoc portal.

By the way, just had the disconnect issue again.
There was no wifi drop - just had the broker stop for a few seconds due to a docker update.
I noticed that the 0.94 device dropped into ad-hoc portal config mode with that, while the 0.93 is still running fine.
On accessing the portal, wifi settings need re-entering. Mqtt settings are fine as they were uploaded hardcoded originally.

Interesting, I will do the test to replicate the behaviour.

I don’t think many people change the wifi manager portal password, which is why I think it is not secure.

I hope autoconnect will make it possible to enter the MQTT credentials without erasing the flash.

I have simulated broker stops lasting from a few seconds to several minutes, and each time the gateway reconnected instantly.
Are you able to reproduce it by stopping and restarting the broker?

Test 1

  • Paused (not stopped) the docker container for the mosquitto broker
  • Could see v0.94 react in a few seconds by reconnecting to the wifi (visible on the router’s log) (I assume when it tries to send the next mqtt message and can’t connect)
    It does not throw the portal page, nor does it have any other reassociation activity on the router during the next few minutes.
  • After a few minutes I restarted the docker container, and v0.94 started pushing mqtt messages again

Test 2

  • Stopped the docker container for the mosquitto broker
  • Could see v0.94 react in a few seconds by reconnecting to the wifi (visible on the router’s log) (I assume when it tries to send the next mqtt message and can’t connect)
  • After about 1 minute v0.94 deauths from the wifi (visible on the router’s log) and brings up the ad-hoc wifi and portal page
  • After a few minutes I restarted the docker container
  • v0.94 does not try to reauth on the wifi, and stays on the ad-hoc wifi
  • Power cycled v0.94, but it does not try to reauth on the wifi - throws wifi portal again
  • Entered portal, introduced wifi ssid and password, and all working again.

During both scenarios, none of my other devices (including v0.93) reauthed or disassociated from the wifi network due to the broker being down.

The fact that test 1 and test 2 showed different results may be just random, or may come from how pause vs stop handles network connections to the port - which I’ve found no documentation for.
In any case, I think we can take test 2 as the more realistic one.

Note again I’ve hardcoded mqtt settings when uploading firmware, but not wifi.

Thanks for the details, I will reproduce with those.

Could you try Test 2 with this branch, please:
https://github.com/1technophile/OpenMQTTGateway/tree/remove-auto-erase?files=1

Sorry it took me a bit - had some ArduinoIDE trouble…
Tried now finally and Test 2 behaved like Test 1 - it did not disassociate from the wifi, so good result I believe.
I did notice something when I watched the behaviour on Serial Monitor.
Upon stopping the broker, I get:

23:24:41.728 -> W: MQTT connection...
23:24:41.728 -> W: failure_number_mqtt: 1
23:24:41.728 -> W: failed, rc=-2
23:24:46.713 -> W: disconnection_handling, failed 1 times
23:24:46.713 -> W: Attempt to reinit wifi: 0
23:24:46.748 -> W: ESP32: Forcing to wifi 0
23:24:46.782 -> Guru Meditation Error: Core  1 panic'ed (Cache disabled but cached memory region accessed)
23:24:46.782 -> Core 1 register dump:

It then reboots and is fine, and keeps trying mqtt connections until I brought the broker back up again. It immediately started transmitting then.

This panic and reboot probably explains the re-auth on the wifi which we see in both tests, and which doesn't happen on my v0.93 (which has only pilight enabled). Could the panic be related to BT, or is it rather a v0.94 thing?

Good, we are making progress

Interesting, and not expected there. Could you change the log level to TRACE:

#define LOG_LEVEL LOG_LEVEL_TRACE

I hope the more verbose debug output will give us more details.

No problem, and thanks for helping!

Actually, not much more, except that I also caught another panic, this time after reconnection to the broker. Maybe it's just my ESP32 hardware not doing great, or power issues?
Would the core dump registers be of any use? I guess it might be best if I just try this first on another ESP unit - no sense wasting your time if it's a random hardware issue.

In any case:

Case 1, when broker disconnected:

    00:07:02.004 -> W: MQTT connection...
    00:07:02.038 -> W: failure_number_mqtt: 1
    00:07:02.038 -> W: failed, rc=-2
    00:07:07.027 -> W: disconnection_handling, failed 1 times
    00:07:07.027 -> W: Attempt to reinit wifi: 0
    00:07:07.027 -> W: ESP32: Forcing to wifi 0
    00:07:07.027 -> Guru Meditation Error: Core  1 panic'ed (Cache disabled but cached memory region accessed)
    00:07:07.027 -> Core 1 register dump:
    ....
    ...
    00:07:07.129 -> Rebooting...

Case 2, after connecting broker back again:

    00:08:14.963 -> E: Failed connecting 1st time to mqtt, you should put TRIGGER_PIN to LOW or erase the flash
    00:08:14.963 -> W: MQTT connection...
    00:08:14.963 -> N: Connected to broker
    00:08:14.963 -> T: Subscription OK to the subjects
    00:08:15.408 -> N: Scan begin
    00:08:15.750 -> Guru Meditation Error: Core 1 panic'ed (Cache disabled but cached memory region accessed)
    00:08:15.750 -> Core 1 register dump:

Yep, it would be interesting to try with another board, to confirm or rule out a hardware issue.
Thanks