ESP8266 and Lost WIFI Connection

Something those of you planning to use your ESP8266 units in remote installations might want to be aware of.  I’ve been working with TUAN who developed the MQTT software – now, I’m sure it has nothing to do with his code… but essentially, I’m using his latest software as the basis of a controller.

I’ve added a simple interrupt driven real time clock, refreshed by occasional MQTT message, I control output 0……GPIO0 – and I have a temperature sensor on GPIO2.

All of this works VERY nicely (some new updates from last night you might want to get from the GIT repository) – when the temperature drops below a certain level the output comes on etc, or I can manually turn the output on and off.  I can even store settings in FLASH having added a little section to the area that normally holds WIFI settings – all of this works perfectly.

BUT.. has anyone tried turning off the WIFI for a while…. and then turning it back on? Does your little board reconnect every time, reliably? Because in the real world of remote control that will happen.  I am finding that this is not always the case, that the code sits and tries to connect, maybe even seems to but ultimately fails. If the ESP SDK comes back with “STATION_CONNECT_FAIL” just what exactly should you do about it?

Most of the time, simply disconnecting the ESP8266 board sorts the problem – if not the first time, the second time (and that in itself is a worry) – but that is no good if the board is actually controlling something – you can’t just reboot the board, you need to somehow reboot the WIFI while maintaining control over whatever it was you were doing…. or find another way to ensure that the WIFI reconnects every time.
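
One possible shape for "rebooting the WIFI while maintaining control" is to poll the link from the main loop and let a small non-blocking supervisor decide when to restart the station, so the temperature control never waits on the radio. The sketch below is plain C++ so it can be tried on a PC; `WifiSupervisor`, `LinkStatus`, `Action` and the delay values are hypothetical names chosen for illustration and are not part of the Espressif SDK or Tuan's MQTT code. On the device, `RestartStation` would be the point to disconnect and reconnect the station through whatever calls your SDK version provides.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical link states, loosely modelled on the SDK's station status values.
enum class LinkStatus { Connected, Connecting, ConnectFail };

// What the caller should do to the radio after a poll.
enum class Action { None, RestartStation };

// Decides when to restart the WiFi station, with exponential backoff.
// It never blocks, so the caller's control logic keeps running.
class WifiSupervisor {
public:
    WifiSupervisor(uint32_t baseDelayMs, uint32_t maxDelayMs)
        : base_(baseDelayMs), max_(maxDelayMs), delay_(baseDelayMs) {
        assert(baseDelayMs > 0 && maxDelayMs >= baseDelayMs);
    }

    // Call regularly from the main loop with the current time and link status.
    Action poll(uint32_t nowMs, LinkStatus status) {
        if (status == LinkStatus::Connected) {
            delay_ = base_;          // healthy again: reset the backoff
            waiting_ = false;
            return Action::None;
        }
        if (!waiting_) {             // first bad reading: start the wait
            waiting_ = true;
            since_ = nowMs;
            return Action::None;
        }
        // Unsigned subtraction stays correct when the millisecond counter wraps.
        if (static_cast<uint32_t>(nowMs - since_) < delay_) return Action::None;

        since_ = nowMs;
        delay_ = (delay_ > max_ / 2) ? max_ : delay_ * 2;  // double, capped at max_
        return Action::RestartStation;
    }

    uint32_t currentDelayMs() const { return delay_; }

private:
    uint32_t base_, max_, delay_;
    uint32_t since_ = 0;
    bool waiting_ = false;
};
```

The main loop would call `poll()` on every pass, next to the thermostat logic, and act only when it returns `RestartStation`. The base delay should be longer than a normal association takes, otherwise a connection that is still in progress gets restarted.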

The alternative is to use the board with an Arduino and have the Arduino reset the ESP8266 in the event of communication failure – but that’s really a bit of a cop out.

Thoughts?  (this is for C programmers, we’re not talking about LUA though I’m sure that is also worth testing).

Thanks to input here I’ve asked Tuan if it’s possible to update the code using the new 0.9.5 SDK + patch… and we’ll try again!

59 thoughts on “ESP8266 and Lost WIFI Connection”

  1. Ajith, there are no dumb questions EVER. Period. TBH I think the problem lies in the secure wifi comm.. in case you are using WiFiClientSecure you will come across this problem I guess, that’s where ssl_read() is actually implemented. As far as I know the settimeout() is provided only for http client, not for https, so there seems to be a void over there. See the method here https://github.com/esp8266/Arduino/pull/1312/files.

    Somehow I can feel that there is something wrong with my implementation, but..
    I have a question for Peter -> have you ever tried the method I mentioned, connecting your ESP8266 via 3G/4G hotspot and then testing the MQTT client for stability by turning mobile data/hotspot on and off and seeing if the MQTT client recovers every time? Honestly thinking of shifting to MQTT from REST over https.. :sigh:

    1. Hi Praneet. The subject of reliability has taken up much of my time and hair. Currently, with my own software, should the WIFI be lost, it will after a while set up an alternative access point, reboot and try that – it will go back and forth forever. If the MQTT is lost, it gives it several attempts and then reboots – I’ve not found a way out of that but at least the current solution works. I’ve tried all sorts of weak links, turning MQTT broker on and off, turning WIFI access point on and off. At this point I’m fairly confident it will recover from all situations. Indeed the only thing I’m having trouble with right now is OTA – it seems on talking to Richard who developed RBOOT that the current last 2 SDK releases, 1.51 and 1.52 have broken the OTA – but right now even on 1.50 it is crashing on me. Sigh. I’ve not done anything with https: yet.
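
The recovery strategy described above (switch to an alternative access point and reboot when WIFI stays down, retry MQTT a few times and then reboot) can be written down as a small decision function. This is only a sketch of that description in plain C++, not Peter's actual code; `RecoveryPolicy`, `LinkState`, `Step`, `nextStep` and `nextApIndex` are hypothetical names, and the grace period and attempt count are values the caller would have to choose.

```cpp
#include <cassert>
#include <cstdint>

enum class Step { KeepRunning, RetryMqtt, SwitchApAndReboot, Reboot };

struct RecoveryPolicy {
    uint32_t wifiGraceMs;    // how long WiFi may stay down before switching AP
    int maxMqttAttempts;     // MQTT reconnect attempts allowed before rebooting
};

struct LinkState {
    bool wifiUp;
    bool mqttUp;
    uint32_t wifiDownForMs;  // time since WiFi was last seen up
    int mqttAttempts;        // MQTT reconnect attempts made so far
};

// Picks the next recovery step. WiFi is checked first because MQTT cannot
// work without it.
inline Step nextStep(const RecoveryPolicy& p, const LinkState& s) {
    assert(p.maxMqttAttempts > 0);
    if (!s.wifiUp) {
        return s.wifiDownForMs >= p.wifiGraceMs ? Step::SwitchApAndReboot
                                                : Step::KeepRunning;
    }
    if (!s.mqttUp) {
        return s.mqttAttempts < p.maxMqttAttempts ? Step::RetryMqtt
                                                  : Step::Reboot;
    }
    return Step::KeepRunning;
}

// Index of the access point to try after a switch: alternates 0, 1, 0, 1, ...
inline int nextApIndex(int current) { return current == 0 ? 1 : 0; }
```

Keeping the decision separate from the radio calls makes it easy to test on a desktop which situations end in a reboot and which do not.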

      1. Hi Peter.. many thanks for explaining your setup. I think I am going for MQTT in my new project.. so I will definitely find out the ups and downs of that.. for https, there seems to be no other option but to reboot (I have been doing that all this while, just like after your “several mqtt attempts”)… just that, in https there can be no several attempts, just the one failed transmission and then must reboot. 🙁 🙁

        I have never dealt with the OTA feature. Must try this actually. 😀 😀

  2. A bit late for whatever I am going to say, since I have been playing with this for quite some time now.. but I am hoping inputs are always welcome. 🙂

    Point being that in search of a dependable “recover-from-no-wifi-or-no-internet” solution, I always use one simple way to check if the code can handle these situations. Turn on my phone wifi hotspot and connect my ESP to that.

    This presents two possibilities.. I can turn off/on Mobile Data or Mobile Hotspot at will and randomly, which is basically the same as the No Internet or No WiFi situations. Quickly turning these off/on at random, especially if you know that the ESP is making an http request at that point, will quickly cause the module to undergo a WDT reset. Plus the longer 3G/4G latencies make a good test case as well.

    This phenomenon is especially noticed if the transaction involves SSL or https. I have drilled down the problem to the function ssl_read() deep in the sdk. I am guessing this function never returns if Internet is lost at the time of making the handshake, but I am at a loss as to how to provide for a suitable recovery.

    Right now, before making a call, I arm a timer ISR which will run some function (maybe check for WiFi or Internet) if timeout is reached, and if timeout is not reached and request is completed, the ISR is disarmed. Even then, sometimes, the next time the ssl_read() is invoked, the module will soft wdt reset. Not always. just sometimes.

    Any ideas about this anyone? ANY input will be highly appreciated… really don’t have a solution to this right now…

    1. I don’t have anything of value to add, but I believe
      >> A bit late for whatever I am going to say,…
      is not true as long as this issue isn’t resolved. I, for one, am still in search of that holy grail.
      It is great to see you’ve arrived at a probable culprit. I’m sure you must have tried it, but isn’t there some kind of timeout setting that can be set in a http(s) client? Please ignore this if this question seems dumb. I have no idea about the SDK internals.

  3. BTW, Pete, since our discussion several months ago, I’ve transitioned to C code (using Arduino IDE) and I am finding it to be much more stable (thank you for encouraging me!).
    I’ve written a simple program to work in SoftAP mode first time, accept my router SSID and password via query parameters and store the same in the EEPROM of the ESP. On next restart, the ESP uses the credentials from the EEPROM to connect to my home router in STA mode. It then listens for HTTP requests to turn ON or OFF specific GPIOs. This has been working quite well across a few days. I haven’t been able to test it for longer periods, though.

  4. Hi Abhishek,

    How do you detect that a reset is required? Do you call the reset code from within the while loop waiting for data?
    Isn’t the while loop supposed to wait indefinitely?

    1. I no longer ever use the reset as the WIFI and MQTT always recover. It is there just as a last sanction if connections simply fail and do not reconnect but to my knowledge that has not happened in months.

  5. Hi Zvika,

    Some time back I faced a strange issue where the soft reset performed by ESP.reset() did not work reliably: afterwards the ESP was not able to connect to the same router. So I tied GPIO16 to the RESET pin. I found this in one of the posts on the tech.scargill blog itself. Whenever a reset was required I used the following:

    const int RESET_PIN = 16;      // GPIO16, wired to the RESET pin
    pinMode(RESET_PIN, OUTPUT);
    digitalWrite(RESET_PIN, LOW);  // pull RESET low to force a full reset

    The code above will ensure a complete reset done by the software.
    This has worked for me with 100% results.

    I have seen ESP hanging in software loops where we wait till content is available:
    while (!client.available()) {
    }

    This got resolved by taking care of the same in software.
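
"Taking care of it in software" usually means giving the wait loop a deadline, so a dead connection cannot hold the program forever. The helper below is a minimal sketch of that idea in plain C++; `waitUntil` and its parameters are hypothetical names, not an Arduino or SDK API, and the clock is passed in so the logic can be exercised on a PC.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Waits until ready() returns true or timeoutMs elapses, whichever comes first.
// nowMs stands in for a millisecond clock; idle is called between checks so
// the caller can yield to background tasks.
inline bool waitUntil(const std::function<bool()>& ready,
                      const std::function<uint32_t()>& nowMs,
                      uint32_t timeoutMs,
                      const std::function<void()>& idle = [] {}) {
    assert(ready && nowMs);
    const uint32_t start = nowMs();
    while (!ready()) {
        // Unsigned subtraction stays correct when the clock wraps around.
        if (static_cast<uint32_t>(nowMs() - start) >= timeoutMs) return false;
        idle();
    }
    return true;
}
```

In an Arduino sketch this might be called with a lambda that checks `client.available()`, the core's `millis` as the clock and `yield` as the idle function (untested on hardware here). A `false` return is the signal to close the connection and retry rather than spin.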

    -Abhishek

    1. Thanks Abhishek.

      Seems that the ESP does not regain from lost signal… can simulate it by connecting to a wifi and then cutting the radio signal… never comes back ..

      The only solution i found was issuing:
      ESP.restart();

      1. Just sticking my nose in here – if you have to reboot because WIFI is lost and not coming back – there is something wrong with the software, it is my extensive experience that the WIFI when properly setup is bulletproof. Some time ago I took the nuclear option out of my code as it was simply never necessary to reboot… Something tells me there’s an SDK call to ensure the WIFI does restart… might be worth checking the SDK doc…

        1. I suspect you are right,
          alas, I’m using the Arduino SDK and most of the management is done under the hood …
          Checked a few times: if the WiFi router is stopped, my ESP does not reconnect.

  6. Wow … I thought that I’m doing something wrong… my ESP8266-12 tends to lose control every few hours …
    I was looking for a way to boot it by software and found your interesting discussion.
    I use the Arduino SDK … and am thinking of applying a watchdog in order to boot the system.

    1. Rebooting is not the way. Finding out what is wrong.. Is there a memory leak over time for example.

      1. You’re definitely right.
        but that may be the shortest solution…
        I use Arduino IDE. I suspect it will be hard to find leaks

        1. I have awesome experience with esp12 and arduino ide. My systems have stayed ON for last 15 days and seems to be pretty reliable. Am using this set up for last 7 months.
          Am using GPIO16 for triggering soft resets.
          Also heap memory goes low and keeps recovering.

          1. Yes, Devkit 12 is awesome …
            I do get system hangouts on daily basis.
            could you elaborate more on the soft reset ?

  7. Hi Peter, I read your blog with great interest. You and most others are clearly way ahead of me in your experience and knowledge of programming and the 8266. I just recently got interested and did some playing with the ESP-7 with reasonable success. I decided to go for a couple of ESP-12E modules, and flashed one with nodemcu today. When I explored the ESP-7’s workings I found the module would run for days on end without interruption. The ESP12E I flashed today has had to be restarted at least 4 times. I have a wifi scanner running on a laptop that scans every 2 minutes for connections and I see it drop off. This is with it idle for long periods. I never saw that with the ESP-7. Just wanted to pass that on, and wondered if you had any thoughts on it.

  8. Okay! You mention compiling Lua code. But the process I followed (from a lot of Youtube videos, no less!) did not involve me compiling any code. I just push Lua code (human-readable source code) via USB-Serial into the ESP, name the file init.lua and lo and behold, it gets executed on reset/reboot of the module. So I’m guessing the code I pushed is merely getting interpreted by the firmware. So, correct me if I’m wrong here, I guess the firmware should then be (re-)built using the latest SDK. The firmware I used was the “stable release” 0.9.5_20150318 from here: https://github.com/nodemcu/nodemcu-firmware/releases
    I haven’t thought about building the firmware from scratch yet. Is this what you meant when you said I should try compiling against the latest SDK?

    I’m sorry if some of my questions seem stupid.

    1. Ah, you bought a board preconfigured with Lua…. Yes, latest… but – if you grab the latest code – and put it into the “alternative” setup with the Eclipse editor – you can indeed compile with the latest SDK 1.3.0 which brings it up to date…. I must stress I’ve not done long term tests with this but I did notice that more RAM was available than previously. Yes, that’s what I meant.

  9. Thanks, Pete for the detailed reply.

    I’ve just begun playing with the ESP8266, and hearing what you, a veteran in things ESP8266, if I may say so, have to say about NodeMCU, is disheartening. It is very easy to use … it is a pity.

    Thanks for the programs74.ru link. I shall try it and see if I can get the same level of reliability that you have obtained. I hope it is not difficult to upgrade the official SDK to v1.3.0. The default one bundled with this “Unofficial” SDK right now is v1.2.0_15_07_03.

    My setup has a UPS and continuity of power is not an issue. So it would suffice for my needs to just “work endlessly” 🙂

    1. Oh, don’t let me put you off Ajith, firstly my experience of Lua is not too recent – try compiling against the latest SDK 1.3.0 and make sure you have the latest version. So many issues come from using old code – remember this is all fairly new and the original codes were full of bugs. I see people now writing stuff in Lua without apparent complaint. I just happen to feel safer using C in which I have a LOT of control over what’s going on. Upgrading – no, it’s a snap for all but a small number of programs – and the changes are well documented. You simply close down the environment – rename the c:\espressif\esp8266_sdk directory and dump the new one in. I keep the older directories (renamed) just in case – but I’ve never had to use them. UPS is good. In my case, I have a PI2 on a UPS but putting battery backup on everything else is impractical.. the router for example… though thinking about it I COULD battery back that up… hmm.

  10. Hi Pete,

    I’m using NodeMCU on the ESP 8266 12E and have similar WiFi instability as you describe. I have a Lua web-server running fine in the evening, controlling a few lights via GPIO, and in the morning the next day, I find the WiFi (Station mode) disconnected and no longer responding to HTTP requests.

    Do you think it could be some kind of memory leakage resulting in an “OutOfMemory”, halting the module? In that case, I could keep track of the heap (I remember seeing Lua API for getting the free memory) in a timer, and reset the module programmatically when it reaches a threshold. I’m presently at work and will try this out once at home.

    -Ajith

    1. Hi Ajith

      In the case of Lua – and with no disrespect intended to the writers of that software, it was always my experience that ultimately I would hit a memory leak. Recently there has been more RAM available – which delays things a bit – and I’m sure they’ve made improvements to the software – but of course there is a way to find out – if I remember rightly – there is a command something like node.heap() (don’t hold me to that) which when used with print – reports the heap size – you could set that to run regularly and see if there is any regular drop in heap size. The heap also takes time to recover – a few seconds as I recall. I just noticed this article written by Dave St Aubin – http://internetofhomethings.com/homethings/?p=424 – I could have written this myself as I echo his sentiments – except that, in addition I have chosen to support programming in C in Windows using this… http://programs74.ru/udkew-en.html except that I have upgraded the official SDK to version 1.3.0 and I would say I am ALMOST there… that is, in the absence of power cuts, my boards work endlessly. It is only occasionally that I have problems and I’m working hard to get to the bottom of them – but the issues are nothing like those I used to have with Lua. As for the ESP8266 Arduino environment – I’ve met the author, great guy it’s a great package BUT I can reliably reproduce issues with that and I’ve left issues on his repository. For now I’m sticking with the devil I know.
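
The "run it regularly and see if there is any regular drop" idea can be made a little more systematic. Because the heap dips during activity and takes a few seconds to recover, the sketch below compares the best (highest) free-heap reading of each window with that of the previous window, and only raises a flag when that recovered level falls several windows in a row. It is plain C++ for illustration; `HeapTrend` and its parameters are hypothetical, and the readings would come from whatever heap query your firmware offers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Tracks the highest free-heap reading in each window of samples and reports
// a suspected leak when that level falls for several windows in a row.
class HeapTrend {
public:
    HeapTrend(std::size_t samplesPerWindow, int fallingWindowsToFlag)
        : perWindow_(samplesPerWindow), flagAfter_(fallingWindowsToFlag) {
        assert(samplesPerWindow > 0 && fallingWindowsToFlag > 0);
    }

    // Feed one free-heap reading in bytes. Returns true once a leak is suspected.
    bool addSample(uint32_t freeBytes) {
        if (count_ == 0 || freeBytes > windowMax_) windowMax_ = freeBytes;
        if (++count_ < perWindow_) return suspected_;

        // Window complete: compare its best reading with the previous window's.
        if (havePrev_ && windowMax_ < prevMax_) {
            if (++falling_ >= flagAfter_) suspected_ = true;
        } else {
            falling_ = 0;  // level held or recovered: start counting again
        }
        prevMax_ = windowMax_;
        havePrev_ = true;
        count_ = 0;
        return suspected_;
    }

private:
    std::size_t perWindow_;
    int flagAfter_;
    std::size_t count_ = 0;
    uint32_t windowMax_ = 0;
    uint32_t prevMax_ = 0;
    int falling_ = 0;
    bool havePrev_ = false;
    bool suspected_ = false;
};
```

Fed from a periodic timer, this separates the normal sawtooth of a healthy heap from a slow downward drift; what to do when the flag is raised (log it, reset, or investigate) is left to the application.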

    1. Thanks Bill – I don’t use the Arduino IDE – very happy with the simple C interface in Eclipse – which I’m warming to by the day… I definitely think (in hindsight) that regular checking that the board is not talking to thin air is essential 🙂

  11. Boy, am I glad I found this site. It’s really difficult to find anyone knowledgeable on the 8266.

    I’m having a similar problem but slightly different. Though it sounds like it might be the same thing Hugo described above. I’m using nodeMCU and running a small HTTP server to control a relay. From time to time (no pattern has emerged) the device will stop responding to HTTP requests and pings. When this happens, the device is also missing from my network map (using the Fing app on my Android device). When this happens, I can do one of two things.

    1. Wait. It will eventually fix itself. Not sure exactly how long this takes. Definitely measured in at least minutes. Maybe more.
    2. Scan the network using Fing twice, sometimes three times. On the second or third time, the device will reappear in my network map and work as expected.

    I have tried scanning a network map with Norton Anti-virus (on my PC) but that doesn’t have the same effect as the Fing software does. And, like Hugo, I still have an IP address and the 8266 responds to Lua commands as it should. It’s just not visible on the network.

    Help!

  12. Hi Peter, Thanks for sharing your knowledge and hard work with everyone.
    I have done a setup similar to yours and it sends DHT22 sensor data to my MQTT broker without problems for a few days and then it stops working. How did you solve your problem? I have updated to SDK 1.0 and am still having the same problem.

    1. Hi there Sean… is the “few days” repeatable? Is it the ESP that stops working or the other end? If you are on SDK 1.0 I seriously recommend moving to SDK 1.12 with the sleep patch at the Espressif site.

      1. Yes it happened for the fourth time yesterday. The ESP disconnects from our WiFi router. When I scan the wireless LAN, I can no longer see the ESP’s IP address and our WiFi router also does not show it as connected anymore. I have to manually reset the chip in order to make it re-connect to wifi.
        I will update the SDK and get back to you with the result.

        1. Ok initial thoughts and I could be miles out…. update to latest SDK. Test. Check power – you need 3v3 with at least 300mA available – pref more as power goes up and down as WIFI operating…. and it should not be switched. If using switched – consider 5v switched and 3v3 linear. If using a PC for power – is there any chance PC power is glitching or timing out – ALL my boards have 3v3 linear regulator….

          Hope that helps.

          1. I gave up on this Pete. I was working on Martin’s project. He has combined esphttpd and mqtt together and calls it ESP8266_Relay_Board. His code does not compile with any SDK newer than 1.0.0. I receive a bunch of errors while compiling, and it seems he is not keen to update his project. I just wasted my money by buying his product.
            I have 700mAh 3v3 power and everything is fine but it disconnects after a few days. I think you are right. The problem is the SDK which I cannot update sadly. I have to look for a reliable MQTT+ESPHTTPD project.
            Thanks for your help and sharing your knowledge.

  13. Not sure if my problems are similar to the ones you described, but I observed something strange which I guess can only get solved by Espressif. I’m currently using the web-server code to develop a remote control unit and a switch on the basis of the ESP. I can configure both devices independently so that each runs on its own, acting according to its assigned role. Configuration is stored in Flash 0x3C. I’m configuring the devices from my PC, which of course has a different IP from both devices. After both ESPs have been configured and have got their DHCP IP addresses, the control can send requests to the switch device and the LED attached to it goes off and on as it should. This works pretty fine and stable. If after some time I want to reconfigure one device, I cannot access it from my PC, but requests from the control work fine (the LED continues to switch off and on). Lucky me, I have the switch unit running in STA+AP mode: I change the network and access the switch using the default IP 192.168.4.1 and it answers, and surprisingly it also answers requests on the IP address it got in its client mode.
    After switching the device wifi mode from STA to STA+AP, it answers again in the original network.
    To me this means that as long as the device is frequently used the connection is stable; if not, it just gets mixed up between the different IP stacks it has.

    1. Looked around and found STA mode is confirmed to be unstable and is already explained in this article http://www.esp8266.com/viewtopic.php?f=6&t=1633

      In case one uses code from the webserver example please comment out // wifi_set_opmode(1); it is in the resetTimerCb this avoids module reset to STA after reconfiguration.

      Now we wait for Espressif to fix it.

  14. Pete
    Having some kind of watchdog (hardware or software) with ability to reboot the unit is always something to consider in an embedded system.
    Which means that the system (hw+sw) should be designed to support such transient phases without impacting the security.
    For example, outputs driving hardware should have pull-up/down resistors as needed to guarantee a safe status during the reset, or a mechanism so that the hardware keeps the same value despite the reset (like external D-latches, for example).
    Cheers.

    1. Oh yes I agree, for very occasional failures of course the watchdog is essential. In my Arduino projects I use the 8 second watchdog and I can count on one hand how many times it has actually been used (I use a variation of the normal reset whereby the first trigger of the watchdog calls a routine to log the error in EEPROM then immediately resets the board – so I know the difference between a power off and a watchdog reset). You’re always going to get glitches, power failures etc but in this case, the software is not recovering from WIFI power loss and I’m eager to know if this is a generic problem with the Espressif SDK or a mistake in my own code or the MQTT code. Sadly I suspect that most people just messing with the boards (and they are fairly new after all) will not yet have done such rigorous testing as yet – if anyone has I’m keen to hear their results.
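
The trick of logging the watchdog trip before resetting, so that a watchdog reset can be told apart from a power cycle, comes down to a marker byte that survives the reset. The sketch below models that in plain C++ with an array standing in for EEPROM; `Eeprom`, `logWatchdogTrip`, `readBootCause`, the marker value and the addresses are all hypothetical and are not taken from the code described above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for a small block of EEPROM; on real hardware this would persist
// across resets. Erased EEPROM cells typically read back as 0xFF.
using Eeprom = std::array<uint8_t, 4>;

enum class BootCause { PowerOn, Watchdog };

constexpr uint8_t kWatchdogMarker = 0xA5;  // arbitrary marker value (assumption)
constexpr std::size_t kMarkerAddr = 0;
constexpr std::size_t kCountAddr = 1;

// Called from the watchdog's early-warning handler, just before the reset.
inline void logWatchdogTrip(Eeprom& e) {
    assert(e.size() > kCountAddr);
    e[kMarkerAddr] = kWatchdogMarker;
    if (e[kCountAddr] == 0xFF) e[kCountAddr] = 0;  // first use of an erased cell
    if (e[kCountAddr] < 0xFE) ++e[kCountAddr];     // saturate instead of wrapping
}

// Called once at startup: reports why we booted and clears the marker.
inline BootCause readBootCause(Eeprom& e) {
    if (e[kMarkerAddr] != kWatchdogMarker) return BootCause::PowerOn;
    e[kMarkerAddr] = 0xFF;
    return BootCause::Watchdog;
}

// Number of watchdog trips recorded so far (0 for an erased cell).
inline unsigned watchdogTripCount(const Eeprom& e) {
    return e[kCountAddr] == 0xFF ? 0u : e[kCountAddr];
}
```

Clearing the marker at startup matters: without it, a later power cut would be misreported as another watchdog reset. The saturating counter gives a rough idea of how often the watchdog has actually been needed.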

      1. I think it may have something to do with the Espressif SDK. I’m using NodeMCU to drive a relay. Nothing fancy; just listen to port 80 and match on the address. It works for minutes if I keep calling it, but once I stop using it for a while (like for 2 minutes), it just stops answering.

        Asking for it to print the IP shows it still has one; but I can no longer see it in my router list.

      2. I’m waiting for an ESP-03 to arrive and looking forward to joining you all. Does the Wi-Fi protocol not have a function to test a connection to an access point nor a reset() function when the connection fails? Is this something that can be handled at the protocol level instead of at the hardware level?

      3. Hi Pete. I share your experiences with instability. Apart from messing around with the boards (obviously) I am now trying to set up a permanent sensor network. I’m using the eclipse C setup, Tuan’s library and DS18b20 sensors. I’m logging all temperatures to a raspberry pi running mosquitto and graph it with rrdtool. Everything works smoothly but every once in a while the esp’s just stop and don’t ever reconnect.
        I’m looking into a watchdog solution as well but I’m hoping for a little more stability on these boards in the near future…

        1. When you compile Tuan’s code – are you using the latest 0.9.5 SDK? I’ve now had an ESP-01 board running for several days while I’ve been messing with the Raspberry Pi 2 and it’s not failed. Ok, it’s in a stable environment…

          1. Yep. I’ve been running 6 boards independently for a while. Everything compiled with sdk 0.95 (not completely sure about the patch though, I’ll check). 4 of the sensors have been stable for a week now. 2 failed after 6 days (not exactly at the same time) and needed a reset.
            It has improved with 0.95 but there still are some issues under the surface.
            (I wouldn’t trust on the sensors to run stand-alone in spain just yet :))

            1. All such updates are gratefully received – you would think by now that ESPRESSIF would have all of this cracked. So much riding on the accuracy and reliability of their SDK…

              Of course another way, yet to be implemented but I understand there is work going on, is to have a better interface than the AT interface and put the WIFI and say MQTT onto the ESP8266 while having an Arduino in charge. If the ESP8266 were to fail, the Arduino could simply reset it. Basically if you were to take the current MQTT code on the ESP8266 and add a fully flexible serial interface (right now only serial out is implemented) – you’d have the best of both worlds and the Arduino could handle things like serial LEDS etc whose calls are blocking – while the ESP8266 continued on its way – hopefully using the available RAM in the ESP8266 to buffer any outgoing data.

    1. Thank you – that’s really great. Here’s me thinking no-one was commenting and there are loads in here. Somehow emails from the new blog (which had the old blog copied into it) are ending up in my Spam filter – easily fixed but a little bit of a worry.

    2. Got it – great – thanks so much… looking forward to getting some time to sit down and play with extra port bits.

  15. A few lost comments here methinks, but looking good.
    If you want to send me an address, I’ll post you a 0.1″ headed beasty to play with. I think I owe you a drink!

    1. Yes including my last reply – oh well I was warned that upgrading could have teething problems but it had to happen – WordPress were putting adverts into posts etc… ok try again – my address is Willow Cottage, Wark on Tyne NE48 3LB, UK. Here’s hoping you get this.

  16. Ok I’ve asked the question – it is always possible that the SDK is responsible for reconnect issues… I’m sure Tuan will look at this – I’ve just emailed him.

    1. I like the new blog.

      I too had connection issues and see this post http://bbs.espressif.com/viewtopic.php?f=5&t=154
      I know you’re not a linux guy, but this program https://github.com/pfalcon/esp-open-sdk has made keeping the sdk up to date super simple in linux. Once I updated the SDK, my LUA compile seems much better. My send count is at the highest point without connection loss.

      I’ve installed the updated SDK by pfalcon (SDK 0.9.5 final + patch1) …so far so good.

      1. Hi

        Thanks for the kind words. I THINK I’ve added in the new SDK. It’s hard to tell as, well, nothing happened. I grabbed the latest SDK from the site you mention (but I already knew about this and did this last night). I patched the two files as per their instructions – then over-wrote the files in /espressif/ESP8266_SDK. I don’t know what I was expecting but when I ran, in ECLIPSE the CLEAN and ALL options, the files compiled as they have always done with no visible difference. That’s GOOD as it means no code changed but I’m wondering if there is any extra step I missed out! The code works of course – but if I turn the WIFI off for any length of time and then back on, the MQTT queue has built up and it does not always recover – which as it stands of course makes it useless for a real application. Of interest, my previous work using Arduino, Ethernet card and the little NRF24L01 radio cards in a network, runs in 3 properties night and day including one that I can only access when I’m in Spain… and they have never failed. I was last over in Spain at the end of August and regularly check in to see what the temperature is, control the lighting etc and despite rubbish electricity and hence WIFI failures, it just “works” – that’s what I need to get out of this project before I can deploy it – I’m sure others will similarly need reliability like that. I’m just hoping this can be resolved without having to resort to watchdog timers etc.

  17. I am working in the ECLIPSE SDK using Tuan’s MQTT code… I am assuming we’re using the 0.9.4 SDK – I’ll ask as I’m not confident enough to know what to drop in to update the SDK. Good point worth looking at.

  18. Which SDK version are you using when compiling your firmware? The latest one (0.9.5 + patch) ?

  19. I missed it the first time around… there IS an update a few hours old in the middle somewhere.
    https://github.com/tuanpmt/esp_mqtt – See MQTT – Tuan has mis-spelled my name and it says “Scragill reported”. I’m using this code…. with some additions for timing and on-off etc. What I’m finding is that it’s perfect – spot on – until you disconnect and reconnect WIFI…. can’t test this in seconds – needs time – but it does seem to fail often on reconnects… but I would love to hear experiences of others…. it’s always possible it’s something I’ve done to my code.

  20. That’s strange, I cannot see any esp_mqtt updates in past 24 hours. There are updates in nodemcu but I assume you are not talking Lua code here.

    1. Again strange, I found it by going up from the esp_mqtt git page to Tuan’s page and then clicking on the esp_mqtt link. Lo and behold, an update 9 hours ago.
