Skip to content

mDNS fails every on asynchronous thread - timing / synchro bug? #3263

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
philbowles opened this issue May 19, 2017 · 4 comments
Closed

mDNS fails every on asynchronous thread - timing / synchro bug? #3263

philbowles opened this issue May 19, 2017 · 4 comments

Comments

@philbowles
Copy link

Basic Infos

Hardware

Hardware: Wemos D1 Mini
Core Version: ?2.1.0-rc2?

Description

mDNS fails every on asynchronous thread - timing / synchro bug?

Settings in IDE

Module: NodeMCU 0.9 (ESP12 Module)
(same behaviour with WEMOS D1 Mini)
Flash Size: 4MB/1MB
CPU Frequency: 80Mhz

Upload Using: SERIAL

Sketch

#include <ESP8266WiFi.h>
#include <ESP8266mDNS.h>
#include <Ticker.h>

#define FAILURE_LIMIT 824
int start;
Ticker async;

void WiFiEvent(WiFiEvent_t event) {
    switch(event) {
        case WIFI_EVENT_STAMODE_CONNECTED:
            Serial.printf("T=%d (T0+%d) [WiFi-event] WiFi Connected SSID\n",millis(),millis()-start);
            break;
        case WIFI_EVENT_STAMODE_GOT_IP:
            Serial.printf("T=%d (T0+%d) [WiFi-event] WiFi got IP\n",millis(),millis()-start);
            break;
        case WIFI_EVENT_STAMODE_DISCONNECTED:
            Serial.println("[WiFi-event] WiFi lost connection");
            break;
        default:
            Serial.printf("[WiFi-event] some other code %d\n",event);
            break;
    }
}
//
//    setup
//
void getMDNS(){
  Serial.printf("T=%d (T0+%d) call MDNS\n",millis(),millis()-start);
  WiFi.printDiag(Serial);
  MDNS.begin("IRRELEVANT");
  if(MDNS.queryService("cheztoiiot", "tcp")){
    Serial.println("It Works");
    Serial.println(MDNS.hostname(0));
    Serial.println(MDNS.IP(0));
    Serial.println(MDNS.port(0));    
    Serial.println(MDNS.txt(0,"basename"));
    Serial.println(MDNS.txt(0,"version"));
    }
  else Serial.println("It Fails");   
}

void asyncMDNS(){
  if(WiFi.status() == WL_CONNECTED){
    Serial.printf("T=%d (T0+%d) If THIS is not connected, what does WL_CONNECTED mean?",millis(),millis()-start);
    getMDNS();
    }      
}

void wifiGotIPHandler(const WiFiEventStationModeGotIP& event){
  Serial.printf("T=%d (T0+%d) wifiConnectHandler %s\n",millis(),millis()-start,(WiFi.localIP().toString()).c_str());
//
//  SECTION 1
//
//  async.once_ms(10,asyncMDNS);
}

WiFiEventHandler gotIpEventHandler; //, disconnectedEventHandler;

void setup() {
  start=millis();
  Serial.begin(74880);
  WiFi.onEvent(WiFiEvent);
//
  gotIpEventHandler = WiFi.onStationModeGotIP(wifiGotIPHandler);
//
//  SECTION 2
//
  delay(FAILURE_LIMIT);
  Serial.printf("T=%d (T0+%d) Delay below %d always fails, above always succeeds\n",millis(),millis()-start,FAILURE_LIMIT);
  getMDNS();
}

void loop() {
  }

Debug Messages

 ets Jan  8 2013,rst cause:2, boot mode:(3,7)

load 0x4010f000, len 1384, room 16 
tail 8
chksum 0x2d
csum 0x2d
v09f0c112
~ld
⸮T=359 (T0+126) [WiFi-event] WiFi Connected SSID
T=1057 (T0+824) Delay below 823 always fails, above always succeeds
T=1057 (T0+824) call MDNS
Mode: STA
PHY mode: N
Channel: 6
AP id: 0
Status: 1
Auto connect: 1
SSID (7): LaPique
Passphrase (0): 
BSSID set: 0
MDNS listening
queryService cheztoiiot tcp
Waiting for answers..
It Fails
T=3056 (T0+2823) [WiFi-event] WiFi got IP
T=3056 (T0+2823) wifiConnectHandler 192.168.1.57
MDNS listening

AT first glance it looks like a really BAD way to do it - waiting for some arbitrary period and simply hoping that there is a connection - and I agree. I would never write such bad code except to demonstrate the problem:

The debug output shows that below 824ms, the "obvious" happens and we don't yet have an IP before we call MDNS and it is no surpise that it fails! It's not till T0+2283 that we get one - which seems like too long for me, but maybe MDNS "Waiting for answers" is somehow blocking completion of the autoconnect?

Let's try to increase the delay by 1ms and see what happens:

 ets Jan  8 2013,rst cause:2, boot mode:(3,7)

load 0x4010f000, len 1384, room 16 
tail 8
chksum 0x2d
csum 0x2d
v09f0c112
~ld
⸮T=360 (T0+126) [WiFi-event] WiFi Connected SSID
T=1057 (T0+823) [WiFi-event] WiFi got IP
T=1057 (T0+823) wifiConnectHandler 192.168.1.57
T=1059 (T0+825) Delay below 824 always fails, above always succeeds
T=1061 (T0+827) call MDNS
Mode: STA
PHY mode: N
Channel: 6
AP id: 0
Status: 5
Auto connect: 1
SSID (7): LaPique
Passphrase (0): 
BSSID set: 0
MDNS listening
queryService cheztoiiot tcp
Waiting for answers..
Reading answers RX: REQ, ID:0, Q:0, A:5, NS:0, ADD:0
 11 _cheztoiiot
found matching service: cheztoiiot
 4 _tcp
 5 local
type: 000c rdlength: 40
PTR 40 %Chez ToiioT controller on raspberrypi⸮
type: 0010 rdlength: 1
 0:

type: 0021 rdlength: 20
SRV 11 72 61 73 70 62 65 72 72 79 70 69 
raspberrypi
type: 001c rdlength: 16
Ignoring unsupported type 75
type: 0001 rdlength: 4
All answers parsed, adding to _answers list..
It Works
raspberrypi
192.168.1.4
1880

Success! At the "magic" T+823, we get an IP, and MDNS goes on to do exactly what was expected.
So we now have a way to make our code work: wait until we have an IP of course. I don't like "blocking" code, so let's do it the "proper" way and use the event handling code.

I have commented out "Section 2" and uncommented "Section 1" so we only call MDNS once we are absolutely certain we have an IP address. Then the machine state will be identical to the working ("T+824") situation and MDNS must work!

WRONG!!!

 ets Jan  8 2013,rst cause:2, boot mode:(3,6)

load 0x4010f000, len 1384, room 16 
tail 8
chksum 0x2d
csum 0x2d
v09f0c112
~ld
⸮T=368 (T0+126) [WiFi-event] WiFi Connected SSID
T=1065 (T0+823) [WiFi-event] WiFi got IP
T=1065 (T0+823) wifiConnectHandler 192.168.1.57
T=1076 (T0+834) If THIS is not connected, what does WL_CONNECTED mean?T=1076 (T0+834) call MDNS
Mode: STA
PHY mode: N
Channel: 6
AP id: 0
Status: 5
Auto connect: 1
SSID (7): LaPique
Passphrase (0): 
BSSID set: 0
MDNS listening
queryService cheztoiiot tcp
Waiting for answers..
It Fails
Reading answers RX: REQ, ID:0, Q:0, A:5, NS:0, ADD:0
Not expecting any answers right now, returning

We get the IP at T+823, well before we call MDNS at T0+834, but for some reason, it fails.

If we compare the two Diag prints between "synchronous+delay" success at T0+824 with the failing asychronous call at T0+834, it is clear they are identical so why does the asynchronous one fail?

What is different "behind the scenes"? What is different in the "setup thread" and the "ticker thread"?

I modified the code to put the delay call
in "getMDNS" itself. The synchronous call works as before (why wouldn't it?) but the async version fries my brain!

...
void getMDNS(){
  Serial.printf("T=%d (T0+%d) call MDNS\n",millis(),millis()-start);
  delay(FAILURE_LIMIT);
  Serial.printf("T=%d (T0+%d) call MDNS\n",millis(),millis()-start);
  WiFi.printDiag(Serial);
...

Gives:

⸮T=362 (T0+126) [WiFi-event] WiFi Connected SSID
T=3059 (T0+2823) [WiFi-event] WiFi got IP
T=3059 (T0+2823) wifiConnectHandler 192.168.1.57
T=3069 (T0+2833) If THIS is not connected, what does WL_CONNECTED mean?
T=3069 (T0+2833) call MDNS
T=3069 (T0+2833) call MDNS
Mode: STA
PHY mode: N
Channel: 6
AP id: 0
Status: 5
Auto connect: 1
SSID (7): LaPique
Passphrase (0): 
BSSID set: 0

This is truly bizarre! the total code I call in "setup" now is:

  start=millis();
  Serial.begin(74880);
  WiFi.onEvent(WiFiEvent);
  gotIpEventHandler = WiFi.onStationModeGotIP(wifiGotIPHandler);

So how can it be me that is delaying the "got IP" event? I don't even call asyncMDNS (which outputs "what does WL_CONNECTED mean?" at T+2833) until AFTER the wifiConnectHandler call at T+2823!

It's either a negative time coefficient in the flux capacitor or Serial.print is being "stacked up" or unserialised. Is this the same procfess that makes the the delay(FAILURE_LIMIT) last less than 1ms? I really need help on this.

I can't use a "workaround" - the whole of my project relies on the auto connect / asynchronous "got ip" event combination. I have tried about 10000 difrrent permutations but the MDSN call just will not work if it's not in the "setup" thread, no matter how much I try to delay or synchronise.

Help, please!

On a separate note, I also the project MUST have service TXT records for auto-discovery. I found a fork by mblythe86 which includes them, and it works perfectly - when MDNS itself works! When will this be in the production version?

Thanks,
Phil

@igrr
Copy link
Member

igrr commented May 19, 2017

The problem you are having is mostly because you are trying to call synchronous, blocking functions from asynchronous callbacks. MDNS.queryService is a blocking function, as is delay. Neither will work when called from an asynchronous callback (such as Ticker, WiFi event, etc).

I understand your need to start MDNS from a callback, but the current code is not compatible with that — if you look at the source of MDNSResponder::queryService, you will see a delay(1000); in there. It can be made non-blocking, by introducing functions like beginQueryService/endQueryService, but this is left as an exercise for the reader.

@philbowles
Copy link
Author

Thanks for the quick reply. It's disappointing but - I hear you. I had come close to that conclusion, but was looking for a) solution or b) confirmation. I now have b) (and a headache). It might be helpful if the docs were updated to help relative newcomers - like me - from taking up your time! I seriously appreciate all your work and that of the rest of the team. I'm back to the drawing board with nasty globals and polling loops on the main thread...sigh. Any plans for the service txt records?

@igrr
Copy link
Member

igrr commented May 20, 2017

I've asked a few folks who are familiar with mDNS to review the changes in #3107. So far there haven't been any reviews, and I'm not following mDNS library closely enough to understand the impact of these changes. If the PR works for you, that's good, hopefully you can use it in your work :)

Regarding 'nasty globals and polling loops' — you may consider making the modification to MDNS library, splitting queryService into beginQueryService / endQueryService. I.e. everything before the delay there is begin, everything after it is end. That's definitely less complicated than rewriting the logic in your sketch.

Regarding documentation — thanks, noted. There are a few notes here and there about the functions which can not be called from asynchronous callbacks, but this specific case of 'mdns from WiFi events' isn't covered. Will add a note next time when working on the docs.

@igrr igrr closed this as completed May 20, 2017
@mblythe86
Copy link
Contributor

mblythe86 commented Jun 13, 2017

Regarding 'nasty globals and polling loops' — you may consider making the modification to MDNS library, splitting queryService into beginQueryService / endQueryService.

I have another bug to fix in my PR #3107, so I might throw in a version of queryService() that doesn't have the delay(). I will still, however, block until the UDP mutlicast packets are sent, but I don't think that should be a problem. This leads to the question "Well, how do I get responses?". I've already implemented "setAnswerCallback()" that will invoke the callback function for every matching response. Please note that this callback also happens asynchronously (from the TCP/IP stack, I think) so the same limitations about delay() apply.

FYI, when using my own code (with setAnswerCallback() ) I'm seeing fairly frequent Exception 28 failures...I'm still trying to debug those, so there may be further changes to my pull request as I work out any bugs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants