Skip to content

idf v5.1 listen/accept race condition ends up losing clients #8443

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
bill88t opened this issue Sep 28, 2023 · 10 comments · Fixed by #8940
Closed

idf v5.1 listen/accept race condition ends up losing clients #8443

bill88t opened this issue Sep 28, 2023 · 10 comments · Fixed by #8940
Assignees
Milestone

Comments

@bill88t
Copy link

bill88t commented Sep 28, 2023

CircuitPython version

Adafruit CircuitPython 9.0.0-alpha.1-50-gaa0d7aad83 on 2023-09-28; M5Stack Timer Camera X with ESP32

Code/REPL

https://github.com/bill88t/CircuitPython_FTP_Server, current master 53224fdd20302d4005054011b51a45591f3ac0fd loaded onto the board

code.py:

import wifi
from socketpool import SocketPool
from ftp_server import ftp
from supervisor import reload

if not wifi.radio.connected:
    print("No wifi")
    reload()

pool = SocketPool(wifi.radio)
ftps = ftp(pool, str(wifi.radio.ipv4_address), verbose=True)
try:
    ftps.serve_till_quit()
except KeyboardInterrupt:
    pass
finally:
    ftps.deinit()
reload()

Behavior

When making a full filesystem dump (237 files), PASV fails with the client never connecting to the data socket.

Well what does that have to do with the core?
Well, first of all, it does work on 8.x without a single error or retry, same settings, same everything.
Second, the failure has to do with the sockets and listen.

Debugging the failure led me to the conclusion that if a client connects very quickly after .listen(..) is run, the client may be accepted by the networking stack, but stored nowhere and so .accept times out.

When this bug happens, .accept() will fail by timeout, raising it's regular OSError, as if no client is waiting.
However the client is connected and waiting.
It is not rejected like when the socket is closed or no more connections are permitted.

Description

So this is some sort of race condition where a client connection comes in at just the wrong time to be accepted by the code but not stored anywhere.

FTP relies on making a new socket for every transfer and directory listing. For that reason we end up making ~300 sockets/connections for a full dump of 237 files.

It would probably be possible to make a simplier example, since we only need to spam (yes, around 7 connections / second) and testing if the connection object is not None.

This issue has already received a workaround on the current (ftp) master, so it's not really a world ending bug.
However it's still an issue that could conceivably appear in prod for a single connection, assuming bad enough luck (timing).

FTP just happens to roll a broken dice enough times to make it break almost 100%.

Additional information

If anybody intends on actually going through the ftp code:

line 645: self._data_socket, self._client_pasv = self._pasv_sock.accept() ends up returning None forever while this bug happens.
The timeout will close the socket and tell the client the file transfer failed (message 25).
The client will then re-request a new socket which will work.

@bill88t bill88t added the bug label Sep 28, 2023
@tannewt tannewt added this to the 9.0.0 milestone Sep 29, 2023
@anecdata
Copy link
Member

Tested 8.2.6 vs. Adafruit CircuitPython 9.0.0-alpha.1-64-gf14ca95df4 on 2023-10-06; Adafruit QT Py ESP32-S3 no psram with ESP32S3 with plain TCP sockets. There's definitely a regression related to accept().

@bill88t
Copy link
Author

bill88t commented Oct 24, 2023

I have been away for a bit of time and lost track of recent changes, testing with today's artifacts, the issue has gotten a whole lot worse, now failing repeatedly, many times in a row..

I don't even think I can make a decent workaround for how badly it breaks now.

I will have to investigate in more detail.

@dhalbert
Copy link
Collaborator

dhalbert commented Dec 1, 2023

@jepler may have seen something like this when making a request before the http server was up.

@jepler
Copy link

jepler commented Dec 1, 2023

In the camera app, I create a webserver, then create the pycamera object, and finally enter the webserver's main loop.

I have seen that when I make a request before it has fully initialized the camera, that request can get lost.

@bill88t
Copy link
Author

bill88t commented Dec 1, 2023

Hmm, well, if you were to put some prints when a client connects, you would probably see no client connection for the missing request.

Do keep in mind if you plan to debug it, it's truly random as to when it will fail. For ftp it may do 200 sockets and then fail, or fail on the first one.

@tannewt
Copy link
Member

tannewt commented Feb 14, 2024

@bill88t Please retest with the absolute newest builds. The IDF has been updated since this was filed originally.

@bill88t
Copy link
Author

bill88t commented Feb 15, 2024

@tannewt Oh I'm still experiencing this issue on the daily since I rely on ftp for non-usb boards.

It has become rarer for some reason.
Not sure when, since I've not been tracking master all that well (usually my main board's version is a week old).

My workarounds still are inadequate, and only work 30% of the time.
Still it should be noted, not a single of these errors had occurred on 8.x.

@tannewt
Copy link
Member

tannewt commented Feb 15, 2024

Can you quickly reproduce this issue? It's hard to fix something that cannot be easily and quickly reproduced.

@bill88t
Copy link
Author

bill88t commented Feb 15, 2024

I prepared an example that relies on no libraries and I had even weirder behaviour:

CP code.py:

import wifi; from socketpool import SocketPool; from supervisor import reload
if not wifi.radio.connected:
    print("No wifi, reloading!"); reload()
print("IP: " + str(wifi.radio.ipv4_address) + "\nSetting up wifi for socket receives")
pool = SocketPool(wifi.radio)
rx_buf = bytearray(64)
def reset_rx_buf():
    for i in range(64):
        rx_buf[i] = 0
while True:
    print("Setting up socket...")
    socket = pool.socket(pool.AF_INET, pool.SOCK_STREAM)
    socket.bind(("0.0.0.0", 23))
    socket.listen(2)
    socket.setblocking(True)
    print("Waiting for socket connection... ", end="")
    try:
        connection = socket.accept()
        connection[0].setblocking(True)
        recv_len = connection[0].recv_into(rx_buf, 64)
        res = bytes(rx_buf)[:recv_len].decode("UTF-8")
        if res == "OK1":
            print("Got first packet.")
            socket2 = pool.socket(pool.AF_INET, pool.SOCK_STREAM)
            socket2.bind(("0.0.0.0", 24))
            socket2.listen(2)
            socket2.setblocking(True)
            print("Sending goahead.. ", end="")
            connection[0].send("OK2")
            print("Sent.\nWaiting on 2nd socket.. ", end="")
            connection2 = socket.accept()
            connection2[0].setblocking(True)
            recv_len = connection2[0].recv_into(rx_buf, 64)
            res = bytes(rx_buf)[:recv_len].decode("UTF-8")
            if res == "OK3":
                print("Got 2nd packet.")
            else:
                print("Invalid data on 2: '" + res + "'")
            connection2[0].close()
            reset_rx_buf()
        else:
            print("Invalid data on 1: '" + res + "'")
        reset_rx_buf()
        connection[0].close()
    except OSError:
        print("Timeout!")
    socket.close()

Desktop Python tx.py:

import socket; import sys; import signal; from time import sleep
def signal_handler(sig, frame):
    print("\nExiting...")
    sys.exit(0)
def main():
    signal.signal(signal.SIGINT, signal_handler)
    if len(sys.argv) != 2:
        print("Usage: python script.py <IP>")
        sys.exit(1)
    while True:
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.connect((sys.argv[1], 23))
                s.sendall(b"OK1")
                sleep(0.3)
                data = s.recv(1024).decode("UTF-8")
                if data == "OK2":
                    s.close()
                    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s2:
                        s2.connect((sys.argv[1], 24))
                        s2.sendall(b"OK3")
                        s2.close()
                else:
                    print("Invalid data:", list(data))
                    break
        except socket.error:
            print("Failed to connect.")
        sleep(0.05)
if __name__ == "__main__":
    main()

(compacted down to fit in a reasonable amount of pages)

Running the code.py on a waveshare_esp32s2_pico and tx.py like py tx.py 10.42.0.216, I get some weird results:

CP:

Auto-reload is on. Simply save files over USB to run them or enter REPL to disable.
code.py output:
IP: 10.42.0.216
Setting up wifi for socket receives
Setting up socket...
Waiting for socket connection... Got first packet.
Sending goahead.. Sent.
Waiting on 2nd socket.. Invalid data on 2: 'OK1'
Setting up socket...
Waiting for socket connection...

Desktop:

[bill88t@Thinkpood | ~]> py tx.py 10.42.0.216
Invalid data: []
[bill88t@Thinkpood | ~]>

Which means somehow, send() failed, and on the next loop, the data that should have gone to socket on port 24 went to port 23's socket.
Maybe deinitialized sockets end up "merging" in a sense with existing ones?
That would explain some of the issues.

@tannewt tannewt self-assigned this Feb 16, 2024
@tannewt
Copy link
Member

tannewt commented Feb 16, 2024

So I don't think there is necessarily a bug in CP here. The example code has a couple bugs from its intention I think.

  1. connection2 = socket.accept() is meant to be connection2 = socket2.accept(). That way it gets the correct data.
  2. socket2 is never closed in code.py. socket.bind() currently allows port reuse which makes this leak not obvious. As a consequence, the total number of sockets gets filled up with each loop.

I don't think bind() should allow reuse implicitly anymore. That will make this management clearer.

dhalbert pushed a commit that referenced this issue Feb 17, 2024
Doing it implicitly can lead to mistaken socket leaks and reuse.
It now matches CPython.

Fixes #8443
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants