idf v5.1 listen/accept race condition ends up losing clients
#8443
Comments
Tested |
I have been away for a bit and lost track of recent changes. Testing with today's artifacts, the issue has gotten a whole lot worse: it now fails repeatedly, many times in a row. I don't even think I can make a decent workaround for how badly it breaks now. I will have to investigate in more detail.
@jepler may have seen something like this when making a request before the http server was up.
In the camera app, I create a webserver, then create the pycamera object, and finally enter the webserver's main loop. I have seen that when I make a request before it has fully initialized the camera, that request can get lost.
Hmm, well, if you were to put some prints when a client connects, you would probably see no client connection for the missing request. Do keep in mind if you plan to debug it, it's truly random as to when it will fail. For ftp it may do 200 sockets and then fail, or fail on the first one.
@bill88t Please retest with the absolute newest builds. The IDF has been updated since this was filed originally.
@tannewt Oh I'm still experiencing this issue on the daily since I rely on ftp for non-usb boards. It has become rarer for some reason. My workarounds still are inadequate, and only work 30% of the time.
Can you quickly reproduce this issue? It's hard to fix something that cannot be easily and quickly reproduced.
I prepared an example that relies on no libraries and I had even weirder behaviour: CP
Desktop Python
(compacted down to fit in a reasonable amount of pages) Running the CP:
Desktop:
Which means somehow, |
So I don't think there is necessarily a bug in CP here. The example code has a couple bugs from its intention I think.
I don't think |
Doing it implicitly can lead to mistaken socket leaks and reuse. It now matches CPython. Fixes #8443
CircuitPython version
Code/REPL
Behavior
When making a full filesystem dump (237 files), PASV fails with the client never connecting to the data socket.
Well what does that have to do with the core?
Well, first of all, it does work on 8.x without a single error or retry, same settings, same everything.
Second, the failure has to do with the sockets and listen.
Debugging the failure led me to the conclusion that if a client connects very quickly after .listen(..) is run, the client may be accepted by the networking stack but stored nowhere, and so .accept times out.
When this bug happens, .accept() will fail by timeout, raising its regular OSError, as if no client is waiting.
However the client is connected and waiting.
It is not rejected like when the socket is closed or no more connections are permitted.
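For reference, the expected behaviour can be sketched on desktop Python. This is only an illustration, not the code from this report: the helper name accept_or_none and the loopback address are assumptions. A client that connects right after listen() should be queued by the stack, so the later accept() returns it instead of timing out.

```python
import socket


def accept_or_none(listener, timeout):
    """Hypothetical helper: return (conn, addr), or None if accept() times out."""
    listener.settimeout(timeout)
    try:
        return listener.accept()
    except OSError:
        # A timeout surfaces as OSError (socket.timeout is a subclass on CPython).
        return None


listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)

# The client connects immediately after listen(), before accept() is called.
client = socket.create_connection(listener.getsockname(), timeout=2.0)

result = accept_or_none(listener, 2.0)
print("accepted" if result is not None else "lost")
```

With the bug described above, the accept call would return nothing even though the client's connect succeeded.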
Description
So this is some sort of race condition where a client connection comes in at just the wrong time to be accepted by the code but not stored anywhere.
FTP relies on making a new socket for every transfer and directory listing. For that reason we end up making ~300 sockets/connections for a full dump of 237 files.
It would probably be possible to make a simpler example, since we only need to spam connections (yes, around 7 connections / second) and test whether the connection object is not None.
This issue has already received a workaround on the current (ftp) master, so it's not really a world ending bug.
However it's still an issue that could conceivably appear in prod for a single connection, assuming bad enough luck (timing).
FTP just happens to roll the broken dice often enough that it breaks almost 100% of the time.
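A simpler spam test along those lines could look roughly like the sketch below. It is an assumption-laden illustration: count_lost_clients is a made-up name, and both ends run in one desktop Python process over loopback, whereas a real test would run the listening side on the board and the connecting side on another machine. Each round makes a fresh listening socket (like a PASV data socket), connects at once, and counts the rounds where accept() times out.

```python
import socket


def count_lost_clients(rounds, timeout=1.0):
    """Hypothetical reproducer: count rounds where a connected client is never accepted."""
    lost = 0
    for _ in range(rounds):
        # Fresh listening socket per round, as FTP does for each transfer.
        listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        listener.settimeout(timeout)
        # Connect as soon as possible after listen().
        client = socket.create_connection(listener.getsockname(), timeout=timeout)
        try:
            conn, _addr = listener.accept()
            conn.close()
        except OSError:
            # accept() timed out although the client is connected: a lost client.
            lost += 1
        finally:
            client.close()
            listener.close()
    return lost


print("lost clients:", count_lost_clients(50))
```

A healthy stack should report no lost clients; any non-zero count would match the behaviour described in this issue.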
Additional information
If anybody intends on actually going through the ftp code:
line 645: self._data_socket, self._client_pasv = self._pasv_sock.accept() ends up returning None forever while this bug happens.
The timeout will close the socket and tell the client the file transfer failed (message 25).
The client will then re-request a new socket which will work.
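The recovery flow described here (time out, close the socket, let the client ask for a new one) can be sketched as follows. This is not the actual ftp code: accept_with_retry and make_listener are hypothetical names, and make_listener stands in for whatever opens a new passive socket and tells the client where to connect.

```python
import socket


def accept_with_retry(make_listener, attempts=3, timeout=0.2):
    """Hypothetical sketch: try fresh listening sockets until one yields a client.

    Returns (conn, addr) on success, or None if every attempt timed out.
    """
    for _ in range(attempts):
        listener = make_listener()
        listener.settimeout(timeout)
        try:
            return listener.accept()
        except OSError:
            # Timed out: give up on this socket and fall through to a new one.
            pass
        finally:
            # The listening socket is single-use either way.
            listener.close()
    return None
```

Closing the listener in the finally block does not affect an already accepted connection, so the returned conn stays usable for the transfer.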