Skip to content

BATCH_DELIVERY_TIMEOUT behavior is broken and leads to emails being sent multiple times #479

@bbbergh

Description

@bbbergh

The current implementation of BATCH_DELIVERY_TIMEOUT is bugged in multiple ways, which means that if a timeout ever occurs it very likely leads to emails being sent to the mail server without being marked as sent, and thus sent multiple times, and potentially infinitely often.

The responsible code sits in _def send_bulk of mail.py, with the key snippet being:

pool = ThreadPool(number_of_threads)
results = []
for email in emails:
results.append(pool.apply_async(send, args=(email,)))
timeout = get_batch_delivery_timeout()
# Wait for all tasks to complete with a timeout
# The get method is used with a timeout to wait for each result
for result in results:
result.get(timeout=timeout)
# for result in results:
# try:
# # Wait for all tasks to complete with a timeout
# # The get method is used with a timeout to wait for each result
# result.get(timeout=timeout)
# except TimeoutError:
# logger.exception("Process timed out after %d seconds" % timeout)
pool.close()
pool.join()

Whenever a timeout is reached, result.get(timeout=timeout) raises an exception which is not caught, and the execution of the function terminates. In particular this means that:

  1. The database is not updated for any emails which have been previously sucessfully sent in the same batch.
  2. The thread pool is left in a running state and never closed. In particular the execution of any of the sending tasks (the one exceeding the timeout, and also other scheduled ones) is not cancelled and will keep running until the thread pool gets garbage collected (which seems to be happening at a random time).

This is obviously quite bad, since if a timeout ever gets reached it means that a random share of emails may get sent sucessfully, but will not be marked as such, and will be resent whenever all queued emails are due to be resent again. If whatever leads to the timeout being exceeded is still present then, this will keep going on forever, and people get spammed with the same email many many times.

The first of the two issues listed above is not hard to fix by slightly changing the code, the second one however is not, and it seems that there are quite a few issues with how timeouts are handled in general:

  • It is not obvious why timeout errors (which are currently raised into the calling code) should be handled differently from any other errors which may occur during sending (such as SMTP connection errors), which are currently cought and lead to the email being marked as failed.
  • Since we can't really cancel the tasks of sending the email, implementing timeouts at this level seems generally a bad idea. Even if we end up killing the thread that sends the request, it is impossible to tell if the transfer has actually been completed (this is unlikely, but may still happen sometimes), and this would also leave the connection in a broken state. It seems that timeouts should be left to the underlying network code, which can assess the state of the connection and the request and cancel in a way that always implies failure and also keeps the connection in a healthy state.
  • The timeouts as they are implemented currently don't actually work as advertised in the description of BATCH_DELIVERY_TIMEOUT, as they limit not the duration of the entire batch send, but just a random share of it: Since the tasks share connections, which (for SMTP) use a Lock internally, the tasks are actually executed sequentially in essentially a random order. The tasks are also awaited sequentially, and a timeout occurs whenever a task is awaited for too long, where the await duration is the sum of the time waiting for the task to start executing plus its actual execution time.

Overall it seems like the only way to implement timeouts correctly and consistently would be to check after the completion of every individual mail if a total time for the entire batch has been exceeded, and then maybe stop. This would not catch the case where a single mail takes a long time to be sent, however as stated above it is not obvious how one would correctly abort in the case where that is detected.

In any case, with the current implementation the default value of BATCH_DELIVERY_TIMEOUT should definitely be None as otherwise this leads to some very very bad behavior.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions