Skip to content

Conversation

juj
Copy link
Collaborator

@juj juj commented Sep 14, 2025

This is a cleaned up version 2 of #25237.


Two problems that I run into with the parallel browser harness on Windows with Firefox:

  1. On Windows, all the browser windows open up in the exact same X,Y coordinate. Firefox has "is browser in the background" detection, which causes all but one test foreground harness to hang when running any requestAnimationFrame() based test.

  2. Killing a Firefox browser process with

proc = subprocess.Popen(['firefox'])
proc.kill()

does not work. This is because Firefox is a multiprocess browser, and does not have a process hierarchy where the first launched process would be some kind of a master process. (this is an old known issue, that can be seen also in existing emrun.py code) There seems to be about 12 processes that come alive when running subprocess.Popen(['firefox']).


To fix these issues:

  1. ask the Windows Win32 API to move each spawned process to its own X,Y location. Use a file-based multiprocessing lock + counter file to get unique counter to each process.

  2. Snapshot alive processes before spawning Firefox, and after. Then acquire the delta of these two to find which process IDs correspond to the launched Firefox.

@juj
Copy link
Collaborator Author

juj commented Sep 16, 2025

This is the last PR remaining that is blocking me from testing my CI directly on emscripten/main rather than my own fork. Would be good to get this in so that I can migrate to target my CI on emscripten/main.

Copy link
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to find a way keep this code isolated to only running on windows? We don't need any of that FileLock stuff except on windows right?

Perhaps it could even move into its own file? test/browser_launcher.py?

test/common.py Outdated
"""Deletes a directory. Returns whether deletion succeeded."""
if not os.path.exists(dirname):
return True
if os.path.isfile(dirname):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about assert not os.path.isfile(dirname) since it seems like a logical error to call this on a non-directory.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

test/common.py Outdated
# Firefox is a multiprocess browser. Killing the spawned process will not
# bring down the whole browser, but only one browser tab. So take a delta
# snapshot before->after spawning the browser to find which subprocesses
# we launched.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this comment only apply to windows?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it seems to. Updated the comment.

test/common.py Outdated
# Delete old browser data directory. If we cannot (the data dir is in use on Windows),
# switch to another dir.
while not force_delete_dir(browser_data_dir):
browser_data_dir += '-another'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it could generated -another-another-another-another? Should we use a counter and do _1, _2 etc?

Also, like the the rest of this PR I'd would love to not run this loop except on windows.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it can, and does. I wanted to keep the logic simple. Updated to use _number

@sbc100
Copy link
Collaborator

sbc100 commented Sep 16, 2025

I'd also love to add some windows + firefox testing on the emscripten CI as part of this so that we can feel free to refactor without breaking stuff.

@juj
Copy link
Collaborator Author

juj commented Sep 16, 2025

I'd like to find a way keep this code isolated to only running on windows? We don't need any of that FileLock stuff except on windows right?

I was thinking that the FileLock could serve as a mechanism to solve #25069 (comment) later as well.

The overhead of running the FileLock mechanism on all platforms should be trivial and negligible. I can certainly refactor it out, though I figured your review would nit on two separate/duplicate implementations to launch the browser if I did. To me the logic reads cleaner with one common launcher impl.

@juj
Copy link
Collaborator Author

juj commented Sep 16, 2025

I'd also love to add some windows + firefox testing on the emscripten CI as part of this so that we can feel free to refactor without breaking stuff.

That is a good idea. I think that is good to develop in a separate PR.

@juj
Copy link
Collaborator Author

juj commented Sep 17, 2025

I'd also love to add some windows + firefox testing on the emscripten CI as part of this so that we can feel free to refactor without breaking stuff.

That is a good idea. I think that is good to develop in a separate PR.

Ok, updated this PR to include Firefox on Windows testing.

@juj
Copy link
Collaborator Author

juj commented Sep 17, 2025

Looks like CircleCI now also passes here on the new Firefox on Windows tests.

Copy link
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, that so great having windows browser tests running! Thanks for taking care of that.

I think maybe we should split them into their own running, but I'm happy to do that as a followup.

Regarding the process tracking issue ("Firefox is a multiprocess browser, and does not have a process hierarchy where the first launched process would be some kind of a master process."), do you know why it only effect firefox and not chrome (which is also multi-process? I'm just curious if we could maybe get this fixed and remove all this extra code one day?

cls.browser_procs = list(set(procs_after).difference(set(procs_before)))
# Wrap window positions on a Full HD desktop area modulo primes.
for proc in cls.browser_procs:
move_browser_window(proc.pid, (300 + count * 47) % 1901, (10 + count * 37) % 997)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of putting if WINDOWS in through this block can we make the entire thing windows-only:

if not WINDOWS:
  cls.browser_procs = [subprocess.Popen(browser_args + [url])]
else:
  ....

Perhaps append url to browser_args first?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I moved to a separate function.

@@ -564,6 +564,8 @@ def set_env(name, option_value):

# Remove any old test files before starting the run
utils.delete_file(common.flaky_tests_log_filename)
utils.delete_file(common.browser_spawn_lock_filename)
utils.delete_file(f'{common.browser_spawn_lock_filename}_counter')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not possible to store the count in the lock file itself? (i.e. can we avoid separate counter file?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that first, but got odd behavior with O_EXCL when O_CREAT was not always unconditionally passed. So reverted to this simpler two-file form

# the browser.
cls.browser_procs = list(set(procs_after).difference(set(procs_before)))
# Wrap window positions on a Full HD desktop area modulo primes.
for proc in cls.browser_procs:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be needed in headless mode. Headless should bypass the window active tracking stuff.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, updated.

command: |
# To download Firefox, we must first figure out what the latest Firefox version name is.
# This is because there does not exist a stable/static URL to download latest Firefox from.
$html = Invoke-WebRequest "https://archive.mozilla.org/pub/firefox/nightly/latest-mozilla-central/"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I lean towards beta, so we're not living so near the bleeding edge.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pre-existing. We already use firefox-nightly elsewhere in this file. We can consider changing separately.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linux Firefox runs also download Nightly, so I did not want to diverge there. Maybe if we want to test beta, we'd switch both Linux and Windows Firefoxes at the same time, as a separate PR?

@juj juj merged commit b52308c into emscripten-core:main Sep 17, 2025
29 of 31 checks passed
@juj
Copy link
Collaborator Author

juj commented Sep 17, 2025

do you know why it only effect firefox and not chrome (which is also multi-process?

I haven't stress tested browser tests on Chrome on Windows yet, so not quite sure.

I'm just curious if we could maybe get this fixed and remove all this extra code one day?

Maybe. But one has to be careful.

The need to move windows is only present on Windows. On Linux Mint 22 and Raspberry Pi 5 OS at least, the Linux windowing system cascades the windows automatically for Firefox, so they won't be interpreted to be in background.

On macOS, all Firefox windows do stack up in identical locations, but curiously this is not a problem there: Firefox still thinks all those windows are visible. Maybe this is because of the active program vs active window distinction that is present in the macOS windowing system, or maybe Firefox just doesn't implement the is-visible logic there. Not sure.

So the active process tracking is needed for the purposes of moving the Firefox browser windows only on Windows OS.

On Windows, in addition to being able to move browser windows, Firefox processes need to be tracked for clean termination. Refactoring this code requires very special care, because when all tests are green in the harness, i.e. "the happy path", then the browser harness does not have a need to ever force-terminate and restart any browser instance.

So even if this process tracking was plain removed, it might not show up immediately as a problem, if every test in the browser harness is green. I.e. in green state, the browser_restart() function will never need to be run.

When refactoring or stress-testing, it is important to manually place some browser tests in a hanging state to ensure that browser_restart() is actually getting called during the test run.

@sbc100
Copy link
Collaborator

sbc100 commented Sep 17, 2025

do you know why it only effect firefox and not chrome (which is also multi-process?

I haven't stress tested browser tests on Chrome on Windows yet, so not quite sure.

I'm just curious if we could maybe get this fixed and remove all this extra code one day?

Maybe. But one has to be careful.

The need to move windows is only present on Windows. On Linux Mint 22 and Raspberry Pi 5 OS at least, the Linux windowing system cascades the windows automatically for Firefox, so they won't be interpreted to be in background.

On macOS, all Firefox windows do stack up in identical locations, but curiously this is not a problem there: Firefox still thinks all those windows are visible. Maybe this is because of the active program vs active window distinction that is present in the macOS windowing system, or maybe Firefox just doesn't implement the is-visible logic there. Not sure.

So the active process tracking is needed for the purposes of moving the Firefox browser windows only on Windows OS.

On Windows, in addition to being able to move browser windows, Firefox processes need to be tracked for clean termination. Refactoring this code requires very special care, because when all tests are green in the harness, i.e. "the happy path", then the browser harness does not have a need to ever force-terminate and restart any browser instance.

So even if this process tracking was plain removed, it might not show up immediately as a problem, if every test in the browser harness is green. I.e. in green state, the browser_restart() function will never need to be run.

When refactoring or stress-testing, it is important to manually place some browser tests in a hanging state to ensure that browser_restart() is actually getting called during the test run.

I see, so perhaps we should have some kind of stress test most the restarts the browser for each test.

Even if we did that, how would we know for sure that we didn't still have the old browsers lying around? Especially in headless mode. For that matter, how did you notice this issue in headless mode? Or are you running locally in headfull mode?

@sbc100
Copy link
Collaborator

sbc100 commented Sep 17, 2025

Just to be clear, if we are able to shut down the FF instance fully, would we still also need the window moving logic?

@juj
Copy link
Collaborator Author

juj commented Sep 17, 2025

Even if we did that, how would we know for sure that we didn't still have the old browsers lying around? Especially in headless mode.

Checking with len(list_processes_by_name('firefox.exe')) > 0 will accurately tell if there are old browsers lying around, as long as we make an assumption that user does not have unrelated Firefox browser instances open in the background. (which CI won't have)

This will work also in headless mode.

For that matter, how did you notice this issue in headless mode? Or are you running locally in headfull mode?

I currently only run browser tests in headful mode.

Just to be clear, if we are able to shut down the FF instance fully, would we still also need the window moving logic?

Yes, these two needs (tracking processes to be able to move windows) and (tracking processes to be able to shut down FF instances fully) are two orthogonal needs to do process tracking.

@sbc100
Copy link
Collaborator

sbc100 commented Sep 17, 2025

Yes, these two needs (tracking processes to be able to move windows) and (tracking processes to be able to shut down FF instances fully) are two orthogonal needs to do process tracking.

But do we still need to move the windows if we can shutdown the old FF windows? Or is the moving only needed because the old window lingers?

@juj
Copy link
Collaborator Author

juj commented Sep 17, 2025

But do we still need to move the windows if we can shutdown the old FF windows? Or is the moving only needed because the old window lingers?

The moving is needed because if the harness launches, say, 8 browser instances, then on Windows on Firefox, all those browser instances launch in the exact same (x, y, width, height) window rectangle. Seven of those Firefoxes will decide that the browser tab is not visible, so Firefox will not bother to even load up the Emscripten test page contents. Only the topmost Firefox window will load up an Emscripten test.

So the seven background Firefoxes will time out in a hang after five minutes, while only the topmost window makes progress.

End result is that most of the tests do pass in the suite, save for a random selection that got allocated into a background browser window in the first place.

@sbc100
Copy link
Collaborator

sbc100 commented Sep 17, 2025

But do we still need to move the windows if we can shutdown the old FF windows? Or is the moving only needed because the old window lingers?

The moving is needed because if the harness launches, say, 8 browser instances, then on Windows on Firefox, all those browser instances launch in the exact same (x, y, width, height) window rectangle. Seven of those Firefoxes will decide that the browser tab is not visible, so Firefox will not bother to even load up the Emscripten test page contents. Only the topmost Firefox window will load up an Emscripten test.

So the seven background Firefoxes will time out in a hang after five minutes, while only the topmost window makes progress.

End result is that most of the tests do pass in the suite, save for a random selection that got allocated into a background browser window in the first place.

Ah I see so it relates the parallel runner. Got it. Should not be needed with -j1?

@juj
Copy link
Collaborator Author

juj commented Sep 17, 2025

Yes, indeed, if running in singlethreaded mode, moving the Firefox windows is not needed. This is realized with the if worker_id is not None: checks: if running the browser harness in singlethreaded mode, worker_id = None, and so the window moving logic is not applied.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants