[tests] Smaller model in slow cache tests #37922
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the
```python
def setUp(self):
    # Clears memory before each test. Some tests use large models, which might result in suboptimal torch
    # re-allocation if we run multiple tests in a row without clearing memory.
    cleanup(torch_device, gc_collect=True)

@classmethod
def tearDownClass(cls):
    # Clears memory after the last test. See `setUp` for more details.
```
Related to this fix.
On main, we're not reusing GPU memory properly across `from_pretrained` calls. Until that is sorted out, we have to start each test with a memory reset to prevent flaky tests. Without the reset, the first test in this class might hit memory-related issues (see the diff in `test_cache_copy`, the first test run in this class).
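For context, a memory-reset helper of the kind called in `setUp` typically combines garbage collection with emptying the GPU allocator cache. The sketch below is an illustrative assumption of that behavior, not the exact `transformers` implementation (the real `cleanup` lives in the test utilities):

```python
import gc

try:
    import torch
except ImportError:  # the sketch stays runnable without torch installed
    torch = None


def cleanup(device: str, gc_collect: bool = False) -> None:
    """Minimal sketch of a memory-reset helper (assumed behavior).

    Optionally runs the Python garbage collector, then releases cached
    GPU memory so the next test starts from a clean allocator state.
    """
    if gc_collect:
        gc.collect()
    if torch is not None and device.startswith("cuda") and torch.cuda.is_available():
        torch.cuda.empty_cache()


# Safe no-op on CPU-only machines.
cleanup("cuda", gc_collect=True)
```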
```python
"enriching experience that broadens our horizons and allows us to explore the world beyond our comfort "
"zones. Whether it's a short weekend getaway",
```
Depending on the device, if we run this test in isolation (`RUN_SLOW=1 py.test tests/utils/test_cache_utils.py -k test_cache_copy`) we might get a different output than in a full test suite run. With the updated memory reset, this is the correct output in all combinations (see the comment above).
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
ydshieh
left a comment
Thank you!
it's nice to try to use
run-slow: utils
This comment contains run-slow, running the specified jobs: models: ['utils']
Force-pushed from 394e83f to 899cb26
(there are some multi-gpu issues, fixing them)
@ydshieh the multi-gpu issues require extensive changes to the offloaded caches, so I'm merging this as-is and will open a new PR to make multi-gpu work!
What does this PR do?
Our CI is failing due to OOM in some slow tests (see `CacheHardIntegrationTest` failures here). This PR replaces the 7B model (which requires ~15GB of VRAM) with a 4B model (which requires ~9GB of VRAM) in the affected tests. It also makes a few more minor modifications to ensure a green CI (commented in the diff).
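The VRAM figures above line up with a common back-of-the-envelope estimate: roughly 2 bytes per parameter for fp16/bf16 weights, plus some headroom for activations and the KV cache. The helper below is a rough illustrative sketch (the 1.1 overhead factor is an assumption, not a measured value):

```python
def fp16_vram_gb(num_params_billion: float, overhead: float = 1.1) -> float:
    # Each parameter takes 2 bytes in fp16/bf16; the multiplicative
    # overhead crudely accounts for activations and the KV cache.
    bytes_total = num_params_billion * 1e9 * 2 * overhead
    return bytes_total / 1e9


print(round(fp16_vram_gb(7), 1))  # ~15.4, consistent with the ~15GB figure for a 7B model
print(round(fp16_vram_gb(4), 1))  # ~8.8, consistent with the ~9GB figure for a 4B model
```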