Add question-answer example for v2 trainer #2580
Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
8e336ff to 75062fd
@solanyn Thanks for this amazing work! I appreciate the valuable use case you've contributed for Trainer. I left my initial review comments for you.
/cc @kubeflow/wg-training-leads @astefanutti
Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
Thanks @Electronic-Waste, I've updated the branch to address your comments!
@solanyn Thanks for this! And welcome to the Kubeflow community!
/lgtm
Pull Request Test Coverage Report for Build 14721592890 (Coveralls)
Thank you for this great contribution, @solanyn!
I left a few comments.
It would be awesome to have this example as part of our first Kubeflow Trainer 2.0 release.
/assign @tenzen-y @saileshd1402 @kubeflow/wg-training-leads @astefanutti @shravan-achar @akshaychitneni
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrew Chen <14799876+solanyn@users.noreply.github.com>
/ok-to-test
* run train job on CPU
* reduce batch size, dataset size and train epochs
* make upload to bucket optional
* add notebook to e2e-test
* set model name as trainjob argument
Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
/rerun-all
* e2e tests fail if trainjobs launched by notebook do not finish in 3s
* extends the timeout to 5min to block and wait for longer trainjobs until timeout or trainjob completes
Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
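For readers unfamiliar with the "block until timeout or completion" behavior described in this commit, here is a minimal sketch of the general pattern. It is not the code from this PR or from the Kubeflow SDK. The function name `wait_for_status`, the `get_status` callable, and the status strings `"Complete"` and `"Failed"` are hypothetical placeholders chosen for illustration; only the 5-minute default mirrors the commit message.

```python
import time
from typing import Callable, FrozenSet


def wait_for_status(
    get_status: Callable[[], str],
    done: FrozenSet[str] = frozenset({"Complete", "Failed"}),  # hypothetical status names
    timeout_s: float = 300.0,  # 5 minutes, as in the commit message
    poll_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns a terminal status or the timeout elapses.

    get_status is a caller-supplied function (an assumption of this sketch);
    in a real test it would query the job through whatever client is in use.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in done:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job still in status {status!r} after {timeout_s} seconds"
            )
        sleep(poll_interval_s)
```

The status is checked before the deadline, so a job that finishes on the final poll is still reported as finished rather than timed out. The `sleep` parameter is injectable only so the loop can be exercised in tests without real waiting.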
* revert change to e2e-run-notebook.sh
Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
@andreyvelich @Electronic-Waste this should be good to go, let me know if there is anything else you'd like to see in this change!
andreyvelich left a comment
Thank you for this great contribution @solanyn 🎉
/lgtm
/assign @Electronic-Waste @tenzen-y @astefanutti @saileshd1402
@andreyvelich: GitHub didn't allow me to assign the following users: saileshd1402.
Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich
* Add question-answer example
  Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
* chore: remove unused lines, add TODO comment
  Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
* chore: update example description
  Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
  Signed-off-by: Andrew Chen <14799876+solanyn@users.noreply.github.com>
* chore: update question-answering example
  * run train job on CPU
  * reduce batch size, dataset size and train epochs
  * make upload to bucket optional
  * add notebook to e2e-test
  * set model name as trainjob argument
  Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
* chore: extend e2e-run-notebook timeout
  * e2e tests fail if trainjobs launched by notebook do not finish in 3s
  * extends the timeout to 5min to block and wait for longer trainjobs until timeout or trainjob completes
  Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
* chore: update example to wait for trainjob running status
  * revert change to e2e-run-notebook.sh
  Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
---------
Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
Signed-off-by: Andrew Chen <14799876+solanyn@users.noreply.github.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
What this PR does / why we need it:
This PR adds an example for the V2 Training Operator that trains a question-answering model. It is based on the Hugging Face recipe, adapted to the Training Operator API.
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Adds a question-answer example mentioned in #2040 using the PyTorch runtime.
Checklist: