Add question-answer example for v2 trainer by solanyn · Pull Request #2580 · kubeflow/trainer

solanyn · 2025-04-01T16:19:50Z

What this PR does / why we need it:

This PR adds an example for the V2 Training Operator to train a question-answer model based on the HuggingFace recipe adapted to the training operator API.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Adds a question-answer example mentioned in #2040 using the pytorch runtime.

Checklist:

Docs included if any changes are user facing

review-notebook-app · 2025-04-01T16:19:55Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Electronic-Waste · 2025-04-02T04:00:29Z

@solanyn Thanks for this amazing work! I appreciate your valuable use case for Trainer. I left my initial review comments for you.

/cc @kubeflow/wg-training-leads @astefanutti

solanyn · 2025-04-02T04:24:16Z

Thanks @Electronic-Waste, I've updated the branch to address your comments!

Electronic-Waste · 2025-04-02T06:50:22Z

@solanyn Thanks for this! And welcome to the Kubeflow community!

/lgtm
/assign @kubeflow/wg-training-leads @astefanutti

coveralls · 2025-04-02T23:02:02Z

Pull Request Test Coverage Report for Build 14721592890

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 26.207%

Totals
Change from base Build 14713473393:	0.0%
Covered Lines:	684
Relevant Lines:	2610

💛 - Coveralls

andreyvelich

Thank you for this great contribution @solanyn !
I left a few comments.
It would be awesome to have this example as part of our first Kubeflow Trainer 2.0 release.
/assign @tenzen-y @saileshd1402 @kubeflow/wg-training-leads @astefanutti @shravan-achar @akshaychitneni

andreyvelich · 2025-04-25T22:40:26Z

/ok-to-test

andreyvelich · 2025-04-28T11:08:15Z

/rerun-all

solanyn · 2025-04-29T07:49:42Z

@andreyvelich @Electronic-Waste this should be good to go, let me know if there is anything else you'd like to see in this change!

andreyvelich

Thank you for this great contribution @solanyn 🎉
/lgtm
/assign @Electronic-Waste @tenzen-y @astefanutti @saileshd1402

google-oss-prow · 2025-04-29T10:12:50Z

@andreyvelich: GitHub didn't allow me to assign the following users: saileshd1402.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

Details

In response to this:

Thank you for this great contribution @solanyn 🎉
/lgtm
/assign @Electronic-Waste @tenzen-y @astefanutti @saileshd1402

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

andreyvelich · 2025-05-09T21:08:04Z

/approve

google-oss-prow · 2025-05-09T21:08:10Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

* Add question-answer example

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>

* chore: remove unused lines, add TODO comment

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>

* chore: update example description

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrew Chen <14799876+solanyn@users.noreply.github.com>

* chore: update question-answering example

  * run train job on CPU
  * reduce batch size, dataset size and train epochs
  * make upload to bucket optional
  * add notebook to e2e-test
  * set model name as trainjob argument

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>

* chore: extend e2e-run-notebook timeout

  * e2e tests fail if trainjobs launched by notebook do not finish in 3s
  * extends the timeout to 5min to block and wait for longer trainjobs until timeout or trainjob completes

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>

* chore: update example to wait for trainjob running status

  * revert change to e2e-run-notebook.sh

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>

---------

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
Signed-off-by: Andrew Chen <14799876+solanyn@users.noreply.github.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

google-oss-prow Bot requested a review from Electronic-Waste April 1, 2025 16:19

google-oss-prow Bot requested a review from jinchihe April 1, 2025 16:19

google-oss-prow Bot added the size/XL label Apr 1, 2025

Add question-answer example

75062fd

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>

solanyn force-pushed the solanyn/question-answer-example branch from 8e336ff to 75062fd Compare April 1, 2025 16:20

Electronic-Waste reviewed Apr 2, 2025

Comment thread examples/pytorch/question-answering/fine-tune-distilbert.ipynb

Comment thread examples/pytorch/question-answering/fine-tune-distilbert.ipynb

Comment thread examples/pytorch/question-answering/fine-tune-distilbert.ipynb

google-oss-prow Bot requested review from a team and astefanutti April 2, 2025 04:00

chore: remove unused lines, add TODO comment

a7f044b

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>

google-oss-prow Bot assigned Electronic-Waste Apr 2, 2025

google-oss-prow Bot added the lgtm label Apr 2, 2025

andreyvelich reviewed Apr 25, 2025

chore: update example description

ec8d1ea

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrew Chen <14799876+solanyn@users.noreply.github.com>

google-oss-prow Bot removed the lgtm label Apr 25, 2025

google-oss-prow Bot added the ok-to-test label Apr 25, 2025

chore: update question-answering example

5445bc1

* run train job on CPU
* reduce batch size, dataset size and train epochs
* make upload to bucket optional
* add notebook to e2e-test
* set model name as trainjob argument

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>

chore: extend e2e-run-notebook timeout

d3e4cf4

* e2e tests fail if trainjobs launched by notebook do not finish in 3s
* extends the timeout to 5min to block and wait for longer trainjobs until timeout or trainjob completes

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
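
The blocking wait this commit message describes (poll until the job finishes or a 5 minute timeout expires) can be sketched as a simple loop. This is a minimal illustration only, not the code from e2e-run-notebook.sh or the notebook: `wait_for_status`, `get_status`, and the status strings are hypothetical names, and a real caller would supply its own function that reads the TrainJob state.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    get_status: Callable[[], str],
    done_statuses: Iterable[str],
    timeout_s: float = 300.0,  # 5 minutes, matching the timeout named in the commit message
    poll_interval_s: float = 1.0,
) -> str:
    """Poll get_status() until it returns one of done_statuses or timeout_s elapses.

    get_status is a hypothetical callable supplied by the caller; this helper
    makes no assumption about how the job status is actually fetched.
    """
    done = set(done_statuses)
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in done:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status still {status!r} after {timeout_s} s")
        time.sleep(poll_interval_s)


if __name__ == "__main__":
    # Stand-in status source for illustration; the status names are assumptions.
    fake_statuses = iter(["Created", "Running", "Complete"])
    final = wait_for_status(
        lambda: next(fake_statuses),
        done_statuses={"Complete", "Failed"},
        timeout_s=5.0,
        poll_interval_s=0.01,
    )
    print(final)
```

Using `time.monotonic()` for the deadline keeps the timeout unaffected by wall-clock adjustments, and raising `TimeoutError` lets a test harness distinguish a job that never finished from one that reported a failure status.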

andreyvelich reviewed Apr 28, 2025

Comment thread examples/pytorch/question-answering/fine-tune-distilbert.ipynb

solanyn added 2 commits April 29, 2025 11:28

chore: update example to wait for trainjob running status

9206726

* revert change to e2e-run-notebook.sh

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>

Merge branch 'kubeflow:master' into solanyn/question-answer-example

6230415

andreyvelich reviewed Apr 29, 2025

google-oss-prow Bot assigned astefanutti Apr 29, 2025

google-oss-prow Bot assigned tenzen-y Apr 29, 2025

google-oss-prow Bot assigned andreyvelich Apr 29, 2025

google-oss-prow Bot added the lgtm label Apr 29, 2025

google-oss-prow Bot added the approved label May 9, 2025

google-oss-prow Bot merged commit 074d8b8 into kubeflow:master May 9, 2025
17 checks passed

google-oss-prow Bot added this to the v2.0 milestone May 9, 2025

6 participants
