Add question-answer example for v2 trainer #2580

Merged
google-oss-prow[bot] merged 7 commits into kubeflow:master from solanyn:solanyn/question-answer-example on May 9, 2025

solanyn commented Apr 1, 2025

Contributor

What this PR does / why we need it:

This PR adds an example for the V2 Training Operator that trains a question-answering model. It is based on the Hugging Face recipe, adapted to the Training Operator API.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Adds the question-answer example mentioned in #2040, using the PyTorch runtime.

Checklist:

  • Docs included if any changes are user facing

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

google-oss-prow bot requested a review from jinchihe on April 1, 2025 16:19
Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
solanyn force-pushed the solanyn/question-answer-example branch from 8e336ff to 75062fd on April 1, 2025 16:20
3 comment threads on examples/pytorch/question-answering/fine-tune-distilbert.ipynb
@Electronic-Waste

Member

@solanyn Thanks for this amazing work! I appreciate your valuable use cases for Trainer. I left my initial review comments for you.

/cc @kubeflow/wg-training-leads @astefanutti

google-oss-prow bot requested review from a team and astefanutti on April 2, 2025 04:00
Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
solanyn commented Apr 2, 2025

Contributor Author

Thanks @Electronic-Waste, I've updated the branch to address your comments!

@Electronic-Waste

Member

@solanyn Thanks for this! And welcome to the Kubeflow community!

/lgtm
/assign @kubeflow/wg-training-leads @astefanutti

coveralls commented Apr 2, 2025

Pull Request Test Coverage Report for Build 14721592890

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 26.207%

Totals
Change from base Build 14713473393: 0.0%
Covered Lines: 684
Relevant Lines: 2610

💛 - Coveralls

andreyvelich left a comment

Member

Thank you for this great contribution, @solanyn!
I left a few comments.
It would be awesome to have this example as part of our first Kubeflow Trainer 2.0 release.
/assign @tenzen-y @saileshd1402 @kubeflow/wg-training-leads @astefanutti @shravan-achar @akshaychitneni

5 comment threads on examples/pytorch/question-answering/fine-tune-distilbert.ipynb (outdated)
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrew Chen <14799876+solanyn@users.noreply.github.com>
google-oss-prow bot removed the lgtm label on Apr 25, 2025
@andreyvelich

Member

/ok-to-test

* run train job on CPU
* reduce batch size, dataset size and train epochs
* make upload to bucket optional
* add notebook to e2e-test
* set model name as trainjob argument

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
@andreyvelich

Member

/rerun-all

* e2e tests fail if trainjobs launched by notebook do not finish in 3s
* extends the timeout to 5min to block and wait for longer trainjobs until timeout or trainjob completes

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
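
The blocking wait described in this commit message can be sketched as a generic poll-until-done loop. This is only an illustration under stated assumptions, not the code from the PR or from e2e-run-notebook.sh: `wait_for_status`, the `get_status` callable, and the status strings are hypothetical names, and only the 5-minute default timeout is taken from the commit message.

```python
import time
from typing import Callable, Collection


def wait_for_status(
    get_status: Callable[[], str],       # hypothetical: returns the job's current status
    done: Collection[str],               # hypothetical terminal statuses
    timeout_s: float = 300.0,            # 5 minutes, as in the commit message
    poll_s: float = 5.0,                 # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status until it returns a status in `done` or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in done:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still in status {status!r} after {timeout_s}s")
        sleep(poll_s)
```

A caller would pass a function that looks up the TrainJob status by whatever means the client offers. Injecting `sleep` and `clock` keeps the loop testable without real delays.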
Comment thread examples/pytorch/question-answering/fine-tune-distilbert.ipynb
solanyn added 2 commits April 29, 2025 11:28
* revert change to e2e-run-notebook.sh

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
solanyn commented Apr 29, 2025

Contributor Author

@andreyvelich @Electronic-Waste This should be good to go. Let me know if there is anything else you'd like to see in this change!

andreyvelich left a comment

Member

Thank you for this great contribution @solanyn 🎉
/lgtm
/assign @Electronic-Waste @tenzen-y @astefanutti @saileshd1402

@google-oss-prow

@andreyvelich: GitHub didn't allow me to assign the following users: saileshd1402.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

Details

In response to this:

Thank you for this great contribution @solanyn 🎉
/lgtm
/assign @Electronic-Waste @tenzen-y @astefanutti @saileshd1402

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andreyvelich

Member

/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot merged commit 074d8b8 into kubeflow:master on May 9, 2025
17 checks passed
google-oss-prow bot added this to the v2.0 milestone on May 9, 2025
akagami-harsh pushed a commit to akagami-harsh/training-operator that referenced this pull request Jul 17, 2025
* Add question-answer example

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>

* chore: remove unused lines, add TODO comment

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>

* chore: update example description

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrew Chen <14799876+solanyn@users.noreply.github.com>

* chore: update question-answering example

* run train job on CPU
* reduce batch size, dataset size and train epochs
* make upload to bucket optional
* add notebook to e2e-test
* set model name as trainjob argument

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>

* chore: extend e2e-run-notebook timeout

* e2e tests fail if trainjobs launched by notebook do not finish in 3s
* extends the timeout to 5min to block and wait for longer trainjobs until timeout or trainjob completes

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>

* chore: update example to wait for trainjob running status

* revert change to e2e-run-notebook.sh

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>

---------

Signed-off-by: solanyn <14799876+solanyn@users.noreply.github.com>
Signed-off-by: Andrew Chen <14799876+solanyn@users.noreply.github.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

6 participants