Add context_logits for eval accuracy calculation in case of multi token prediction tasks (#11753)

oyilmaz-nvidia merged 11 commits into main
Conversation
oyilmaz-nvidia left a comment:
Looks good but can you please run the examples here https://docs.nvidia.com/nemo-framework/user-guide/latest/deployment/llm/optimized/tensorrt_llm.html and make sure nothing is broken?
[🤖]: Hi @athitten 👋, we wanted to let you know that a CICD pipeline for this PR just finished successfully, so it might be time to merge this PR or get some approvals. I'm just a bot, so I'll leave it to you what to do next. //cc @pablo-garay @ko3n1g
LGTM
Hi @oyilmaz-nvidia, I ran the scripts you pointed to with HF llama3-8b converted to NeMo 2. No errors, everything worked fine. Here's the output.

Deployment:

```shell
python scripts/deploy/nlp/deploy_triton.py --nemo_checkpoint /workspace/hf_llama3_8b_nemo2_new.nemo --model_type 'llama' --triton_model_name 'llama3-8b' --tensor_parallelism_size 1
```

Output:

```
| cuda_memory_pool_byte_size{1}    | 67108864                                 |
| min_supported_compute_capability | 6.0                                      |
| strict_readiness                 | 1                                        |
| exit_timeout                     | 30                                       |
| cache_enabled                    | 0                                        |
+----------------------------------+------------------------------------------+
I0110 00:42:14.989638 1098739 grpc_server.cc:2558] "Started GRPCInferenceService at 0.0.0.0:8001"
I0110 00:42:14.989788 1098739 http_server.cc:4713] "Started HTTPService at 0.0.0.0:8000"
I0110 00:42:15.030635 1098739 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
E0110 00:42:15.571184 1098739 model_repository_manager.cc:470] "Failed to set config modification time: model_config_content_name_ is empty"
I0110 00:42:15.571553 1098739 model_lifecycle.cc:472] "loading: llama3-8b:1"
I0110 00:42:17.003485 1098739 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: llama3-8b_0_0 (CPU device 0)"
I0110 00:42:17.484886 1098739 model_lifecycle.cc:839] "successfully loaded 'llama3-8b'"
[01/10/2025-00:42:17] Model serving on Triton is will be started.
```

Inference request:

```shell
python scripts/deploy/nlp/query.py -mn 'llama3-8b' -p "Hi, how are you?" -mol 20
```
[🤖] beep boop: 🙏 The following files have warnings. In case you are familiar with these, please try to help us improve the code base. Your code was analyzed with PyLint, and the following annotations have been identified. Mitigation guide: by applying these rules, we reduce the occurrence of this message in the future. Thank you for improving NeMo's documentation!
Merged commit: Add context_logits for eval accuracy calculation in case of multi token prediction tasks (#11753)

* Add server ready check before evaluation. Uses bool `generation_logits_available` as the inputs dict does not contain it.
* Add context logits
* Remove max_tokens_to_generate and add more comments
* Apply isort and black reformatting
* Get context_logits for multi token prediction tasks
* Fix bug with single/multi token condition check
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Apply isort and black reformatting
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Bugfix with output_context_logits

Signed-off-by: Abhishree <abhishreetm@gmail.com>
Signed-off-by: athitten <athitten@users.noreply.github.com>
Co-authored-by: athitten <athitten@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Abhinav Garg <abhgarg@nvidia.com>
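The "server ready check" commit above derives a boolean instead of indexing a key that may be missing from the inputs dict. A minimal sketch of that pattern (the dict shape and helper name are assumptions based on the commit message, not the exact NeMo code):

```python
# Sketch: guard against a missing "generation_logits" key in the server's
# inputs dict instead of indexing it directly (names assumed for illustration).
def is_generation_logits_available(inputs: dict) -> bool:
    """Return True only when the server response actually carries generation_logits."""
    return inputs.get("generation_logits") is not None
```

Downstream code can then branch on the boolean rather than risking a `KeyError`.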


What does this PR do ?
This PR adds the following changes:

- Uses `context_logits` to compute `logProbs` for eval benchmarks with multi-token predictions, e.g. `arc_challenge`, `arc_easy`, `winogrande`, `copa`. (`MMLU` and `lambada` are single-token prediction benchmarks; these use `generation_logits` instead, to avoid the large payload from `context_logits`.)
- To obtain `context_logits` for evals, exposes `gather_context_logits` and `output_context_logits` in the export and deploy files, similar to the previously existing `generation_logits`.
- Introduces a `requirements_eval.txt` file to install `lm-eval-harness` in NeMo FW containers.

Collection: [Note which collection this PR will affect]
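The benchmark-dependent choice between `generation_logits` and `context_logits` described above can be sketched as follows. This is a minimal illustration only; the helper name and the benchmark set are assumptions, not the actual NeMo implementation:

```python
# Sketch of selecting the logits source per benchmark (names assumed).
# Single-token benchmarks (e.g. MMLU, lambada) only need logits for the
# generated token, so generation_logits keeps the payload small.
# Multi-token benchmarks (e.g. arc_challenge, winogrande) need log-probs
# over whole continuations, which requires context_logits.

SINGLE_TOKEN_BENCHMARKS = {"mmlu", "lambada"}  # assumed set, for illustration

def wants_context_logits(benchmark: str) -> bool:
    """Return True when the benchmark scores multi-token continuations."""
    return benchmark.lower() not in SINGLE_TOKEN_BENCHMARKS
```

A caller would then request `output_context_logits` from the server only when this returns True, avoiding the oversized payload on single-token benchmarks.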
Changelog
Usage
```python
# Add a code snippet demonstrating how to use this
```

GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information