Description
I was running GLUE benchmarking tasks such as CoLA, QQP, and SST-2 using lm-eval.
By default, the output_type is set to multiple_choice, which internally calls loglikelihood.
However, with chat-based models (e.g., OpenAI chat completions), this fails with the following error:
Traceback (most recent call last):
File "/home/myusername/lm-evaluation-harness/myenv/bin/lm_eval", line 8, in
sys.exit(cli_evaluate())
File "/home/myusername/lm-evaluation-harness/lm_eval/main.py", line 450, in cli_evaluate
results = evaluator.simple_evaluate(
...
File "/home/myusername/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 186, in loglikelihood
raise NotImplementedError(
NotImplementedError: Loglikelihood is not supported for chat completions. Consider using the completions API instead.
Sample YAML:
tag: glue
task: cola
dataset_path: nyu-mll/glue
dataset_name: cola
output_type: multiple_choice
training_split: train
validation_split: validation
doc_to_text: "{{sentence}}\nQuestion: Does this sentence make sense?\nAnswer:"
doc_to_target: label
doc_to_choice: ["no", "yes"]
should_decontaminate: true
doc_to_decontamination_query: sentence
metric_list:
  - metric: mcc
metadata:
  version: 1.0
Bash Command Used:
#!/bin/bash
export HF_HOME=/data
export HF_ALLOW_CODE_EVAL=1
export HF_TOKEN=
export HUGGINGFACE_HUB_TOKEN=
export MODEL_NAME="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
export BASE_URL="some url"
export AUTH_TOKEN="sk-************"
lm_eval --model local-chat-completions \
    --tasks gsm8k \
    --apply_chat_template \
    --model_args "model=$MODEL_NAME,base_url=$BASE_URL,num_concurrent=1,max_retries=3,tokenizer=/data/local_tokenizer,tokenized_requests=False"
What I Tried
Changing output_type to generate_until in the YAML config (roughly as sketched below). This runs, but for CoLA it only returns 0.0 values, which is not useful.
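For reference, this is roughly the variant I tried. The generation_kwargs, the regex filter, the doc_to_target mapping onto the choice strings, and the switch from mcc to exact_match are my own guesses at what a generate_until CoLA config needs, not a verified setup:

tag: glue
task: cola_generate
dataset_path: nyu-mll/glue
dataset_name: cola
output_type: generate_until
training_split: train
validation_split: validation
doc_to_text: "{{sentence}}\nQuestion: Does this sentence make sense? Answer yes or no.\nAnswer:"
# map the integer label onto the same strings the model is asked to produce
doc_to_target: "{{['no', 'yes'][label]}}"
generation_kwargs:
  until:
    - "\n"
  do_sample: false
  max_gen_toks: 16
filter_list:
  - name: "get-answer"
    filter:
      - function: "lowercase"
      - function: "regex"
        regex_pattern: "(yes|no)"
      - function: "take_first"
metric_list:
  - metric: exact_match
    ignore_case: true
metadata:
  version: 1.0

Even if a config like this runs, I am not sure exact_match on generated text is comparable to the mcc metric the benchmark intends, which is part of my question below.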
My Questions
Is it valid/appropriate to use a different output_type like generate_until for GLUE tasks?
Will this lead to wrong evaluation results compared to the intended benchmark setup?
How can I correctly configure the YAML (or adapt the task setup) so that GLUE tasks work with chat models that don’t support loglikelihood?
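For completeness: the error message suggests the completions API. If my endpoint exposed a /v1/completions route, I assume the invocation would look roughly like this (untested sketch; the base_url path is a guess, and I don't know whether the served model returns the prompt logprobs that loglikelihood needs):

lm_eval --model local-completions \
    --tasks cola \
    --model_args "model=$MODEL_NAME,base_url=$BASE_URL/v1/completions,num_concurrent=1,max_retries=3,tokenizer=/data/local_tokenizer,tokenized_requests=False"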