Description
I was running GLUE benchmarking tasks such as CoLA, QQP, and SST-2 using lm-eval.
By default, the output_type is set to multiple_choice, which internally calls loglikelihood.
However, with chat-based models (e.g., OpenAI chat completions), this fails with the following error:
Traceback (most recent call last):
File "/home/myusername/lm-evaluation-harness/myenv/bin/lm_eval", line 8, in
sys.exit(cli_evaluate())
File "/home/myusername/lm-evaluation-harness/lm_eval/main.py", line 450, in cli_evaluate
results = evaluator.simple_evaluate(
...
File "/home/myusername/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 186, in loglikelihood
raise NotImplementedError(
NotImplementedError: Loglikelihood is not supported for chat completions. Consider using the completions API instead.
Sample YAML:
tag: glue
task: cola
dataset_path: nyu-mll/glue
dataset_name: cola
output_type: multiple_choice
training_split: train
validation_split: validation
doc_to_text: "{{sentence}}\nQuestion: Does this sentence make sense?\nAnswer:"
doc_to_target: label
doc_to_choice: ["no", "yes"]
should_decontaminate: true
doc_to_decontamination_query: sentence
metric_list:
  - metric: mcc
metadata:
  version: 1.0
Bash Command Used:
#!/bin/bash
export HF_HOME=/data
export HF_ALLOW_CODE_EVAL=1
export HF_TOKEN=
export HUGGINGFACE_HUB_TOKEN=
export MODEL_NAME="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
export BASE_URL="some url"
export AUTH_TOKEN="sk-************"
lm_eval --model local-chat-completions \
    --tasks gsm8k \
    --apply_chat_template \
    --model_args "model=$MODEL_NAME,base_url=$BASE_URL,num_concurrent=1,max_retries=3,tokenizer=/data/local_tokenizer,tokenized_requests=False"
What I Tried
Changing output_type to generate_until in the YAML config (roughly as sketched below). This runs, but for CoLA it only returns 0.0 values, which is not useful.
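For reference, this is roughly the variant I tried. The generation_kwargs, the regex filter, the doc_to_target mapping onto the choice strings, and the switch from mcc to exact_match are my own guesses at what a generate_until CoLA config needs, not a verified setup:

tag: glue
task: cola_generate
dataset_path: nyu-mll/glue
dataset_name: cola
output_type: generate_until
training_split: train
validation_split: validation
doc_to_text: "{{sentence}}\nQuestion: Does this sentence make sense? Answer yes or no.\nAnswer:"
# map the integer label onto the same strings the model is asked to produce
doc_to_target: "{{['no', 'yes'][label]}}"
generation_kwargs:
  until:
    - "\n"
  do_sample: false
  max_gen_toks: 16
filter_list:
  - name: "get-answer"
    filter:
      - function: "lowercase"
      - function: "regex"
        regex_pattern: "(yes|no)"
      - function: "take_first"
metric_list:
  - metric: exact_match
    ignore_case: true
metadata:
  version: 1.0

Even if a config like this runs, I am not sure exact_match on generated text is comparable to the mcc metric the benchmark intends, which is part of my question below.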
My Questions
Is it valid/appropriate to use a different output_type like generate_until for GLUE tasks?
Will this lead to wrong evaluation results compared to the intended benchmark setup?
How can I correctly configure the YAML (or adapt the task setup) so that GLUE tasks work with chat models that don’t support loglikelihood?
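For completeness: the error message suggests the completions API. If my endpoint exposed a /v1/completions route, I assume the invocation would look roughly like this (untested sketch; the base_url path is a guess, and I don't know whether the served model returns the prompt logprobs that loglikelihood needs):

lm_eval --model local-completions \
    --tasks cola \
    --model_args "model=$MODEL_NAME,base_url=$BASE_URL/v1/completions,num_concurrent=1,max_retries=3,tokenizer=/data/local_tokenizer,tokenized_requests=False"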