Loglikelihood is not supported #3275

@arulp18

Description

I was running GLUE benchmarking tasks such as CoLA, QQP, and SST-2 using lm-eval.
By default, the output_type is set to multiple_choice, which internally calls loglikelihood.
However, with chat-based models (e.g., OpenAI chat completions), this fails with the following error:

Traceback (most recent call last):
  File "/home/myusername/lm-evaluation-harness/myenv/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/home/myusername/lm-evaluation-harness/lm_eval/__main__.py", line 450, in cli_evaluate
    results = evaluator.simple_evaluate(
  ...
  File "/home/myusername/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 186, in loglikelihood
    raise NotImplementedError(
NotImplementedError: Loglikelihood is not supported for chat completions. Consider using the completions API instead.

Sample YAML:

tag: glue
task: cola
dataset_path: nyu-mll/glue
dataset_name: cola
output_type: multiple_choice
training_split: train
validation_split: validation
doc_to_text: "{{sentence}}\nQuestion: Does this sentence make sense?\nAnswer:"
doc_to_target: label
doc_to_choice: ["no", "yes"]
should_decontaminate: true
doc_to_decontamination_query: sentence
metric_list:
  - metric: mcc
metadata:
  version: 1.0

Bash command used:

#!/bin/bash

export HF_HOME=/data
export HF_ALLOW_CODE_EVAL=1
export HF_TOKEN=
export HUGGINGFACE_HUB_TOKEN=
export MODEL_NAME="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
export BASE_URL="some url" 
export AUTH_TOKEN="sk-************"

lm_eval --model local-chat-completions \
       --tasks gsm8k \
       --apply_chat_template \
       --model_args "model=$MODEL_NAME,base_url=$BASE_URL,num_concurrent=1,max_retries=3,tokenizer=/data/local_tokenizer,tokenized_requests=False"

What I Tried

Changing output_type to generate_until in the YAML config.
This runs, but for CoLA, it only returns 0.0 values (not useful).
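For context, the generate_until variant was roughly along the lines below. This is only a sketch: the string target mapping, the stop sequence, the take_first filter, and the exact_match metric are illustrative stand-ins rather than the benchmark's intended mcc-over-class-predictions setup.

tag: glue
task: cola_generate   # hypothetical task name so it does not clash with the built-in cola task
dataset_path: nyu-mll/glue
dataset_name: cola
output_type: generate_until
training_split: train
validation_split: validation
doc_to_text: "{{sentence}}\nQuestion: Does this sentence make sense?\nAnswer:"
# assumption: map the integer label to its string choice so generated text can be compared
doc_to_target: "{{['no', 'yes'][label]}}"
generation_kwargs:
  until:
    - "\n"
  do_sample: false
filter_list:
  - name: take_first
    filter:
      - function: take_first
metric_list:
  - metric: exact_match   # stand-in for mcc; exact_match compares generated strings, not class indices
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0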

My Questions

Is it valid/appropriate to use a different output_type like generate_until for GLUE tasks?
Will this lead to wrong evaluation results compared to the intended benchmark setup?
How can I correctly configure the YAML (or adapt the task setup) so that GLUE tasks work with chat models that don’t support loglikelihood?
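
For reference, the error message points to the completions API. One possible workaround, which I have not tested, would be to switch the model type to local-completions and drop --apply_chat_template; this assumes the same BASE_URL also serves the non-chat completions endpoint and returns prompt logprobs, and uses cola as the example task:

lm_eval --model local-completions \
       --tasks cola \
       --model_args "model=$MODEL_NAME,base_url=$BASE_URL,num_concurrent=1,max_retries=3,tokenizer=/data/local_tokenizer,tokenized_requests=False"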
