Audio2Text Retrieval AbsTask and Evaluator + Audiocaps Retrieval Dataset #2684

Closed

@switchpiggy commented on May 9, 2025

Resolves #2068

Code Quality

  • Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

  • Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

  • New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
  • Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

@switchpiggy changed the title from "Maeb retrieval" to "Audio2Text Retrieval AbsTask and Evaluator + Audiocaps Retrieval Dataset" on May 9, 2025
@isaac-chung linked an issue on May 9, 2025 that may be closed by this pull request
@isaac-chung requested review from Samoed and Copilot on May 9, 2025, 15:59

@Copilot (Copilot AI) left a comment

Pull Request Overview

This pull request introduces a new AudioCaps Retrieval task for Audio2Text retrieval and integrates it into the task registry while also removing an obsolete text file.

  • Added a new task implementation in AudioCapsRetrieval with associated metadata.
  • Updated module imports to expose the new task.
  • Removed the legacy "the_ugly_duckling.txt" file.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

File | Description
--- | ---
mteb/tasks/__init__.py | Added import for the Audio2TextRetrieval tasks.
mteb/tasks/Audio/Audio2TextRetrieval/eng/AudioCapsRetrieval.py | New task implementation defining AudioCaps Retrieval metadata.
mteb/tasks/Audio/Audio2TextRetrieval/__init__.py | Exposed AudioCapsRetrieval via module import.
mteb/abstasks/the_ugly_duckling.txt | Removed legacy file.

Member

@Samoed left a comment

Good start!

dataset={
    "path": "TwinkStart/AudioCaps",
    "revision": "8fc8b151149af779517aedfbf8c536160822bd70",
    "trust_remote_code": True,
Member

Suggested change
"trust_remote_code": True,

description="Measuring the ability to retrieve the groundtruth answers to reasoning task queries on ARC-Challenge.",
reference="https://allenai.org/data/arc",
dataset={
    "path": "TwinkStart/AudioCaps",
Member

Do we want this dataset? It weighs 153 GB.

Collaborator

We should probably downsample + negative mine this to a reasonable size.
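
A minimal sketch of what "downsample + negative mine" could look like, assuming the data is already split into queries, a corpus, and relevance judgments, and that similarity scores from some baseline model are available. The function name `downsample_with_negatives` and all of its parameters are hypothetical and are not part of mteb; the idea is to keep a subset of queries, all of their relevant documents, and only the highest-scoring non-relevant documents as hard negatives.

```python
import random


def downsample_with_negatives(
    queries, corpus, qrels, scores, n_queries, n_negatives, seed=42
):
    """Shrink a retrieval dataset while keeping it challenging.

    Hypothetical helper, not an mteb API.
    queries: dict query_id -> query payload (e.g. an audio clip)
    corpus:  dict doc_id -> document payload (e.g. a caption)
    qrels:   dict query_id -> set of relevant doc_ids
    scores:  dict query_id -> dict doc_id -> similarity from a baseline model
    """
    rng = random.Random(seed)

    # 1. Downsample the queries (deterministic for a fixed seed).
    kept_qids = sorted(queries)
    if len(kept_qids) > n_queries:
        kept_qids = sorted(rng.sample(kept_qids, n_queries))

    # 2. Keep every relevant document plus the top-scoring non-relevant
    #    documents (hard negatives) for each kept query.
    kept_docs = set()
    for qid in kept_qids:
        relevant = qrels.get(qid, set())
        kept_docs.update(relevant)
        candidates = [
            (score, doc_id)
            for doc_id, score in scores.get(qid, {}).items()
            if doc_id not in relevant
        ]
        candidates.sort(key=lambda pair: (-pair[0], pair[1]))
        kept_docs.update(doc_id for _, doc_id in candidates[:n_negatives])

    # 3. Rebuild the three pieces restricted to what was kept.
    small_queries = {qid: queries[qid] for qid in kept_qids}
    small_corpus = {d: corpus[d] for d in sorted(kept_docs) if d in corpus}
    small_qrels = {
        qid: set(qrels.get(qid, set())) & set(small_corpus) for qid in kept_qids
    }
    return small_queries, small_corpus, small_qrels
```

With this shape, the corpus size is bounded by roughly `n_queries * (relevant + n_negatives)` documents, so the target size can be tuned directly through the two parameters.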

Comment on lines +473 to +483
elif (
    hasattr(self.retriever.model.model, "mteb_model_meta")
    and self.retriever.model.model.mteb_model_meta.name == "bm25s"
):
    return self.retriever.model.model.search(
        corpus,
        queries,
        self.top_k,
        score_function="bm25",
        task_name=self.task_name,  # type: ignore
    )
Member

Suggested change
elif (
    hasattr(self.retriever.model.model, "mteb_model_meta")
    and self.retriever.model.model.mteb_model_meta.name == "bm25s"
):
    return self.retriever.model.model.search(
        corpus,
        queries,
        self.top_k,
        score_function="bm25",
        task_name=self.task_name,  # type: ignore
    )

logger = logging.getLogger(__name__)


def corpus_to_str(
Member

Do you need this?

if instructions:
    queries = [f"{query} {instructions[query]}".strip() for query in queries]
if isinstance(queries[0], list):  # type: ignore
    query_embeddings = self.encode_conversations(
Member

Do you need this?

Comment on lines +41 to +43
id_column_name: str = '_id',
audio_column_name: str = 'audio',
text_column_name: str = 'text'
Member

Can you find more datasets that you want to add as retrieval tasks? After that we can create a better loader. MIEB and MTEB retrieval tasks have different configurations for corpus and queries with qrels. I don't think we should change this format.
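
For context, a rough sketch of the corpus/queries/qrels split referred to here, assuming a flat caption dataset whose rows carry the `_id`, `audio`, and `text` columns from the snippet above. The helper `rows_to_retrieval_splits` is hypothetical, and the exact layout that MTEB or MIEB loaders expect is not shown in this thread; this only illustrates the general idea of audio clips as queries, captions as the corpus, and qrels linking the two.

```python
def rows_to_retrieval_splits(
    rows,
    id_column_name="_id",
    audio_column_name="audio",
    text_column_name="text",
):
    """Turn flat (audio, caption) rows into queries, corpus, and qrels.

    Hypothetical helper for illustration only, not an mteb API.
    Audio clips become queries, captions become corpus documents, and
    qrels maps each query id to its relevant document ids with score 1.
    """
    queries, corpus, qrels = {}, {}, {}
    text_to_doc_id = {}  # deduplicate identical captions

    for row in rows:
        query_id = f"q-{row[id_column_name]}"
        queries[query_id] = row[audio_column_name]

        caption = row[text_column_name]
        doc_id = text_to_doc_id.get(caption)
        if doc_id is None:
            doc_id = f"d-{len(corpus)}"
            text_to_doc_id[caption] = doc_id
            corpus[doc_id] = caption

        qrels.setdefault(query_id, {})[doc_id] = 1

    return queries, corpus, qrels
```

Keeping the three pieces separate, rather than one flat table, lets several clips share one caption and makes it possible to add non-relevant captions to the corpus later.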

@KennethEnevoldsen
Contributor

@switchpiggy, it seems like this PR might have gotten stale. Any plans to finish it up?


Successfully merging this pull request may close these issues.

Create audio-text retrieval AbsTask and Evaluator
5 participants