Audio2Text Retrieval AbsTask and Evaluator + Audiocaps Retrieval Dataset #2684
Pull Request Overview
This pull request introduces a new AudioCaps Retrieval task for Audio2Text retrieval and integrates it into the task registry while also removing an obsolete text file.
- Added a new task implementation in AudioCapsRetrieval with associated metadata.
- Updated module imports to expose the new task.
- Removed the legacy "the_ugly_duckling.txt" file.
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
| --- | --- |
| mteb/tasks/__init__.py | Added import for the Audio2TextRetrieval tasks. |
| mteb/tasks/Audio/Audio2TextRetrieval/eng/AudioCapsRetrieval.py | New task implementation defining AudioCaps Retrieval metadata. |
| mteb/tasks/Audio/Audio2TextRetrieval/__init__.py | Exposed AudioCapsRetrieval via module import. |
| mteb/abstasks/the_ugly_duckling.txt | Removed legacy file. |
Good start!
dataset={
    "path": "TwinkStart/AudioCaps",
    "revision": "8fc8b151149af779517aedfbf8c536160822bd70",
    "trust_remote_code": True,
Suggested change:
    "trust_remote_code": True,
description="Measuring the ability to retrieve the groundtruth answers to reasoning task queries on ARC-Challenge.",
reference="https://allenai.org/data/arc",
dataset={
    "path": "TwinkStart/AudioCaps",
Do we want this dataset? It weighs 153 GB.
We should probably downsample + negative mine this to a reasonable size.
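A minimal sketch of what "downsample + negative mine" could look like, assuming the data is already available as plain `queries`, `corpus`, and `qrels` mappings. The function name, its parameters, and the pluggable `score_fn` are hypothetical and not part of mteb; for an audio-to-text task the scorer would presumably come from an embedding model rather than the toy function used in the usage note.

```python
import random


def downsample_with_hard_negatives(
    queries, corpus, qrels, score_fn, n_queries, n_negatives, seed=42
):
    """Keep a random subset of queries, their relevant documents, and the
    highest-scoring non-relevant documents per query (hard negatives).

    queries:  dict query_id -> query content
    corpus:   dict doc_id -> document content
    qrels:    dict query_id -> dict doc_id -> relevance score
    score_fn: callable(query_content, doc_content) -> float (assumed scorer)
    """
    rng = random.Random(seed)
    # Sort ids first so the sample is reproducible for a given seed.
    kept_qids = rng.sample(sorted(queries), min(n_queries, len(queries)))

    kept_doc_ids = set()
    for qid in kept_qids:
        positives = set(qrels.get(qid, {}))
        kept_doc_ids |= positives
        # Rank all non-relevant documents by similarity to the query and
        # keep the top ones as hard negatives.
        candidates = [doc_id for doc_id in corpus if doc_id not in positives]
        candidates.sort(
            key=lambda doc_id: score_fn(queries[qid], corpus[doc_id]),
            reverse=True,
        )
        kept_doc_ids.update(candidates[:n_negatives])

    small_queries = {qid: queries[qid] for qid in kept_qids}
    small_corpus = {d: corpus[d] for d in sorted(kept_doc_ids) if d in corpus}
    small_qrels = {qid: dict(qrels[qid]) for qid in kept_qids if qid in qrels}
    return small_queries, small_corpus, small_qrels
```

Calling it with, say, `n_queries=1000` and `n_negatives=50` would bound the corpus at roughly 1000 positives plus at most 50,000 negatives, which is the kind of size reduction the comment above asks for. The actual numbers and scorer are a judgment call for the maintainers.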
elif (
    hasattr(self.retriever.model.model, "mteb_model_meta")
    and self.retriever.model.model.mteb_model_meta.name == "bm25s"
):
    return self.retriever.model.model.search(
        corpus,
        queries,
        self.top_k,
        score_function="bm25",
        task_name=self.task_name,  # type: ignore
    )
Suggested change:
elif (
    hasattr(self.retriever.model.model, "mteb_model_meta")
    and self.retriever.model.model.mteb_model_meta.name == "bm25s"
):
    return self.retriever.model.model.search(
        corpus,
        queries,
        self.top_k,
        score_function="bm25",
        task_name=self.task_name,  # type: ignore
    )
logger = logging.getLogger(__name__)


def corpus_to_str(
Do you need this?
if instructions:
    queries = [f"{query} {instructions[query]}".strip() for query in queries]
if isinstance(queries[0], list):  # type: ignore
    query_embeddings = self.encode_conversations(
Do you need this?
id_column_name: str = '_id',
audio_column_name: str = 'audio',
text_column_name: str = 'text'
Can you find more datasets you want to add as retrieval tasks? After that we can create a better loader. MIEB and MTEB retrieval tasks have different configurations for corpus and queries with qrels. I don't think we should change this format.
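For readers unfamiliar with the three pieces mentioned above, here is an illustrative sketch of a retrieval split as separate corpus, queries, and qrels mappings, with a small consistency check. The ids, file paths, captions, and the `check_retrieval_split` helper are hypothetical, and the exact schema mteb expects for each configuration may differ from this toy layout.

```python
# Toy audio-to-text retrieval split (all values are made up for illustration).
# Queries are audio clips, the corpus holds candidate captions, and qrels
# maps each query id to the corpus ids that count as relevant.
queries = {
    "q1": {"audio": "clips/q1.wav"},
    "q2": {"audio": "clips/q2.wav"},
}
corpus = {
    "c1": {"text": "A dog barks while cars pass by."},
    "c2": {"text": "Rain falls on a metal roof."},
    "c3": {"text": "A crowd applauds in a large hall."},
}
qrels = {
    "q1": {"c1": 1},
    "q2": {"c2": 1},
}


def check_retrieval_split(corpus, queries, qrels):
    """Return a list of human-readable problems; an empty list means the
    three mappings are mutually consistent."""
    problems = []
    for qid, judged in qrels.items():
        if qid not in queries:
            problems.append(f"qrels references unknown query id {qid!r}")
        for doc_id in judged:
            if doc_id not in corpus:
                problems.append(
                    f"qrels for {qid!r} references unknown corpus id {doc_id!r}"
                )
    for qid in queries:
        if not qrels.get(qid):
            problems.append(f"query {qid!r} has no relevance judgments")
    return problems
```

A check of this kind can be run once per candidate dataset before deciding on a shared loader, since it only depends on the ids and not on the modality of the content.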
@switchpiggy, it seems like this PR might have gotten stale. Any plans to finish it up?
Resolves #2068
Code Quality
- Run make lint to maintain consistent style.
Documentation
Testing
- make test-with-coverage.
- Run make test or make test-with-coverage to ensure no existing functionality is broken.