
Conversation

@yc-li20 (Contributor) commented Nov 10, 2023

add
wav2vec 2.0 [base, large];
hubert [base, large];
wavlm [base, baseplus, large];
whisper [base]
to utilize fadtk for speech analysis

@yc-li20 (Contributor, Author) commented Nov 10, 2023

@microsoft-github-policy-service agree

@hykilpikonna (Collaborator) commented

Thank you so much for adding support for more embedding models! I will review the code changes in a moment.

Comment on lines 467 to 535
class W2V2baseModel(ModelLoader):
    """
    W2V2base model from https://huggingface.co/facebook/wav2vec2-base-960h
    Please specify the layer to use (1-12).
    """
    def __init__(self, size='960h', layer=12, limit_minutes=6):
        super().__init__("w2v2base" + ("" if layer == 12 else f"-{layer}"), 768, 16000)
        self.huggingface_id = f"facebook/wav2vec2-base-{size}"
        self.layer = layer
        self.limit = limit_minutes * 60 * self.sr

    def load_model(self):
        from transformers import AutoProcessor
        from transformers import Wav2Vec2Model

        self.model = Wav2Vec2Model.from_pretrained(self.huggingface_id)
        self.processor = AutoProcessor.from_pretrained(self.huggingface_id)
        self.model.to(self.device)

    def _get_embedding(self, audio: np.ndarray) -> np.ndarray:
        # Truncate audio that exceeds the configured limit (limit_minutes, default 6)
        if audio.shape[0] > self.limit:
            log.warning(f"Audio is too long ({audio.shape[0] / self.sr / 60:.2f} minutes > {self.limit / self.sr / 60:.2f} minutes). Truncating.")
            audio = audio[:self.limit]

        inputs = self.processor(audio, sampling_rate=self.sr, return_tensors="pt").to(self.device)
        with torch.no_grad():
            out = self.model(**inputs, output_hidden_states=True)
            out = torch.stack(out.hidden_states).squeeze()  # [13 layers, timeframes, 768]
            out = out[self.layer]  # [timeframes, 768]

        return out


class W2V2largeModel(ModelLoader):
    """
    W2V2large model from https://huggingface.co/facebook/wav2vec2-large-960h
    Please specify the layer to use (1-24).
    """
    def __init__(self, size='960h', layer=24, limit_minutes=6):
        super().__init__("w2v2large" + ("" if layer == 24 else f"-{layer}"), 1024, 16000)
        self.huggingface_id = f"facebook/wav2vec2-large-{size}"
        self.layer = layer
        self.limit = limit_minutes * 60 * self.sr

    def load_model(self):
        from transformers import AutoProcessor
        from transformers import Wav2Vec2Model

        self.model = Wav2Vec2Model.from_pretrained(self.huggingface_id)
        self.processor = AutoProcessor.from_pretrained(self.huggingface_id)
        self.model.to(self.device)

    def _get_embedding(self, audio: np.ndarray) -> np.ndarray:
        # Truncate audio that exceeds the configured limit (limit_minutes, default 6)
        if audio.shape[0] > self.limit:
            log.warning(f"Audio is too long ({audio.shape[0] / self.sr / 60:.2f} minutes > {self.limit / self.sr / 60:.2f} minutes). Truncating.")
            audio = audio[:self.limit]

        inputs = self.processor(audio, sampling_rate=self.sr, return_tensors="pt").to(self.device)
        with torch.no_grad():
            out = self.model(**inputs, output_hidden_states=True)
            out = torch.stack(out.hidden_states).squeeze()  # [25 layers, timeframes, 1024]
            out = out[self.layer]  # [timeframes, 1024]

        return out

@hykilpikonna (Collaborator) commented Nov 11, 2023

I see that the code for W2V2 base and W2V2 large is mostly identical. Would it be better to reuse the duplicated parts through an abstraction, e.g. by defining and extending a base class for each model family that contains the shared functions?

This also applies to the base and large variants of HuBERT and WavLM.
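For illustration, one possible shape of that abstraction. This is only a sketch, not the PR's final code; it assumes the same module context as the snippet above (fadtk's ModelLoader, log, np, and torch).

# Sketch only: assumes fadtk's ModelLoader, log, np, torch as in the snippet above.
class W2V2ModelBase(ModelLoader):
    """Shared wav2vec 2.0 logic; subclasses only choose the checkpoint, width, and default layer."""
    def __init__(self, name: str, huggingface_id: str, num_features: int, layer: int, limit_minutes: int = 6):
        super().__init__(name, num_features, 16000)
        self.huggingface_id = huggingface_id
        self.layer = layer
        self.limit = limit_minutes * 60 * self.sr

    def load_model(self):
        from transformers import AutoProcessor, Wav2Vec2Model
        self.model = Wav2Vec2Model.from_pretrained(self.huggingface_id)
        self.processor = AutoProcessor.from_pretrained(self.huggingface_id)
        self.model.to(self.device)

    def _get_embedding(self, audio: np.ndarray) -> np.ndarray:
        # Truncate audio that exceeds the configured limit
        if audio.shape[0] > self.limit:
            log.warning("Audio is too long. Truncating.")
            audio = audio[:self.limit]
        inputs = self.processor(audio, sampling_rate=self.sr, return_tensors="pt").to(self.device)
        with torch.no_grad():
            out = self.model(**inputs, output_hidden_states=True)
            out = torch.stack(out.hidden_states).squeeze()  # [num_layers + 1, timeframes, dim]
            return out[self.layer]                          # [timeframes, dim]


class W2V2baseModel(W2V2ModelBase):
    def __init__(self, size='960h', layer=12, limit_minutes=6):
        super().__init__("w2v2base" + ("" if layer == 12 else f"-{layer}"),
                         f"facebook/wav2vec2-base-{size}", 768, layer, limit_minutes)


class W2V2largeModel(W2V2ModelBase):
    def __init__(self, size='960h', layer=24, limit_minutes=6):
        super().__init__("w2v2large" + ("" if layer == 24 else f"-{layer}"),
                         f"facebook/wav2vec2-large-{size}", 1024, layer, limit_minutes)

The same pattern would apply to the HuBERT and WavLM families, which differ only in the transformers model class (HubertModel, WavLMModel) and the checkpoint names.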

@yc-li20 (Contributor, Author) commented

Updated!
I also integrated more Whisper model types. Now the most widely used speech models, i.e. w2v2, hubert, wavlm, and whisper, have been included, which should be sufficient for the majority of speech tasks.

reusing duplicate parts of w2v2, hubert, and wavlm
integrating more model types of whisper
These four models are the most widely used ones in the speech area.
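As a rough illustration of how the Whisper variants might be parameterized: the sketch below is an assumption about the approach (keying the checkpoint and encoder width off a size string), not the merged code, and uses only the standard transformers Whisper API.

import numpy as np
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

# Encoder widths of the public OpenAI Whisper checkpoints.
WHISPER_DIMS = {'tiny': 384, 'base': 512, 'small': 768, 'medium': 1024, 'large-v2': 1280}

def whisper_encoder_embedding(audio: np.ndarray, size: str = 'base') -> torch.Tensor:
    """Return per-frame encoder features for a 16 kHz mono waveform (hypothetical helper)."""
    model_id = f"openai/whisper-{size}"
    feature_extractor = WhisperFeatureExtractor.from_pretrained(model_id)
    model = WhisperModel.from_pretrained(model_id)
    # Log-mel features; Whisper pads/truncates to 30 s windows internally.
    features = feature_extractor(audio, sampling_rate=16000, return_tensors="pt").input_features
    with torch.no_grad():
        out = model.encoder(features)
    return out.last_hidden_state.squeeze(0)  # [frames, WHISPER_DIMS[size]]

For example, whisper_encoder_embedding(np.zeros(16000, dtype=np.float32), size='base') would return a [1500, 512] tensor.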
@hykilpikonna hykilpikonna merged commit fcced8b into microsoft:main Nov 20, 2023
@hykilpikonna (Collaborator) commented

Thanks! I just merged the pull request. By the way, are there any extra dependencies required for these models? From what I see in the code changes, the only conditional import you used is transformers, which is already included in our dependencies. Is that correct?

@yc-li20 (Contributor, Author) commented Nov 20, 2023

Yep. No other dependencies required. Thanks for merging!

@yc-li20 deleted the yc-li20-patch-1 branch on November 22, 2023 at 14:39
christhetree pushed a commit to christhetree/fadtk that referenced this pull request Mar 24, 2025