-
Notifications
You must be signed in to change notification settings - Fork 115
feat(audio): audio file and fragment with streaming #1200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer's GuideImplements end-to-end streaming support for audio files by extending the core File model with AudioFile and related DataModels, integrating torchaudio-based utilities, and ensuring seamless streaming within UDF pipelines. Entity relationship diagram for audio data modelserDiagram
AUDIOFILE ||--o{ AUDIOSEGMENT : contains
AUDIOFILE ||--o{ AUDIOFRAGMENT : contains
AUDIOFILE }|--|| AUDIO : has
AUDIOSEGMENT }|--|| AUDIOFILE : references
AUDIOFRAGMENT }|--|| AUDIOFILE : references
AUDIO {
int sample_rate
int channels
float duration
int samples
string format
string codec
int bit_rate
}
Class diagram for new and updated audio-related typesclassDiagram
class File {
+as_audio_file() AudioFile
}
class AudioFile {
+get_info() Audio
+get_segment(start: float, duration: float) AudioSegment
+get_fragment(start: float, end: float) AudioFragment
+get_fragments(duration: float, start: float, end: float) Iterator~AudioFragment~
}
class AudioSegment {
+audio: AudioFile
+start: float
+end: float
+get_np() tuple[ndarray, int]
+read_bytes(format: str) bytes
}
class AudioFragment {
+audio: AudioFile
+start: float
+end: float
+get_np() tuple[ndarray, int]
+read_bytes(format: str) bytes
+save(output: str, format: str) AudioFile
}
class Audio {
+sample_rate: int
+channels: int
+duration: float
+samples: int
+format: str
+codec: str
+bit_rate: int
}
File <|-- AudioFile
AudioFile o-- AudioSegment : contains
AudioFile o-- AudioFragment : contains
AudioSegment o-- AudioFile : references
AudioFragment o-- AudioFile : references
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Deploying datachain-documentation with
|
Latest commit: |
2ad01e2
|
Status: | ✅ Deploy successful! |
Preview URL: | https://54a0aae4.datachain-documentation.pages.dev |
Branch Preview URL: | https://audio-fragments-decoder.datachain-documentation.pages.dev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @shcheklein - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `src/datachain/lib/file.py:1058` </location>
<code_context>
+ duration = self.end - self.start
+ return audio_segment_bytes(self.audio, self.start, duration, format)
+
+ def save(self, output: str, format: Optional[str] = None) -> "AudioFile":
+ """
+ Saves the audio fragment as a new audio file.
</code_context>
<issue_to_address>
AudioFragment.save may overwrite files without warning.
Consider adding a check or warning if the output file already exists to prevent accidental overwrites.
</issue_to_address>
### Comment 2
<location> `src/datachain/lib/audio.py:152` </location>
<code_context>
+ import soundfile as sf
+
+ buffer = io.BytesIO()
+ sf.write(buffer, y, sr, format=format)
+ return buffer.getvalue()
+
</code_context>
<issue_to_address>
audio_segment_bytes assumes soundfile supports all requested formats.
Validate the format argument before calling sf.write, or catch exceptions to provide a clearer error message if the format is unsupported.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
import soundfile as sf
buffer = io.BytesIO()
sf.write(buffer, y, sr, format=format)
return buffer.getvalue()
=======
import soundfile as sf
buffer = io.BytesIO()
# Validate format before writing
supported_formats = set(sf.available_formats().keys())
if format.upper() not in supported_formats:
raise ValueError(f"Unsupported audio format '{format}'. Supported formats: {', '.join(sorted(supported_formats))}")
try:
sf.write(buffer, y, sr, format=format)
except RuntimeError as e:
raise ValueError(f"Failed to write audio with format '{format}': {e}")
return buffer.getvalue()
>>>>>>> REPLACE
</suggested_fix>
### Comment 3
<location> `src/datachain/lib/audio.py:182` </location>
<code_context>
+ if start < 0 or end < 0 or start >= end:
+ raise ValueError(f"Invalid time range: ({start:.3f}, {end:.3f})")
+
+ if format is None:
+ format = audio.get_file_ext()
+
</code_context>
<issue_to_address>
save_audio_fragment infers format from file extension, which may be unreliable.
Consider adding validation or fallback logic to handle cases where the file extension is missing, non-standard, or does not match the actual audio format.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
if format is None:
format = audio.get_file_ext()
=======
if format is None:
format = audio.get_file_ext()
# Validate the inferred format
valid_formats = {"wav", "mp3", "flac", "ogg", "aac", "m4a"}
if not format or format.lower() not in valid_formats:
# Fallback to default format if extension is missing or non-standard
import warnings
warnings.warn(
f"Could not reliably infer audio format from file extension '{format}'. "
"Falling back to default format 'wav'."
)
format = "wav"
>>>>>>> REPLACE
</suggested_fix>
### Comment 4
<location> `src/datachain/lib/udf.py:277` </location>
<code_context>
+ self._set_stream_recursive(obj, catalog, cache, download_cb)
return obj_row
+ def _set_stream_recursive(
+ self, obj: Any, catalog: "Catalog", cache: bool, download_cb: Callback
+ ) -> None:
+ """Recursively set the catalog stream on all File objects within an object."""
+ if isinstance(obj, File):
+ obj._set_stream(catalog, caching_enabled=cache, download_cb=download_cb)
+
+ # Check all fields for nested File objects, but only for DataModel objects
+ if isinstance(obj, DataModel):
+ for field_name in obj.model_fields:
+ field_value = getattr(obj, field_name, None)
+ if isinstance(field_value, DataModel):
+ self._set_stream_recursive(field_value, catalog, cache, download_cb)
+
</code_context>
<issue_to_address>
The recursive stream setter does not handle lists or dicts of DataModels.
Please update the recursion logic to handle lists and dictionaries containing DataModels or Files, so all nested File objects are initialized correctly.
</issue_to_address>
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
duration = self.end - self.start | ||
return audio_segment_bytes(self.audio, self.start, duration, format) | ||
|
||
def save(self, output: str, format: Optional[str] = None) -> "AudioFile": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (bug_risk): AudioFragment.save may overwrite files without warning.
Consider adding a check or warning if the output file already exists to prevent accidental overwrites.
import soundfile as sf | ||
|
||
buffer = io.BytesIO() | ||
sf.write(buffer, y, sr, format=format) | ||
return buffer.getvalue() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (bug_risk): audio_segment_bytes assumes soundfile supports all requested formats.
Validate the format argument before calling sf.write, or catch exceptions to provide a clearer error message if the format is unsupported.
import soundfile as sf | |
buffer = io.BytesIO() | |
sf.write(buffer, y, sr, format=format) | |
return buffer.getvalue() | |
import soundfile as sf | |
buffer = io.BytesIO() | |
# Validate format before writing | |
supported_formats = set(sf.available_formats().keys()) | |
if format.upper() not in supported_formats: | |
raise ValueError(f"Unsupported audio format '{format}'. Supported formats: {', '.join(sorted(supported_formats))}") | |
try: | |
sf.write(buffer, y, sr, format=format) | |
except RuntimeError as e: | |
raise ValueError(f"Failed to write audio with format '{format}': {e}") | |
return buffer.getvalue() |
if format is None: | ||
format = audio.get_file_ext() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (bug_risk): save_audio_fragment infers format from file extension, which may be unreliable.
Consider adding validation or fallback logic to handle cases where the file extension is missing, non-standard, or does not match the actual audio format.
if format is None: | |
format = audio.get_file_ext() | |
if format is None: | |
format = audio.get_file_ext() | |
# Validate the inferred format | |
valid_formats = {"wav", "mp3", "flac", "ogg", "aac", "m4a"} | |
if not format or format.lower() not in valid_formats: | |
# Fallback to default format if extension is missing or non-standard | |
import warnings | |
warnings.warn( | |
f"Could not reliably infer audio format from file extension '{format}'. " | |
"Falling back to default format 'wav'." | |
) | |
format = "wav" |
def _set_stream_recursive( | ||
self, obj: Any, catalog: "Catalog", cache: bool, download_cb: Callback | ||
) -> None: | ||
"""Recursively set the catalog stream on all File objects within an object.""" | ||
if isinstance(obj, File): | ||
obj._set_stream(catalog, caching_enabled=cache, download_cb=download_cb) | ||
|
||
# Check all fields for nested File objects, but only for DataModel objects | ||
if isinstance(obj, DataModel): | ||
for field_name in obj.model_fields: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: The recursive stream setter does not handle lists or dicts of DataModels.
Please update the recursion logic to handle lists and dictionaries containing DataModels or Files, so all nested File objects are initialized correctly.
|
||
try: | ||
with audio.open() as f: | ||
info = torchaudio.info(f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): We've found these issues:
- Extract code out into function (
extract-method
) - Replace if statement with if expression (
assign-if-exp
)
raise ValueError(f"Invalid time range: ({start:.3f}, {end:.3f})") | ||
|
||
if format is None: | ||
format = audio.get_file_ext() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): Don't assign to builtin variable format
(avoid-builtin-shadow
)
Explanation
Python has a number ofbuiltin
variables: functions and constants thatform a part of the language, such as
list
, getattr
, and type
(See https://docs.python.org/3/library/functions.html).
It is valid, in the language, to re-bind such variables:
list = [1, 2, 3]
However, this is considered poor practice.
- It will confuse other developers.
- It will confuse syntax highlighters and linters.
- It means you can no longer use that builtin for its original purpose.
How can you solve this?
Rename the variable something more specific, such as integers
.
In a pinch, my_list
and similar names are colloquially-recognized
placeholders.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1200 +/- ##
==========================================
- Coverage 88.71% 87.97% -0.75%
==========================================
Files 152 153 +1
Lines 13557 13704 +147
Branches 1884 1904 +20
==========================================
+ Hits 12027 12056 +29
- Misses 1088 1203 +115
- Partials 442 445 +3
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Add Audio and AudioFragment support, to make code like below work with full streaming (no file download happening):
TODO: