feat(audio): audio file and fragment with streaming #1200

Open: wants to merge 1 commit into base: main
5 changes: 4 additions & 1 deletion pyproject.toml
@@ -72,6 +72,9 @@ torch = [
   "torchvision",
   "transformers>=4.36.0"
 ]
+audio = [
+  "torchaudio"
+]
 remote = [
   "lz4",
   "requests>=2.22.0"
@@ -90,7 +93,7 @@ video = [
   "opencv-python"
 ]
 tests = [
-  "datachain[torch,remote,vector,hf,video]",
+  "datachain[torch,audio,remote,vector,hf,video]",
   "pytest>=8,<9",
   "pytest-sugar>=0.9.6",
   "pytest-cov>=4.1.0",
6 changes: 6 additions & 0 deletions src/datachain/__init__.py
@@ -20,6 +20,9 @@
 )
 from datachain.lib.file import (
     ArrowRow,
+    Audio,
+    AudioFile,
+    AudioFragment,
     File,
     FileError,
     Image,
@@ -42,6 +45,9 @@
     "AbstractUDF",
     "Aggregator",
     "ArrowRow",
+    "Audio",
+    "AudioFile",
+    "AudioFragment",
     "C",
     "Column",
     "DataChain",
202 changes: 202 additions & 0 deletions src/datachain/lib/audio.py
@@ -0,0 +1,202 @@
Codecov / codecov/patch: added lines #L1, #L3-L4, #L6 and #L13-L16 were not covered by tests.

from __future__ import annotations

import posixpath
from typing import TYPE_CHECKING

from datachain.lib.file import FileError

if TYPE_CHECKING:
    from numpy import ndarray

    from datachain.lib.file import Audio, AudioFile, File

try:
    import torchaudio
except ImportError as exc:
    raise ImportError(
        "Missing dependencies for processing audio.\n"
        "To install run:\n\n"
        "  pip install 'datachain[audio]'\n"
    ) from exc
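
The guarded import of torchaudio turns a missing optional dependency into an actionable install hint. A minimal, self-contained sketch of the same pattern is given here. `require_extra` is a hypothetical helper, not part of datachain, and it probes for the module with `importlib.util.find_spec` instead of importing it.

```python
from importlib.util import find_spec


def require_extra(module: str, extra: str, package: str = "datachain") -> None:
    """Raise ImportError with an install hint if `module` is not importable.

    Hypothetical helper for illustration only; not part of datachain.
    """
    if find_spec(module) is None:
        raise ImportError(
            f"Missing dependencies for processing {extra}.\n"
            "To install run:\n\n"
            f"  pip install '{package}[{extra}]'\n"
        )


# A standard-library module is always present, so this passes silently.
require_extra("json", "audio")

# A module that does not exist triggers the install hint.
try:
    require_extra("surely_not_installed_module_xyz", "audio")
except ImportError as exc:
    hint = str(exc)
```

Raising at import time, as the PR does, keeps the failure close to the cause; a probe like this one could instead defer the error until an audio function is first called.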


Codecov / codecov/patch: added lines #L23, #L34, #L36, #L38-L40, #L42-L45, #L48-L50, #L52-L53 and #L57 were not covered by tests.

def audio_info(file: File | AudioFile) -> Audio:
    """
    Returns audio file information.

    Args:
        file (AudioFile): Audio file object.

    Returns:
        Audio: Audio file information.
    """
    # Import here to avoid circular imports
    from datachain.lib.file import Audio

    file = file.as_audio_file()

    try:
        with file.open() as f:
            info = torchaudio.info(f)

            sample_rate = int(info.sample_rate)
            channels = int(info.num_channels)
            frames = int(info.num_frames)
            duration = float(frames / sample_rate) if sample_rate > 0 else 0.0

            # Get format information
            format_name = getattr(info, "format", "")
            codec_name = getattr(info, "encoding", "")
            bit_rate = getattr(info, "bits_per_sample", 0) * sample_rate * channels

    except Exception as exc:
        raise FileError(
            "unable to extract metadata from audio file", file.source, file.path
        ) from exc

    return Audio(
        sample_rate=sample_rate,
        channels=channels,
        duration=duration,
        samples=frames,
        format=format_name,
        codec=codec_name,
        bit_rate=bit_rate,
    )
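
The two derived fields in `audio_info` are plain arithmetic on the stream metadata: duration is frames divided by sample rate, and bit rate is bits per sample times sample rate times channels. A small sketch of that arithmetic, assuming a hypothetical `StreamInfo` stand-in for the object returned by `torchaudio.info` (only the field names used in the PR are mirrored):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class StreamInfo:
    """Hypothetical stand-in for the fields read from torchaudio.info()."""

    sample_rate: int
    num_channels: int
    num_frames: int
    bits_per_sample: int = 0


def derive_duration_and_bit_rate(info: StreamInfo) -> tuple[float, int]:
    # Duration in seconds; guard against a zero sample rate as audio_info does.
    duration = info.num_frames / info.sample_rate if info.sample_rate > 0 else 0.0
    # Uncompressed (PCM-style) bit rate in bits per second.
    bit_rate = info.bits_per_sample * info.sample_rate * info.num_channels
    return duration, bit_rate


# Two seconds of 16-bit stereo audio at 44.1 kHz.
duration, bit_rate = derive_duration_and_bit_rate(
    StreamInfo(sample_rate=44100, num_channels=2, num_frames=88200, bits_per_sample=16)
)
```

Because the product uses `bits_per_sample`, a stream that reports 0 bits per sample (which may be the case for some compressed formats) would yield a bit rate of 0 under this formula.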


Codecov / codecov/patch: added lines #L68, #L83, #L86 and #L90 were not covered by tests.

def audio_segment_np(
    audio: AudioFile, start: float = 0, duration: float | None = None
) -> tuple[ndarray, int]:
    """
    Reads audio segment from a file and returns as numpy array.

    Args:
        audio (AudioFile): Audio file object.
        start (float): Start time in seconds (default: 0).
        duration (float, optional): Duration in seconds. If None, reads to end.

    Returns:
        tuple[ndarray, int]: Audio data and sample rate.
    """
    if start < 0:
        raise ValueError("start must be a non-negative float")

    if duration is not None and duration <= 0:
        raise ValueError("duration must be a positive float")

    # Ensure we have an AudioFile instance
    if hasattr(audio, "as_audio_file"):
        audio = audio.as_audio_file()

    try:
        with audio.open() as f:
            info = torchaudio.info(f)
Review comment (Contributor):

issue (code-quality): We've found these issues:

            sample_rate = info.sample_rate

            # Calculate frame offset and number of frames
            frame_offset = int(start * sample_rate)
            num_frames = int(duration * sample_rate) if duration is not None else -1

            # Reset position before loading (critical for FileWrapper)
            f.seek(0)

            # Load the audio segment
            waveform, sr = torchaudio.load(
                f, frame_offset=frame_offset, num_frames=num_frames
            )

            audio_np = waveform.numpy()

            # If stereo, take the mean across channels or return multi-channel
            if audio_np.shape[0] > 1:
                # For compatibility, we can either return multi-channel or mono
                # Here returning multi-channel as (samples, channels)
                audio_np = audio_np.T
            else:
                # Mono: shape (samples,)
                audio_np = audio_np.squeeze()

            return audio_np, int(sr)
    except Exception as exc:
        raise FileError(
            "unable to read audio segment", audio.source, audio.path
        ) from exc

Codecov / codecov/patch: added lines #L92-L95, #L98-L99, #L102, #L105, #L109, #L115, #L118 and #L120-L122 were not covered by tests.
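
The conversion from seconds to frames in `audio_segment_np` can be exercised without torchaudio. The sketch below mirrors the validation and arithmetic from the function; `seconds_to_frames` is a hypothetical name introduced only for illustration.

```python
from __future__ import annotations


def seconds_to_frames(
    start: float, duration: float | None, sample_rate: int
) -> tuple[int, int]:
    """Mirror the validation and frame arithmetic of audio_segment_np."""
    if start < 0:
        raise ValueError("start must be a non-negative float")
    if duration is not None and duration <= 0:
        raise ValueError("duration must be a positive float")

    frame_offset = int(start * sample_rate)
    # The PR passes -1 to torchaudio.load when no duration is given,
    # meaning "read to the end".
    num_frames = int(duration * sample_rate) if duration is not None else -1
    return frame_offset, num_frames


window = seconds_to_frames(1.5, 0.5, 16000)
to_end = seconds_to_frames(1.5, None, 16000)
```

Note that `int()` truncates toward zero, so a product that lands just below a whole number because of floating-point error loses one frame; `round()` would be one alternative if frame-exact boundaries matter.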


Codecov / codecov/patch: added lines #L127, #L145, #L147, #L149 and #L151-L153 were not covered by tests.

def audio_segment_bytes(
    audio: AudioFile,
    start: float = 0,
    duration: float | None = None,
    format: str = "wav",
) -> bytes:
    """
    Reads audio segment from a file and returns as audio bytes.

    Args:
        audio (AudioFile): Audio file object.
        start (float): Start time in seconds (default: 0).
        duration (float, optional): Duration in seconds. If None, reads to end.
        format (str): Audio format (default: 'wav').

    Returns:
        bytes: Audio segment as bytes.
    """
    y, sr = audio_segment_np(audio, start, duration)

    import io

    import soundfile as sf

    buffer = io.BytesIO()
    sf.write(buffer, y, sr, format=format)
    return buffer.getvalue()
Comment on lines +149 to +153
Review comment (Contributor):

suggestion (bug_risk): audio_segment_bytes assumes soundfile supports all requested formats.

Validate the format argument before calling sf.write, or catch exceptions to provide a clearer error message if the format is unsupported.

Suggested change
-    import soundfile as sf
-    buffer = io.BytesIO()
-    sf.write(buffer, y, sr, format=format)
-    return buffer.getvalue()
+    import soundfile as sf
+    buffer = io.BytesIO()
+    # Validate format before writing
+    supported_formats = set(sf.available_formats().keys())
+    if format.upper() not in supported_formats:
+        raise ValueError(f"Unsupported audio format '{format}'. Supported formats: {', '.join(sorted(supported_formats))}")
+    try:
+        sf.write(buffer, y, sr, format=format)
+    except RuntimeError as e:
+        raise ValueError(f"Failed to write audio with format '{format}': {e}")
+    return buffer.getvalue()
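
One detail of the suggested change: it re-raises inside `except` without `from`, whereas the rest of `audio.py` uses `raise ... from exc`. A minimal sketch of explicit chaining follows; `write_audio` is a hypothetical stand-in for an encoder that rejects unknown formats, not the soundfile API.

```python
def write_audio(fmt: str) -> bytes:
    """Hypothetical stand-in for an encoder that rejects unknown formats."""
    if fmt.lower() != "wav":
        raise RuntimeError(f"format not recognised: {fmt}")
    return b"RIFF"


def encode(fmt: str) -> bytes:
    try:
        return write_audio(fmt)
    except RuntimeError as exc:
        # `from exc` records the low-level failure as __cause__ of the new error.
        raise ValueError(f"Failed to write audio with format '{fmt}'") from exc


try:
    encode("xyz")
except ValueError as err:
    caught = err
```

With explicit chaining the traceback shows the original encoder error as the direct cause, which should make unsupported-format reports easier to diagnose.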



Codecov / codecov/patch: added lines #L156, #L180 and #L183 were not covered by tests.

def save_audio_fragment(
    audio: AudioFile,
    start: float,
    end: float,
    output: str,
    format: str | None = None,
) -> AudioFile:
    """
    Saves audio interval as a new audio file. If output is a remote path,
    the audio file will be uploaded to the remote storage.

    Args:
        audio (AudioFile): Audio file object.
        start (float): Start time in seconds.
        end (float): End time in seconds.
        output (str): Output path, can be a local path or a remote path.
        format (str, optional): Output format (default: None). If not provided,
                                the format will be inferred from the audio fragment
                                file extension.

    Returns:
        AudioFile: Audio fragment model.
    """
    if start < 0 or end < 0 or start >= end:
        raise ValueError(f"Invalid time range: ({start:.3f}, {end:.3f})")

    if format is None:
        format = audio.get_file_ext()
Comment on lines +182 to +183
Review comment (Contributor):

suggestion (bug_risk): save_audio_fragment infers format from file extension, which may be unreliable.

Consider adding validation or fallback logic to handle cases where the file extension is missing, non-standard, or does not match the actual audio format.

Suggested change
-    if format is None:
-        format = audio.get_file_ext()
+    if format is None:
+        format = audio.get_file_ext()
+        # Validate the inferred format
+        valid_formats = {"wav", "mp3", "flac", "ogg", "aac", "m4a"}
+        if not format or format.lower() not in valid_formats:
+            # Fallback to default format if extension is missing or non-standard
+            import warnings
+            warnings.warn(
+                f"Could not reliably infer audio format from file extension '{format}'. "
+                "Falling back to default format 'wav'."
+            )
+            format = "wav"
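
The fallback idea in this suggestion can be isolated into a small pure function and tested on its own. The sketch below is one possible shape, not the PR's code: `resolve_format` and `VALID_FORMATS` are hypothetical names, the allow-list is copied from the suggestion, and what the encoder actually supports may differ.

```python
from __future__ import annotations

import warnings

# Illustrative allow-list copied from the review suggestion.
VALID_FORMATS = {"wav", "mp3", "flac", "ogg", "aac", "m4a"}


def resolve_format(ext: str | None, default: str = "wav") -> str:
    """Normalise a file extension and fall back to `default` if unusable."""
    candidate = (ext or "").strip().lstrip(".").lower()
    if candidate in VALID_FORMATS:
        return candidate
    warnings.warn(
        f"Could not reliably infer audio format from file extension {ext!r}. "
        f"Falling back to default format {default!r}.",
        stacklevel=2,
    )
    return default


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved = [resolve_format(e) for e in (".FLAC", "mp3", "", None, "txt")]
```

Normalising case and a leading dot before the membership test avoids rejecting extensions such as `.FLAC` that differ only in spelling.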

Review comment (Contributor):

issue (code-quality): Don't assign to builtin variable format (avoid-builtin-shadow)

Explanation

Python has a number of builtin variables: functions and constants that form a part of the language, such as list, getattr, and type (see https://docs.python.org/3/library/functions.html). It is valid, in the language, to re-bind such variables:

    list = [1, 2, 3]

However, this is considered poor practice.

  • It will confuse other developers.
  • It will confuse syntax highlighters and linters.
  • It means you can no longer use that builtin for its original purpose.

How can you solve this?

Rename the variable something more specific, such as integers. In a pinch, my_list and similar names are colloquially-recognized placeholders.
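
Applied to this PR, the shadowed name is the `format` parameter. A short sketch of the failure mode and of a rename follows; `describe_shadowed`, `describe` and `audio_format` are hypothetical names chosen for illustration.

```python
def describe_shadowed(value: float, format: str = "wav") -> str:
    # Inside this function `format` is the string parameter, so calling it
    # like the builtin raises TypeError: 'str' object is not callable.
    return format(value, ".2f")


def describe(value: float, audio_format: str = "wav") -> str:
    # With a more specific parameter name the builtin stays usable.
    return f"{format(value, '.2f')} s as {audio_format}"


try:
    describe_shadowed(1.5)
    shadow_error = None
except TypeError as exc:
    shadow_error = exc

label = describe(1.5)
```

Renaming a keyword parameter changes the public signature, so callers passing `format=` would need updating; that trade-off is for the PR author to weigh.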


    duration = end - start
    start_ms = int(start * 1000)
    end_ms = int(end * 1000)
    output_file = posixpath.join(
        output, f"{audio.get_file_stem()}_{start_ms:06d}_{end_ms:06d}.{format}"
    )

    try:
        audio_bytes = audio_segment_bytes(audio, start, duration, format)

        from datachain.lib.file import AudioFile

        return AudioFile.upload(audio_bytes, output_file, catalog=audio._catalog)

    except Exception as exc:
        raise FileError(
            "unable to save audio fragment", audio.source, audio.path
        ) from exc

Codecov / codecov/patch: added lines #L185-L188, #L192-L193, #L195, #L197 and #L199-L200 were not covered by tests.
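
The output naming scheme in `save_audio_fragment` (file stem, then start and end in zero-padded milliseconds, then the format as extension) can be checked in isolation. In this sketch `fragment_path` is a hypothetical helper and the bucket path is made up.

```python
import posixpath


def fragment_path(output: str, stem: str, start: float, end: float, ext: str) -> str:
    """Build the fragment file name the way save_audio_fragment does."""
    if start < 0 or end < 0 or start >= end:
        raise ValueError(f"Invalid time range: ({start:.3f}, {end:.3f})")
    start_ms = int(start * 1000)
    end_ms = int(end * 1000)
    # posixpath.join keeps forward slashes, which suits remote URIs.
    return posixpath.join(output, f"{stem}_{start_ms:06d}_{end_ms:06d}.{ext}")


path = fragment_path("s3://bucket/fragments", "speech", 1.5, 3.25, "wav")
```

The `06d` specifier sets a minimum width, so offsets of 1,000,000 ms or more simply produce longer names instead of failing, though such names no longer sort lexicographically alongside the padded ones.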