Ensure you have the following installed:
- Python 3.12
- FFmpeg (version 7.1 or later)
- Mac: install FFmpeg with Homebrew:
  ```bash
  brew install ffmpeg
  ```
- Linux: follow the FFmpeg Compilation Guide for Ubuntu to build the latest version from source with full feature support.
- Create a virtual environment:
  ```bash
  python3 -m venv venv
  ```
- Activate the environment:
  ```bash
  source venv/bin/activate
  ```
- Install dependencies:
  ```bash
  pip3 install -r requirements.txt
  ```
- Update the requirements file (if new libraries are added during development):
  ```bash
  pip3 freeze > requirements.txt
  ```
- Copy the `.env.example` file:
  ```bash
  cp .env.example .env
  ```
- Add the necessary API keys and other environment-specific variables to the `.env` file (see the illustrative example below).
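For illustration only, the `.env` file might look like the sketch below; the variable names are placeholders, so use the exact keys defined in `.env.example`:

```
# Placeholder values; the real variable names are those listed in .env.example
OPENAI_API_KEY=sk-...
XL8_API_KEY=your-xl8-api-key
```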
Start the server with the following command:
```bash
uvicorn app.main:app --reload
```
Once running, access the Swagger UI for API documentation at: http://127.0.0.1:8000/docs
To process an `.m3u8` input link (e.g., http://cache1.castiscdn.com:28080/snu/live.stream/tsmux_master.m3u8):
- Open the API documentation in Swagger UI: http://127.0.0.1:8000/docs#/live_stream/process_video_endpoint_api_v1_live_process_stream__post.
- Use the `POST` endpoint to submit the `.m3u8` URL for processing (see the example request below).
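For reference, the request can also be submitted outside Swagger UI. The sketch below is illustrative only: the endpoint path and the `url` field name are assumptions inferred from the Swagger operation id, so check the generated docs and `app/schemas/live_stream.py` for the exact schema.

```python
# Illustrative client request; the path and payload field name are assumptions,
# verify them against the Swagger UI before use.
import requests

response = requests.post(
    "http://127.0.0.1:8000/api/v1/live/process-stream",  # assumed path
    json={"url": "http://cache1.castiscdn.com:28080/snu/live.stream/tsmux_master.m3u8"},
)
print(response.status_code, response.json())
```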
To view the processed `.m3u8` output stream with subtitles, use HLS.js:
- Install HLS.js:
  ```bash
  git clone https://github.com/video-dev/hls.js.git
  cd hls.js
  npm install
  npm run dev
  ```
- Open the HLS.js demo interface.
- Enter the output endpoint from your server (e.g., http://127.0.0.1:8000/api/v1/streaming/index.m3u8) into the HLS.js player to view the processed streaming video.
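Optionally, before loading the URL in the demo player, a quick check from Python (illustrative, not part of the required setup) can confirm that the output playlist is being served:

```python
# Optional sanity check: fetch the output playlist and print its first lines.
import requests

playlist_url = "http://127.0.0.1:8000/api/v1/streaming/index.m3u8"
response = requests.get(playlist_url)
print(response.status_code)
print(response.text[:300])  # should start with #EXTM3U followed by segment entries
```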
```
live-streaming-system/
├── app/
│ ├── __init__.py # Initialization file for the app module
│ ├── main.py # Entry point for the FastAPI application
│ ├── core/
│ │ ├── __init__.py
│ │ └── config.py # Configuration settings for the project
│ ├── api/
│ │ ├── __init__.py
│ │ ├── api_v1/
│ │ │ ├── __init__.py
│ │ │ ├── api.py # API routing for version 1
│ │ │ ├── endpoints/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── live_stream.py # Endpoints related to live streaming
│ ├── media/ # Generated automatically when processing the input streaming URL
│ │ ├── audio # Stores the audio segments
│ │ ├── chunks # Stores the video segments
│ │ ├── playlists # Stores playlist.m3u8 (the output file of our service)
│ │ ├── subtitles # Stores the subtitles (output of the Whisper model)
│ │ ├── translations # Stores the translations (subfolders per language code, e.g. vi, th)
│ ├── models/
│ │ ├── __init__.py # Placeholder for database models
│ ├── schemas/
│ │ ├── __init__.py
│ │ ├── live_stream.py # Pydantic schemas for live streaming
│ ├── services/
│ │ ├── __init__.py
│ │ ├── live_stream_service.py # Implementation logic for live stream service
│ │ ├── audio_service.py # Implementation logic for audio service
│ │ ├── video_service.py # Implementation logic for video service
│ │ ├── stt_service.py # Implementation logic for speech-to-text service
│ │ ├── translation_service.py # Implementation logic for translation service
│ ├── static/ # Frontend demo using the HLS.js library to consume our API endpoints as a client would
│ ├── workers/
│ │ ├── __init__.py
│ │ ├── background_tasks.py # Background task management
│ └── db/
│ ├── __init__.py
│ ├── base.py # Base model class for ORM
│ └── session.py # Database session management
├── benchmarking/ # Folder containing the data and implementation for benchmarking
│ ├── results/ # Folder containing benchmarking results
├── docs # Folder containing docs and public media
├── .env # Environment variables
├── .gitignore # Git ignore file
├── requirements.txt # Project dependencies
├── .github # GitHub configuration for CI/CD and GitHub PR/Issues Template
└── README.md # Project documentation
```
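For orientation, the sketch below shows how these pieces typically fit together in FastAPI; it is an assumption about the wiring (paths, names, and return values are illustrative), not a copy of the actual `app/main.py` or `app/api/api_v1/api.py`.

```python
# Illustrative wiring only; the real modules in app/ may differ.
from fastapi import APIRouter, FastAPI

# app/api/api_v1/endpoints/live_stream.py (sketch): one router per endpoint module
live_stream_router = APIRouter()

@live_stream_router.post("/process-stream")  # assumed path
async def process_video_endpoint(url: str) -> dict:
    # In the real service this would hand the .m3u8 URL to live_stream_service
    return {"status": "processing", "url": url}

# app/api/api_v1/api.py (sketch): aggregate the version-1 routers
api_router = APIRouter()
api_router.include_router(live_stream_router, prefix="/live", tags=["live_stream"])

# app/main.py (sketch): create the application and mount the v1 API
app = FastAPI(title="live-streaming-system")
app.include_router(api_router, prefix="/api/v1")
```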
We ran the transcription benchmark on 12 audio files with a total duration of 2 minutes; an illustrative timing sketch follows the list below.
- Groq: Consistently has the lowest transcription time, averaging 0.46 seconds, with little variation across all audio files.
- OpenAI: Moderate performance, with an average time of 1.60 seconds. There is a slight upward trend for some audio files (e.g., audio_3.wav and audio_6.wav).
- Whisper Local: Significantly slower, averaging 13.47 seconds per file. It shows a clear downward trend initially, stabilizing around 13–14 seconds after audio_2.wav.
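As a reference point, per-file transcription time can be measured with a small harness like the sketch below; this is illustrative only (file names, model choices, and clients are assumptions), not the exact script used for the results above.

```python
# Illustrative timing harness; models, file names, and clients are assumptions.
import time

import whisper             # local Whisper (pip install openai-whisper)
from groq import Groq      # Groq transcription API (reads GROQ_API_KEY)
from openai import OpenAI  # OpenAI Whisper API (reads OPENAI_API_KEY)

audio_files = [f"audio_{i}.wav" for i in range(12)]  # hypothetical file names

groq_client = Groq()
openai_client = OpenAI()
local_model = whisper.load_model("base")             # assumed model size


def timed(fn):
    """Run fn once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


for path in audio_files:
    with open(path, "rb") as f:
        t_groq = timed(lambda: groq_client.audio.transcriptions.create(
            file=f, model="whisper-large-v3"))
    with open(path, "rb") as f:
        t_openai = timed(lambda: openai_client.audio.transcriptions.create(
            file=f, model="whisper-1"))
    t_local = timed(lambda: local_model.transcribe(path))
    print(f"{path}: groq={t_groq:.2f}s openai={t_openai:.2f}s local={t_local:.2f}s")
```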
For benchmarking the translation output, we prepared 12 `.txt` files in the subtitles folder. Each file is the speech-to-text output for a 10-second audio segment, so the total audio length for evaluation is 2 minutes.
For each file, we use XL8.ai and OpenAI's GPT-4 to generate the translation data for Vietnamese and Thai.
We use the following metrics to compare the similarity between translations (an illustrative computation sketch follows the list):
- TF-IDF: Provides quick similarity between two texts based on word occurrences. It assigns higher weights to words that appear frequently in a document but rarely in other documents.
- ChrF: Measures the similarity between two texts at the character level, making it more robust to paraphrasing, word reordering, and morphological variations.
- ROUGE-L: Evaluates the Longest Common Subsequence (LCS) between two texts. It measures structural similarity, including word overlap and sentence structure alignment.
- SBERT (Sentence-BERT): Measures the semantic similarity between two texts. It evaluates whether two sentences have the same meaning, regardless of word order or word choice (using a Sentence Transformers model).
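The sketch below shows one way these four metrics can be computed with commonly used libraries (scikit-learn, sacreBLEU, rouge-score, Sentence Transformers); the library and model choices are assumptions for illustration, not necessarily the exact implementation behind the results that follow.

```python
# Illustrative metric computation; library and model choices are assumptions.
import sacrebleu
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sbert_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed model


def similarity_scores(translation_a: str, translation_b: str) -> dict:
    # TF-IDF: cosine similarity over word-occurrence vectors
    tfidf = TfidfVectorizer().fit_transform([translation_a, translation_b])
    tfidf_sim = cosine_similarity(tfidf[0], tfidf[1])[0][0]

    # ChrF: character n-gram F-score (sacrebleu returns 0-100, scaled here to 0-1)
    chrf = sacrebleu.sentence_chrf(translation_a, [translation_b]).score / 100

    # ROUGE-L: F-measure of the longest common subsequence
    rouge = rouge_scorer.RougeScorer(["rougeL"]).score(translation_b, translation_a)
    rouge_l = rouge["rougeL"].fmeasure

    # SBERT: cosine similarity of sentence embeddings (semantic similarity)
    embeddings = sbert_model.encode([translation_a, translation_b], convert_to_tensor=True)
    sbert_sim = util.cos_sim(embeddings[0], embeddings[1]).item()

    return {"tfidf": tfidf_sim, "chrf": chrf, "rougeL": rouge_l, "sbert": sbert_sim}


print(similarity_scores("Xin chào thế giới", "Chào cả thế giới"))
```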
The first figure shows the translation similarity metrics between XL8.ai and OpenAI for translating segmented audio files from Korean to Vietnamese. Similarity scores for TF-IDF, ChrF, and ROUGE-L ranged from 45–60%, indicating variations in word choice and sentence structure. However, the SBERT score averaged 80%, showing strong alignment in the meaning of the translations.
For Thai, word boundaries are rarely marked with spaces, so ROUGE-L is a less informative metric. The average SBERT similarity score remains high, however. Overall, the translation outputs of the XL8.ai and OpenAI services are similar for both Vietnamese and Thai.
Experiment 2.2: Benchmarking the similarity between our 10-second segmentation and full-duration translation without segmentation
This experiment evaluates the translation quality of the 10-second segmentation method compared to full-length translations for Thai audio. The results demonstrate that XL8.ai outperforms OpenAI in key metrics such as TF-IDF similarity (0.7191 vs. 0.3668) and ChrF score (72.4867 vs. 47.7975), indicating better contextual and character-level accuracy. While OpenAI achieves a slightly higher ROUGE-L score (0.5714 vs. 0.5000), both services show similar SBERT similarity scores around 87%, reflecting a high degree of semantic preservation. Overall, XL8.ai's translations for Thai audio provide higher quality in a segmented approach, making it a better option for maintaining translation consistency and accuracy when using the 10-second segmentation method.
This figure compares translation similarity metrics between XL8.ai and OpenAI for Vietnamese audio using a 10-second segmentation method versus full-length translations. The results show that XL8.ai outperforms OpenAI in TF-IDF similarity (0.9166 vs. 0.6989), ChrF score (69.6081 vs. 47.6815), and ROUGE-L score (0.6136 vs. 0.4455), indicating superior contextual relevance, character-level accuracy, and sequence preservation in translations. Both services demonstrate high semantic similarity, with OpenAI scoring slightly higher on SBERT similarity (0.8948 vs. 0.8777). Overall, XL8.ai proves to be more effective in maintaining translation quality with the 10-second segmentation method, making it a stronger option for Vietnamese audio processing.
From the comparison of both graphs for Thai and Vietnamese translations, it is evident that XL8.ai consistently outperforms OpenAI in key metrics such as TF-IDF similarity, ChrF score, and ROUGE-L score. These results demonstrate that XL8.ai provides superior contextual relevance, character-level accuracy, and sequence preservation when using the 10-second segmentation method. Additionally, the high SBERT similarity scores for both XL8.ai and OpenAI (around 87–89%) across both languages indicate that the segmentation method has minimal impact on the semantic meaning of the translations. This highlights that XL8.ai is the preferred translation service for maintaining high-quality translations with segmented audio while ensuring that the segmentation approach does not significantly affect the overall translation quality.
The simulation evaluates the delay between the latest content of the input stream and the processed output. The goal is to analyze the time required for content to pass through the various processing stages, including video and audio segmentation, transcription, translation, and synchronization. Because we process chunks in parallel with multiple workers, the delay is only the total time required to process a single chunk.
- Average time for video/audio segmentation: around 10.1 seconds.
- Average time for the transcription process (OpenAI Whisper response and cron job processing): 7.8 seconds.
- Average time for the translation process (XL8.ai translation and cron job for creating subtitles): 8.4 seconds.
- Average time for synchronization of the m3u8 playlist: around 0.01 seconds.
Thus, the total delay time is around 26.3 seconds (10.1 + 7.8 + 8.4 + 0.01 ≈ 26.3).
In addition, we also ran the simulation with the HLS.js library. With a delay of around 27 seconds, the video with subtitles plays smoothly.