Modern Multimodal Corpus Research Software
Meta-Lingo is a comprehensive desktop application designed for corpus linguistics research. Built with modern technologies (Electron + React + Python FastAPI), it provides powerful tools for multimodal corpus management, linguistic analysis, and annotation.
- Multimodal Support: Text, audio, and video files with drag-and-drop upload
- Audio Transcription: Whisper Large V3 Turbo with word-level timestamps
- Forced Alignment: Wav2Vec2 word-level alignment for English audio (automatic)
- Pitch Extraction: TorchCrepe F0 extraction for English audio (automatic)
- Video Analysis: YOLOv8 object detection and CLIP semantic classification
- Automatic Annotation: SpaCy NLP (POS/NER/Dependency), USAS semantic domains, MIPVU metaphor identification
- Metadata Management: Language, author, source, text type with tag system
| Module | Description |
|---|---|
| Word Frequency | Frequency analysis with POS filtering, lemma/word form selection, visualization |
| N-gram Analysis | 2-6 gram support, nested grouping, Sankey diagrams |
| Keyword Extraction | TF-IDF, TextRank, YAKE!, RAKE, and 9 keyness statistics methods |
| Collocation | KWIC search with 6 modes, CQL query language, CQL Builder |
| Synonym Analysis | WordNet integration with network visualization |
| Semantic Domain | USAS-based analysis with dual view (by domain/by word) |
| Metaphor Analysis | MIPVU-based detection; 3-step pipeline (word filter → rules → Clause model); source color-coding by POS |
| Word Sketch | Grammar pattern analysis (50 relations), logDice scoring, difference comparison |
| Topic Modeling | BERTopic, LDA, LSA, NMF with dynamic topic analysis |
| Bibliography | Refworks parsing (WOS/CNKI), shadow corpus for abstracts, network visualization, burst detection; analysis modules support corpus/literature toggle and library selection (all / by keyword / manual). |
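
The Word Sketch module ranks collocates by logDice. As a rough illustration, the score can be computed from three frequencies using the commonly published definition, `14 + log2(2 * f_xy / (f_x + f_y))`. The function name and argument names below are hypothetical, and the README does not say how Meta-Lingo counts these frequencies internally, so treat this as a sketch of the statistic rather than the application's code.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Return the logDice association score.

    f_xy: co-occurrence frequency of the word and its collocate
    f_x:  frequency of the word (in the grammatical relation)
    f_y:  frequency of the collocate
    All names are illustrative, not Meta-Lingo identifiers.
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("frequencies must be positive")
    # The Dice coefficient is 2*f_xy / (f_x + f_y); logDice rescales it
    # so that the theoretical maximum is 14.
    return 14 + math.log2(2 * f_xy / (f_x + f_y))
```

Because the score depends only on the relative frequencies of the two words, it stays comparable across corpora of different sizes, which is the usual reason for preferring it to raw co-occurrence counts.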
- Text Annotation: Sentence-level display, intelligent segmentation, batch annotation
- Multimodal Annotation: Video frame tracking, DAW-style timeline, YOLO overlay
- Audio Waveform Annotation: Wavesurfer.js waveform visualization with word alignment, pitch curve overlay, box drawing annotation (English audio only)
- Framework Management: 49 preset frameworks (SFL, UAM, etc.), custom framework support
- Inter-coder Reliability: Fleiss' Kappa, Cohen's Kappa, Krippendorff's Alpha, Gold Standard support (plain text archives only)
- Syntax Visualization: Constituency and dependency parsing
- Dictionary Lookup: Macmillan, Longman Collocations with fuzzy search
- Bilingual Interface: Chinese and English with real-time switching
- Custom Wallpaper: Personalized application background
- Export Options: CSV, PNG, SVG for all visualizations
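
To make the inter-coder reliability measures listed above more concrete, here is a minimal sketch of Cohen's Kappa for two coders who labelled the same items. It follows the textbook definition (observed agreement corrected for chance agreement). The function name and the handling of the degenerate single-label case are assumptions for illustration, not Meta-Lingo's implementation.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(coder_a: Sequence[Hashable], coder_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa for two coders labelling the same items in the same order."""
    if len(coder_a) != len(coder_b) or len(coder_a) == 0:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(coder_a)

    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # Chance agreement: product of each coder's marginal label proportions.
    counts_a = Counter(coder_a)
    counts_b = Counter(coder_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Assumption: both coders used one identical label throughout;
        # kappa is undefined here, and this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, `cohens_kappa(["A", "A", "B", "B"], ["A", "A", "B", "A"])` gives an observed agreement of 0.75 against a chance agreement of 0.5, so the result is 0.5.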
+----------------------------------------------------------+
| Meta-Lingo |
+----------------------------------------------------------+
| Frontend (Electron + React + TypeScript) |
| - Material-UI components |
| - Zustand state management |
| - D3.js / Plotly.js visualizations |
| - i18next internationalization |
+----------------------------------------------------------+
| HTTP REST API |
+----------------------------------------------------------+
| Backend (Python FastAPI) |
| - SpaCy NLP processing |
| - USAS semantic tagging (PyMUSAS) |
| - MIPVU metaphor detection (DeBERTa) |
| - BERTopic / LDA / LSA / NMF topic modeling |
| - Whisper / YOLO / CLIP multimodal analysis |
+----------------------------------------------------------+
| Data Storage |
| - SQLite database (metadata) |
| - File system (corpora, annotations) |
+----------------------------------------------------------+
| Technology | Purpose |
|---|---|
| Electron 28+ | Desktop application framework |
| React 18 | UI framework |
| TypeScript 5 | Type safety |
| Material-UI 5 | Component library |
| D3.js 7 | Data visualization |
| Plotly.js | Interactive charts |
| Technology | Purpose |
|---|---|
| Python 3.12 | Runtime environment |
| FastAPI | Web framework |
| SpaCy 3.8+ | NLP processing |
| PyMUSAS | Semantic tagging |
| BERTopic | Topic modeling |
| Transformers | Whisper/CLIP models |
| Ultralytics | YOLOv8 |
Visit our official website to download the latest version:
https://tltanium.github.io/meta-lingo-website/
Source code in this repository is provided for reference and academic verification only. Please use the official distribution above to run Meta-Lingo.
After installing from the website, launch the application and follow the in-app guidance. For documentation, use the Help module inside the application.
- In-app Help: Access via the Help module with bilingual documentation
- API Documentation: http://localhost:8000/docs (when backend is running)
| Category | Endpoints |
|---|---|
| Corpus | /api/corpus/* - CRUD, upload, annotation |
| Analysis | /api/analysis/* - Word frequency, N-gram, keywords, etc. |
| Collocation | /api/collocation/* - KWIC search, CQL parsing |
| Topic Modeling | /api/topic-modeling/* - BERTopic, LDA, LSA, NMF |
| Annotation | /api/annotation/*, /api/framework/* |
| Word Sketch | /api/sketch/* - Grammar patterns, difference |
| Bibliography | /api/biblio/* - Libraries, visualization |
Full API documentation is available at the /docs endpoint while the backend is running.
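
The endpoint groups in the table can also be discovered programmatically. The sketch below assumes the backend keeps FastAPI's default schema route, `/openapi.json`, on the same host and port as the `/docs` page; the README does not confirm this, and the helper names are hypothetical.

```python
import json
from urllib.request import urlopen

# Base URL taken from the README; adjust if the backend runs elsewhere.
BASE_URL = "http://localhost:8000"


def fetch_openapi_schema(base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """Download the OpenAPI schema (assumes FastAPI's default /openapi.json route)."""
    with urlopen(f"{base_url}/openapi.json", timeout=timeout) as response:
        return json.load(response)


def list_api_paths(schema: dict, prefix: str = "/api/") -> list[str]:
    """Return the sorted route paths in the schema that start with the prefix."""
    return sorted(path for path in schema.get("paths", {}) if path.startswith(prefix))
```

With the backend running, `list_api_paths(fetch_openapi_schema())` would list every route under `/api/`, and passing `prefix="/api/sketch/"` narrows the result to the Word Sketch group.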
Meta-Lingo integrates several pre-trained models:
| Model | Purpose | Source |
|---|---|---|
| Whisper Large V3 Turbo | Audio transcription | OpenAI |
| Wav2Vec2-base-960h | Forced alignment (English) | |
| TorchCrepe Full | Pitch extraction (F0) | maxrmorrison/torchcrepe |
| YOLOv8 | Object detection | Ultralytics |
| CLIP ViT-Large-Patch14 | Image classification | OpenAI |
| SpaCy en/zh_core_web_lg | NLP processing | Explosion |
| DeBERTa-v3-large-clause-metaphor | MIPVU metaphor detection (F1 75.83) | tommyleo2077 |
| Sentence-BERT | Text embeddings | sentence-transformers |
This project is currently maintained for academic research purposes. For bug reports or feature requests, please open an issue.
- Metaphor Analysis (clause-only pipeline): Removed HiTZ model entirely. All tokens now annotated by a single `deberta-v3-large-clause-metaphor` model using full-sentence context (`max_length=192`). 3-step pipeline: word-form filter → SpaCy rule filter → Clause model. Function words (IN/DT/RB/RP) keep orange tag (finetuned); other words use green tag (clause). Legacy `hitz` source in existing annotations treated as `clause` (green). Help docs updated with Clause model accuracy (Precision 78.08%, Recall 73.69%, F1 75.83; DT F1 90.87, IN F1 87.87).
- Sentiment Analysis (USAS mode): Search panel adds "USAS Semantic Domain" mode; results aggregate sentiment scores by domain code with full domain name tooltip; word cloud uses domain names; CSV export adds `domain_name` column.
- Bibliography Visualization: PDF export rewritten via Electron IPC (`printToPDF`) to fix blank-page issue on large documents. Paper column with PDF upload and first-page thumbnail. 11 AI-generated fields per entry (research goal, questions, design, conclusions, mechanism, contribution, limitations, value, dialogue, future work, summary). Batch AI generation for multiple entries. Column visibility control. Export to styled PDF report.
- Sentiment Analysis (NRC): Full NRC-EmoLex annotation added to corpus pipeline after MIPVU. New analysis page with polarity (pie chart + word cloud) and emotion dimensions (radar chart + word cloud). Result table cross-links to collocation/word sketch/N-gram/semantic domain. Backend: `nrc_service.py`, `sentiment_analysis_service.py`, `POST /api/analysis/sentiment`.
- Cross-module links default to case-insensitive search. Collocation wordlist search mode (multi-word input, one per line).
- Bibliography: Bulk delete for selected entries. Relevance rating (0–5 stars), tags, and notes columns added to entry table and detail dialog. CSV export.
- Metaphor Analysis: Added Clause model (`deberta-v3-large-clause-metaphor`) to MIPVU pipeline for function-word annotation. POS-group statistics (IN/DT/RB/RP/OTHER metaphor rates) shown in results table header.
- Cross-module corpus selection sync across all analysis modules. Topic modeling bibliography mode with publication year for dynamic analysis.
- AI Assistant: Robot icon in all analysis modules' left panel (requires Ollama or OpenAI-compatible API); sends current page state as context. OpenAI-compatible API support in Settings (address / key / model). Cross-module library-mode link sync fixes.
- Semantic domain analysis: CQL cross-link, word cloud, domain name display. Collocation network expand on click, MinSense fix, Word Sketch Difference word-form/lemma mode. Topic modeling: N-gram preprocessing mode, LDA/LSA/NMF dynamic topic analysis.
- Praat acoustic analysis: Spectrogram, formants (F1–F5), intensity, HNR, jitter, shimmer. Chinese audio full visualization support. Corpus building script (`saves/corpus/corpus_building.py`) for 13 English corpora.
- Ridge plot SVG/PNG full export, CQL top-level OR operator and template auto-fill. Collocation search mode (lemma/word form). Result table search fix across all modules. Unified UI spacing and labeling. Cross-module N-gram link.
- Audio waveform annotation (Wavesurfer.js + TorchCrepe pitch + box drawing). Full annotation pipeline for audio/video transcripts. Inter-coder reliability gold standard fix. CQL distance selector fix.
- LLM topic naming (Ollama). USAS annotation modes (rule / neural / hybrid). Stopword removal (20+ languages). Custom wallpaper. Keyword extraction enhancements. Theme/Rheme auto-annotation. Dark theme for all topic modeling visualizations.
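
The source-tag rule described in the clause-only metaphor entry above can be summarised in a few lines. This is a sketch of the stated rule only: function-word tags (IN/DT/RB/RP) map to the orange "finetuned" source, everything else maps to the green "clause" source, and a legacy `hitz` source is displayed as `clause`. The function and constant names are hypothetical and do not come from the Meta-Lingo codebase.

```python
# POS tags the changelog lists as function words.
FUNCTION_WORD_TAGS = {"IN", "DT", "RB", "RP"}

# Display colour per annotation source, as described in the changelog.
SOURCE_COLORS = {"finetuned": "orange", "clause": "green"}


def source_for_token(pos_tag: str) -> str:
    """Pick the source label for a token from its POS tag."""
    return "finetuned" if pos_tag in FUNCTION_WORD_TAGS else "clause"


def display_color(source: str) -> str:
    """Map a stored source label to its tag colour.

    Legacy annotations may still carry the "hitz" source; they are
    shown the same way as "clause".
    """
    normalized = "clause" if source == "hitz" else source
    return SOURCE_COLORS[normalized]
```

For instance, `display_color(source_for_token("IN"))` yields `"orange"`, while a noun tagged `NN` or an old annotation stored with source `hitz` both resolve to `"green"`.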
For the full version history, see PROJECT.md or the Git commit log.
Meta-Lingo Software License (Non-Commercial)
Meta-Lingo is an independently developed corpus research software by Tommy Leo, protected under the Copyright Law of the People's Republic of China.
This software is licensed only for:
- Personal learning
- Academic research
- Non-commercial corpus analysis and linguistic research
Commercial use is prohibited without written permission.
See LICENSE_CN.txt (Chinese) or LICENSE_EN.txt (English) for full terms.
Copyright 2026 Tommy Leo. All rights reserved.
