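PDFMark is a PDF-to-Markdown conversion tool that offers both a command-line script and a web interface. It recognizes document structure and preserves the original heading hierarchy, which makes it particularly suitable for academic papers, technical documentation, and regulatory documents.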
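- 🎯 Smart Title Recognition: Automatically detects numeric, Chinese, and parenthetical numbering, among other title formats
- 📊 Hierarchy Preservation: Keeps the chapter and section hierarchy of the original document
- 🌐 Dual Interface: Supports both command-line scripts and a Web UI
- 🔧 Format Optimization: Automatically removes extra blank lines and tidies list and paragraph formatting
- 🚀 Efficient Conversion: Built on PyMuPDF for fast, accurate conversion
- 🎨 Modern UI: Responsive web interface with drag-and-drop upload
- 📱 Cross-platform: Supports Windows, macOS, and Linux
- 🌍 Chinese-friendly: Supports Chinese documents and Chinese numbering formats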
Install the dependencies with the bundled script, `python install_requirements.py`, or install them manually with pip:
`pip install PyPDF2 PyMuPDF pandas fastapi uvicorn python-multipart`
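- Basic usage: `python pdf_to_markdown.py`
- Custom input file: edit the filename in `pdf_to_markdown.py`, for example `pdf_file = "你的PDF文件.pdf"`
- Start the web service: `python start_webui.py`
- Access the web interface:
  - Open `http://localhost:8000` in a browser
  - Upload a PDF file
  - Click "开始转换" (Start Conversion)
  - Download the generated Markdown file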
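PDFMark/
├── README.md # Project documentation
├── install_requirements.py # Dependency installation script
├── pdf_to_markdown.py # Command-line conversion script
├── app.py # Web UI application
├── start_webui.py # Web UI startup script
├── uploads/ # Temporary upload directory
├── outputs/ # Conversion output directory
└── static/ # Static resources directory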
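The tool recognizes the following title formats:
- Numeric numbering: `1.`, `2.1`, `3.2.1`, etc.
- Chinese numbering: `一、`, `二、`, `三、`, etc.
- Parenthetical numbering: `(1)`, `(2)`, `(一)`, `(二)`, etc.
- Chapter titles: titles containing keywords such as "第X章"
- All-caps titles: short lines of all-uppercase text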
Sample input with Chinese headings:

    第一章 绪论
    1.1 研究背景
    1.1.1 问题提出
    1.1.2 研究意义
    1.2 研究目标
    (1) 主要目标
    (2) 次要目标
    二、文献综述

Converted output:

    # 第一章 绪论
    ## 1.1 研究背景
    ### 1.1.1 问题提出
    ### 1.1.2 研究意义
    ## 1.2 研究目标
    ### (1) 主要目标
    ### (2) 次要目标
    # 二、文献综述

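You can customize the title recognition rules in the `detect_headings()` function. For example, to treat Chinese section titles such as "第三节" as second-level headings, add a branch to its chain of checks:

    # Add new title format detection
    elif re.match(r'^第[一二三四五六七八九十]+节', line):
        processed_lines.append('## ' + line)
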
Issues and Pull Requests are welcome to improve the project!
This project is licensed under the MIT License; see the LICENSE file for details.
PDFMark is a PDF-to-Markdown conversion tool that offers both a command-line script and a web interface. It recognizes document structure and preserves the original heading hierarchy, which makes it particularly suitable for academic papers, technical documentation, and regulatory documents.
- 🎯 Smart Title Recognition: Automatically detects various title formats including numeric numbering, Chinese numbering, parenthetical numbering, etc.
- 📊 Hierarchy Preservation: Completely maintains the chapter and section hierarchy of the original document
- 🌐 Dual Interface: Supports both command-line scripts and Web UI
- 🔧 Format Optimization: Automatically cleans up extra blank lines and optimizes list and paragraph formatting
- 🚀 Efficient Conversion: Based on PyMuPDF for fast and accurate conversion
- 🎨 Modern UI: Responsive web interface with drag-and-drop upload support
- 📱 Cross-platform: Supports Windows, macOS, Linux
- 🌍 Chinese-friendly: Perfect support for Chinese documents and numbering formats
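
The "Format Optimization" step could look roughly like the sketch below. This is only an illustration of collapsing extra blank lines; the function name `collapse_blank_lines` is hypothetical and is not taken from the PDFMark source.

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Collapse runs of blank lines into a single blank line (illustrative helper)."""
    # Strip trailing whitespace so whitespace-only lines count as blank.
    lines = [line.rstrip() for line in text.splitlines()]
    # Three or more consecutive newlines mean two or more blank lines in a row.
    cleaned = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return cleaned.strip() + "\n"


sample = "# Title\n\n\n\nFirst paragraph.   \n \n\n- item\n"
print(collapse_blank_lines(sample))
```

Stripping trailing whitespace first matters because text extracted from a PDF often contains lines holding only spaces, which a plain newline pattern would not treat as blank.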
Install the dependencies with the bundled script, `python install_requirements.py`, or install them manually with pip:
`pip install PyPDF2 PyMuPDF pandas fastapi uvicorn python-multipart`
- Basic usage: `python pdf_to_markdown.py`
- Custom input file: edit the filename in `pdf_to_markdown.py`, for example `pdf_file = "your_pdf_file.pdf"`
- Start the web service: `python start_webui.py`
- Access the web interface:
  - Open `http://localhost:8000` in a browser
  - Upload a PDF file
  - Click "Start Conversion"
  - Download the generated Markdown file
PDFMark/
├── README.md # Project documentation
├── install_requirements.py # Dependency installation script
├── pdf_to_markdown.py # Command-line conversion script
├── app.py # Web UI application
├── start_webui.py # Web UI startup script
├── uploads/ # Temporary upload directory
├── outputs/ # Conversion output directory
└── static/ # Static resources directory
The tool recognizes the following title formats:
- Numeric numbering: `1.`, `2.1`, `3.2.1`, etc.
- Chinese numbering: `一、`, `二、`, `三、`, etc.
- Parenthetical numbering: `(1)`, `(2)`, `(一)`, `(二)`, etc.
- Chapter titles: titles containing keywords such as "Chapter X"
- All-caps titles: short lines of all-uppercase text
Sample input:

    Chapter 1 Introduction
    1.1 Research Background
    1.1.1 Problem Statement
    1.1.2 Research Significance
    1.2 Research Objectives
    (1) Primary Objectives
    (2) Secondary Objectives
    II. Literature Review

Converted output:

    # Chapter 1 Introduction
    ## 1.1 Research Background
    ### 1.1.1 Problem Statement
    ### 1.1.2 Research Significance
    ## 1.2 Research Objectives
    ### (1) Primary Objectives
    ### (2) Secondary Objectives
    # II. Literature Review

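
One way to arrive at heading levels like those in this example is to derive the level from the numbering depth. The sketch below reproduces the sample under stated assumptions: `heading_level` and `to_markdown` are hypothetical names, and the fixed level 3 for parenthetical items and level 1 for Roman numerals are inferred from the sample rather than taken from the PDFMark source.

```python
import re


def heading_level(line: str):
    """Return a Markdown heading level for a line, or None for body text."""
    text = line.strip()
    # Chapter titles and Roman-numeral titles are treated as top level.
    if re.match(r"^Chapter\s+\d+", text) or re.match(r"^[IVXLCDM]+\.\s", text):
        return 1
    # "1.1" has two components and maps to level 2; "1.1.1" maps to level 3.
    numeric = re.match(r"^(\d+(?:\.\d+)*)\.?\s", text)
    if numeric:
        return min(numeric.group(1).count(".") + 1, 6)
    # Assumption: parenthetical items always become level 3.
    if re.match(r"^\(\d+\)", text):
        return 3
    return None


def to_markdown(lines):
    """Prefix detected titles with the matching number of '#' characters."""
    result = []
    for line in lines:
        level = heading_level(line)
        result.append(f"{'#' * level} {line.strip()}" if level else line)
    return result


source_lines = [
    "Chapter 1 Introduction",
    "1.1 Research Background",
    "1.1.1 Problem Statement",
    "(1) Primary Objectives",
    "II. Literature Review",
]
print("\n".join(to_markdown(source_lines)))
```

A context-aware converter would more likely place parenthetical items one level below the nearest numbered title instead of using a fixed level.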
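You can customize the title recognition rules in the `detect_headings()` function. For example, to treat lines such as "Section IV" as second-level headings, add a branch to its chain of checks:

    # Add new title format detection
    elif re.match(r'^Section [IVXLCDM]+', line):
        processed_lines.append('## ' + line)
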
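
To show where such a branch could sit, here is a self-contained stand-in for `detect_headings()`. The real function body is not shown in this README, so everything except the custom `Section` rule is an assumption. The sketch also appends `\b` to the pattern so that ordinary words beginning with one of those letters, as in "Section Information", are not mistaken for Roman numerals.

```python
import re


def detect_headings(lines):
    """Hypothetical stand-in for the project's detect_headings() function."""
    processed_lines = []
    for raw_line in lines:
        line = raw_line.strip()
        if re.match(r"^Chapter\s+\d+", line):
            processed_lines.append("# " + line)
        elif re.match(r"^\d+\.\d+\s", line):
            processed_lines.append("## " + line)
        # Custom rule: "Section IV ..." becomes a second-level heading.
        elif re.match(r"^Section [IVXLCDM]+\b", line):
            processed_lines.append("## " + line)
        else:
            processed_lines.append(line)
    return processed_lines


print(detect_headings(["Chapter 2 Methods", "Section IV Results", "Section Information"]))
```

Because the branches are evaluated in order, a new rule should be placed before any broader pattern that would otherwise match the same line first.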
Issues and Pull Requests are welcome to improve the project!
This project is licensed under the MIT License - see the LICENSE file for details
If this project helps you, please consider giving it a star! ⭐
- Issues: GitHub Issues
PDFMark - Making PDF to Markdown conversion simple and intelligent! 🚀