
[Bug]: The word format document parse error! #12604

@simonjhy

Self Checks

  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (Language Policy).
  • Non-English title submissions will be closed directly (Language Policy).
  • Please do not modify this template :) and fill in all the required fields.

RAGFlow workspace code commit ID

ea619db

RAGFlow image version

0.23.1

Other environment information

ubuntu 22.04
intel x86

Actual behavior

2026-01-14 09:59:06,447 INFO 55 set_progress(8676a467f0ec11f098c96e686cb558a6), progress: -1, progress_msg: 09:59:06 [ERROR][Exception]: "There is no item named 'word/#_文档目录' in the archive"
2026-01-14 09:59:06,449 ERROR 55 handle_task got exception for task {"id": "8676a467f0ec11f098c96e686cb558a6", "doc_id": "85be2a40f0ec11f091236e686cb558a6", "from_page": 0, "to_page": 100000000, "retry_count": 0, "kb_id": "0436c772efaf11f0bb65ba5758d06a6b", "parser_id": "naive", "parser_config": {"layout_recognize": "DeepDOC", "chunk_token_num": 0, "delimiter": "(?m)^(\u5bc4\u5b58\u5668|\u5bc4\u5b58\u5668\u540d\u79f0|\u6a21\u5757|\d+\.)", "enable_children": true, "children_delimiter": "\n\n|(?=\u4f4d\s*\d+)|(?=\u504f\u79fb\u5730\u5740)|(?=\u5bc4\u5b58\u5668\u540d\u79f0)", "auto_keywords": 10, "auto_questions": 10, "html4excel": true, "topn_tags": 3, "toc_extraction": true, "image_table_context_window": 256, "overlapped_percent": 0.1, "mineru_parse_method": "auto", "mineru_formula_enable": true, "mineru_table_enable": true, "mineru_lang": "English", "raptor": {"use_raptor": true, "prompt": "Please summarize the following paragraphs. Be careful with the numbers, do not make things up. Paragraphs as following:\n {cluster_content}\nThe above is the content you need to summarize.", "max_token": 512, "threshold": 0.1, "max_cluster": 64, "random_seed": 1236, "scope": "file"}, "graphrag": {"use_graphrag": true, "entity_types": ["Chip", "Peripheral", "Register", "BitField", "Protocol", "Function", "Firmware", "Parameter", "Address", "MemoryRegion", "Bus", "Clock", "Interrupt", "DMAChannel", "Pin", "PinFunction", "Signal", "GPIOPort", "AlternateFunction", "RegisterBlock", "ResetValue", "AccessType", "Configuration", "Mode", "Constraint", "Timing", "Formula", "StatusFlag", "Error", "Condition", "ChipFamily", "ChipVariant"], "method": "general", "resolution": true, "community": true}, "metadata": [], "enable_metadata": false, "llm_id": "Qwen3-32B___VLLM@VLLM", "image_context_size": 256, "table_context_size": 256}, "name": "SLM32F1XM606 User manual cn.docx", "type": "doc", "location": "SLM32F1XM606 User manual cn.docx", "size": 13300861, "tenant_id": 
"1e172d2defa111f09b9bba5758d06a6b", "language": "Chinese", "embd_id": "bge-m3___LocalAI@LocalAI", "pagerank": 6, "kb_parser_config": {"layout_recognize": "DeepDOC", "chunk_token_num": 512, "delimiter": "(?m)^(\u5bc4\u5b58\u5668|\u5bc4\u5b58\u5668\u540d\u79f0|\u6a21\u5757|\d+\.)", "enable_children": true, "children_delimiter": "\n\n|(?=\u4f4d\s*\d+)|(?=\u504f\u79fb\u5730\u5740)|(?=\u5bc4\u5b58\u5668\u540d\u79f0)", "auto_keywords": 10, "auto_questions": 10, "html4excel": true, "topn_tags": 3, "toc_extraction": true, "image_table_context_window": 256, "overlapped_percent": 0.1, "mineru_parse_method": "auto", "mineru_formula_enable": true, "mineru_table_enable": true, "mineru_lang": "English", "raptor": {"use_raptor": true, "prompt": "Please summarize the following paragraphs. Be careful with the numbers, do not make things up. Paragraphs as following:\n {cluster_content}\nThe above is the content you need to summarize.", "max_token": 512, "threshold": 0.1, "max_cluster": 64, "random_seed": 1236, "scope": "file"}, "graphrag": {"use_graphrag": true, "entity_types": ["Chip", "Peripheral", "Register", "BitField", "Protocol", "Function", "Firmware", "Parameter", "Address", "MemoryRegion", "Bus", "Clock", "Interrupt", "DMAChannel", "Pin", "PinFunction", "Signal", "GPIOPort", "AlternateFunction", "RegisterBlock", "ResetValue", "AccessType", "Configuration", "Mode", "Constraint", "Timing", "Formula", "StatusFlag", "Error", "Condition", "ChipFamily", "ChipVariant"], "method": "general", "resolution": true, "community": true}, "metadata": [], "enable_metadata": false, "llm_id": "Qwen3-32B___VLLM@VLLM", "image_context_size": 256, "table_context_size": 256}, "img2txt_id": "", "asr_id": "", "llm_id": "Qwen3-32B___VLLM@VLLM", "update_time": 1768355910752, "task_type": ""}
Traceback (most recent call last):
File "/ragflow/rag/svr/task_executor.py", line 1150, in handle_task
await do_handle_task(task)
File "/ragflow/common/connection_utils.py", line 74, in async_wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ragflow/rag/svr/task_executor.py", line 1053, in do_handle_task
chunks = await build_chunks(task, progress_callback)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ragflow/common/connection_utils.py", line 74, in async_wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ragflow/rag/svr/task_executor.py", line 262, in build_chunks
cks = await asyncio.to_thread(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/futures.py", line 287, in __await__
yield self # This tells Task to wait for completion.
^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/tasks.py", line 385, in __wakeup
future.result()
File "/usr/lib/python3.12/asyncio/futures.py", line 203, in result
raise self._exception.with_traceback(self._exception_tb)
File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ragflow/rag/app/naive.py", line 727, in chunk
sections, tables = Docx()(filename, binary)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/ragflow/rag/app/naive.py", line 339, in __call__
filename) if not binary else Document(BytesIO(binary))
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ragflow/.venv/lib/python3.12/site-packages/docx/api.py", line 27, in Document
document_part = cast("DocumentPart", Package.open(docx).main_document_part)
^^^^^^^^^^^^^^^^^^
File "/ragflow/.venv/lib/python3.12/site-packages/docx/opc/package.py", line 126, in open
pkg_reader = PackageReader.from_file(pkg_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ragflow/.venv/lib/python3.12/site-packages/docx/opc/pkgreader.py", line 25, in from_file
sparts = PackageReader._load_serialized_parts(phys_reader, pkg_srels, content_types)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ragflow/.venv/lib/python3.12/site-packages/docx/opc/pkgreader.py", line 51, in _load_serialized_parts
for partname, blob, reltype, srels in part_walker:
File "/ragflow/.venv/lib/python3.12/site-packages/docx/opc/pkgreader.py", line 82, in _walk_phys_parts
for partname, blob, reltype, srels in next_walker:
File "/ragflow/.venv/lib/python3.12/site-packages/docx/opc/pkgreader.py", line 82, in _walk_phys_parts
for partname, blob, reltype, srels in next_walker:
File "/ragflow/.venv/lib/python3.12/site-packages/docx/opc/pkgreader.py", line 79, in _walk_phys_parts
blob = phys_reader.blob_for(partname)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ragflow/.venv/lib/python3.12/site-packages/docx/opc/phys_pkg.py", line 83, in blob_for
return self._zipf.read(pack_uri.membername)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/zipfile/__init__.py", line 1580, in read
with self.open(name, "r", pwd) as fp:
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/zipfile/__init__.py", line 1617, in open
zinfo = self.getinfo(name)
^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/zipfile/__init__.py", line 1545, in getinfo
raise KeyError(
KeyError: "There is no item named 'word/#_文档目录' in the archive"
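
A possible reading of the traceback, not confirmed against the file itself: the missing member name looks like a relationship target of `#_文档目录` (an in-document bookmark anchor) that was resolved as if it were a part inside the `word/` directory. The sketch below only illustrates that path arithmetic with `posixpath`; `member_name` is a hypothetical helper, and the assumption that python-docx joins the target onto the source part's base URI in this way is mine.

```python
import posixpath


def member_name(base_uri: str, target: str) -> str:
    """Resolve a relationship target against a base URI and return a zip member name.

    Hypothetical helper: it mimics, under the stated assumption, how an
    internal relationship target could end up as a zip lookup key.
    """
    joined = posixpath.join(base_uri, target)
    # Zip member names carry no leading slash.
    return posixpath.normpath(joined).lstrip("/")


# A bookmark anchor treated as an internal part ends up under "word/".
print(member_name("/word", "#_文档目录"))
# A genuine internal part resolves the same way, but does exist in the archive.
print(member_name("/word", "media/image1.png"))
```

If this reading is right, the relationship is most likely a hyperlink whose target is an anchor but which is not marked `TargetMode="External"`, so the reader tries to load it as a part.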

Expected behavior

No response

Steps to reproduce

Upload a .docx document. I am not sure which part or element causes this error.
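
To narrow down which element is responsible, one option is to scan the `.rels` files inside the .docx for internal relationships whose target does not exist in the archive. This is a minimal sketch using only the standard library; `find_dangling_rels` is a hypothetical helper, not part of RAGFlow or python-docx, and it assumes the failure comes from such a dangling relationship.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile
from io import BytesIO

# Namespace of <Relationship> elements in OPC .rels files.
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def find_dangling_rels(docx_bytes: bytes) -> list[tuple[str, str, str]]:
    """Return (rels_file, relationship_id, target) for every internal
    relationship whose resolved target is missing from the zip archive."""
    dangling = []
    with zipfile.ZipFile(BytesIO(docx_bytes)) as zf:
        names = set(zf.namelist())
        for rels_name in names:
            if not rels_name.endswith(".rels"):
                continue
            # "word/_rels/document.xml.rels" describes targets relative to "word/".
            base_dir = posixpath.dirname(posixpath.dirname(rels_name))
            root = ET.fromstring(zf.read(rels_name))
            for rel in root.iter(REL_NS + "Relationship"):
                # External targets (URLs and the like) are never zip members.
                if rel.get("TargetMode") == "External":
                    continue
                target = rel.get("Target", "")
                if target.startswith("/"):
                    member = target.lstrip("/")
                else:
                    member = posixpath.normpath(posixpath.join(base_dir, target))
                if member not in names:
                    dangling.append((rels_name, rel.get("Id", ""), target))
    return sorted(dangling)
```

Calling it on the raw bytes of the uploaded file, for example `find_dangling_rels(open(path, "rb").read())` with `path` pointing at the document, should list any relationship like the one named in the error, along with the `.rels` file that declares it.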

Additional information

No response


Labels

🐞 bug
