extraction node in workflow word docx extraction order issue #14409

cyflhn · 2025-02-26T09:58:27Z

Self Checks

This is only for bug report, if you would like to ask a question, please head to Discussions.
I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
Please do not modify this template :) and fill in all the required fields.

Dify version

0.15.3

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Create an extraction node in a workflow.
Upload a docx file.
Check the extraction result.

✔️ Expected Behavior

The order of paragraphs and tables should be consistent with the original document.

❌ Actual Behavior

The order of paragraphs and tables is not consistent with the original document.

dosubot · 2025-02-26T10:00:32Z

I found a similar open issue regarding the extraction order of paragraphs and tables from DOCX files. The problem arises because the code processes paragraphs and tables separately, which disrupts their original order. The suggested solution is to traverse the elements in the exact order they appear in the document, processing each paragraph and table as encountered. This approach ensures the extracted content retains the original sequence of the document ^[1].

Additionally, the WordExtractor class in the Dify platform's codebase provides a method to maintain the order by iterating over the document's elements and processing them based on their type (paragraph or table) ^[2]^[3].
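The "traverse elements in document order" idea described above can be sketched roughly as follows. This is not Dify's WordExtractor code; it is a minimal standard-library illustration that reads `word/document.xml` directly, and the names `extract_in_order` and `_text` are hypothetical. It assumes a well-formed .docx and only looks at paragraphs and tables that are direct children of the document body.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def _text(element):
    """Concatenate the text of every w:t run found inside an element."""
    return "".join(t.text or "" for t in element.iter(f"{{{W_NS}}}t"))


def extract_in_order(docx_file):
    """Return (kind, content) tuples for body-level blocks in document order.

    Hypothetical helper: paragraphs yield ("paragraph", str) and tables
    yield ("table", list of rows, each a list of cell strings).
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    body = root.find("w:body", NS)
    blocks = []
    if body is None:
        return blocks

    # Walk the body's direct children once, so a table that sits between
    # two paragraphs stays between them in the output.
    for child in body:
        if child.tag == f"{{{W_NS}}}p":
            text = _text(child)
            if text.strip():  # skip empty paragraphs
                blocks.append(("paragraph", text))
        elif child.tag == f"{{{W_NS}}}tbl":
            rows = [
                [_text(cell) for cell in row.findall("w:tc", NS)]
                for row in child.findall("w:tr", NS)
            ]
            blocks.append(("table", rows))
    return blocks
```

The key design point is the single loop over the body's children: collecting all paragraphs first and all tables second, as the issue describes, is what loses the interleaving. Note that this sketch joins multi-paragraph cell text without separators and does not treat nested tables specially.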


cyflhn linked a pull request Feb 26, 2025 that will close this issue

fix: the order issue for extraction node in workflow for docx file extraction result #14411

Open
