extraction node in workflow word docx extraction order issue #14409

Open
5 tasks done
cyflhn opened this issue Feb 26, 2025 · 1 comment · May be fixed by #14411

Comments

Contributor

cyflhn commented Feb 26, 2025

Self Checks

  • This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

0.15.3

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

  1. Create an extraction node in a workflow.
  2. Upload a docx file.
  3. Check the extraction result.

✔️ Expected Behavior

The order of paragraphs and tables in the extraction result should be consistent with the original document.

❌ Actual Behavior

The order of paragraphs and tables in the extraction result is not consistent with the original document.

dosubot bot commented Feb 26, 2025

I found a similar open issue regarding the extraction order of paragraphs and tables from DOCX files. The problem arises because the code processes paragraphs and tables separately, which disrupts their original order. The suggested solution is to traverse the elements in the exact order they appear in the document, processing each paragraph and table as encountered. This approach ensures the extracted content retains the original sequence of the document [1].

Additionally, the WordExtractor class in the Dify platform's codebase provides a method to maintain the order by iterating over the document's elements and processing them based on their type (paragraph or table) [2][3].
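
The in-order traversal described above can be illustrated with a minimal, standard-library-only sketch. It is not Dify's WordExtractor code and makes no claim about how the linked fix is implemented. It assumes the usual WordprocessingML layout, where `word/document.xml` holds a `w:body` whose direct children are paragraphs (`w:p`) and tables (`w:tbl`). The names `extract_blocks_in_order` and `build_sample_docx` are hypothetical and exist only for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def _text(element):
    """Join the text of every w:t run found under the given element."""
    return "".join(t.text or "" for t in element.iter(f"{W}t"))


def extract_blocks_in_order(docx_bytes):
    """Hypothetical helper: return (kind, content) tuples in body order.

    Paragraphs and tables are handled in a single pass over the body's
    direct children, so their relative order is preserved.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(f"{W}body")
    blocks = []
    for child in body:  # direct children only, in document order
        if child.tag == f"{W}p":
            blocks.append(("paragraph", _text(child)))
        elif child.tag == f"{W}tbl":
            rows = [
                [_text(cell) for cell in row.findall(f"{W}tc")]
                for row in child.findall(f"{W}tr")
            ]
            blocks.append(("table", rows))
    return blocks


def build_sample_docx():
    """Hypothetical helper: build a tiny archive with only document.xml.

    This is enough for the parser above, though not a complete docx
    that a word processor would accept.
    """
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Intro</w:t></w:r></w:p>"
        "<w:tbl><w:tr>"
        "<w:tc><w:p><w:r><w:t>A</w:t></w:r></w:p></w:tc>"
        "<w:tc><w:p><w:r><w:t>B</w:t></w:r></w:p></w:tc>"
        "</w:tr></w:tbl>"
        "<w:p><w:r><w:t>Outro</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()
```

Calling `extract_blocks_in_order(build_sample_docx())` yields the intro paragraph, then the table, then the closing paragraph. Collecting all paragraphs first and all tables afterwards would instead push the table to the end, which matches the behavior reported in this issue.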
