The Document Extractor tool encountered several abnormal situations #14110

Open
5 tasks done
yoke1990 opened this issue Feb 20, 2025 · 5 comments
Labels
🐞 bug Something isn't working

Comments

@yoke1990

Self Checks

  • This is only for bug report, if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

0.15.3

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

1. Upload a .docx file. For a small number of .docx files, the Document Extractor reports:
Failed to extract text from DOC/DOCX: "There is no item named 'NULL' in the archive"

Image

2. Upload a file whose content exceeds 80,000 characters. The Document Extractor reports:

Image

3. When the Document Extractor fails, there is no exception handling option; the node can only report an error.

4. When the Document Extractor extracts Excel content, many useless "nan" values appear, pushing the output over 80,000 characters.

It extracts columns all the way up to 255, producing headers such as "Unnamed: 253 | Unnamed: 254 | Unnamed: 255". In fact, most of these columns are blank, so the output is filled with 'nan | nan |'.

As a result, it reports “The length of output variable result must be less than 80000 characters”.

5. On an Android phone in WeChat, there is no file upload option.

Image

✔️ Expected Behavior

No response

❌ Actual Behavior

No response


dosubot bot commented Feb 20, 2025

@yoke1990 Hello there! I'm here to help you with bug reports, answer questions, and even guide you on becoming a contributor. Let's squash those bugs together! 🤖

  1. DOCX Extraction Error: The error "There is no item named 'NULL' in the archive" suggests an issue with the DOCX file structure. This might be due to a corrupted file or an unsupported format. Ensure the DOCX files are not corrupted and are in a standard format.

  2. Character Limit Exceeded: The Document Extractor has a character limit of 80,000 for output variables. If the content exceeds this limit, consider splitting the document into smaller parts before processing or using the tool's chunking mechanism to handle large content [1].

  3. Exception Handling: The Document Extractor tool allows for customized error handling. You can specify a default value or a fail branch to handle exceptions without stopping the process [2].

  4. Excel NAN Values: The tool processes Excel files using pandas, which can result in 'nan' values for empty cells. To avoid this, ensure that the Excel files are cleaned of unnecessary empty columns before processing. The tool uses dropna to handle empty rows, but you might need to preprocess the file to remove empty columns [3].

  5. File Upload on Android/WeChat: There is no specific configuration mentioned for handling file uploads on mobile devices like Android or WeChat within the Dify platform. This might require additional customization or checking the mobile app's settings for file upload capabilities.
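The splitting suggested in point 2 can be sketched as below. This is a minimal illustration that runs outside Dify on text you have already extracted yourself; `split_text` is a hypothetical helper, not part of Dify, and the only figure taken from this thread is the 80,000-character limit. It prefers to cut at line boundaries and only slices inside a line when a single line is too long.

```python
def split_text(text: str, limit: int = 80_000) -> list[str]:
    """Split text into chunks strictly shorter than `limit` characters."""
    if limit < 2:
        raise ValueError("limit must be at least 2")
    # The error message says "less than", so stay one character below the limit.
    max_len = limit - 1
    chunks: list[str] = []
    current = ""
    for line in text.splitlines(keepends=True):
        # A single line longer than max_len is cut into fixed-size slices.
        while len(line) > max_len:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_len])
            line = line[max_len:]
        # Start a new chunk when the next line would not fit in the current one.
        if len(current) + len(line) > max_len:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks
```

Joining the returned chunks reproduces the input unchanged, so each chunk can be processed separately and the results combined afterwards.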


@dosubot dosubot bot added the 🐞 bug Something isn't working label Feb 20, 2025
@yoke1990

Both of these documents cause errors in the Document Extractor. You can test with them:

925523fea21073501b9765ffe0be96eb.docx

a6af9a7b61b8f03363aca1d45e378e23.xls
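One way to examine a .docx like the one attached, outside Dify: a .docx file is a ZIP archive, and its `.rels` parts list the internal files the document refers to. The sketch below reports relationship targets that are not present in the archive. It assumes, without confirmation in this thread, that the "no item named 'NULL'" error comes from such a dangling target; `find_missing_targets` is a hypothetical helper that uses only the Python standard library.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

# XML namespace used by relationship parts in Office Open XML packages.
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def find_missing_targets(docx) -> list[tuple[str, str]]:
    """Return (rels_part, target) pairs whose internal target is not in the archive."""
    missing = []
    with zipfile.ZipFile(docx) as archive:
        names = set(archive.namelist())
        for rels_name in sorted(names):
            if not rels_name.endswith(".rels"):
                continue
            # "word/_rels/document.xml.rels" describes targets relative to "word/".
            base_dir = posixpath.dirname(posixpath.dirname(rels_name))
            root = ET.fromstring(archive.read(rels_name))
            for rel in root.iter(f"{REL_NS}Relationship"):
                if rel.get("TargetMode") == "External":
                    continue  # external URLs are not stored inside the zip
                target = rel.get("Target", "")
                if target.startswith("/"):
                    resolved = target.lstrip("/")  # absolute within the package
                else:
                    resolved = posixpath.normpath(posixpath.join(base_dir, target))
                if resolved not in names:
                    missing.append((rels_name, target))
    return missing
```

Calling `find_missing_targets("925523fea21073501b9765ffe0be96eb.docx")` on a local copy would list any such entries; an empty list would suggest the cause lies elsewhere.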

@yoke1990

yoke1990 commented Feb 20, 2025

@dosu

Hi, thanks for your reply.

1. The .docx file that causes the error can be edited and saved correctly in Word and WPS. It was written by a user with regular Office, yet it still reports an error. I attached the .docx file in my reply above; it reproduces the error.

925523fea21073501b9765ffe0be96eb.docx

2 and 3. When the content exceeds the 80,000-character limit, the Document Extractor does not allow customized error handling in v0.15.3.

Image

4. It looks like it forces the reading of 255 columns from the Excel sheet, which generates a large number of empty columns.

You can test it with:

a6af9a7b61b8f03363aca1d45e378e23.xls

5. With the same configuration, the file upload option is displayed normally on iPhone, but not on Android phones.


dosubot bot commented Feb 20, 2025

  1. DOCX File Error: The error with the DOCX file might be due to specific content or structure that the Document Extractor tool cannot handle, even if it appears fine in Word or WPS. The _extract_text_from_docx function in the code handles DOCX extraction and raises a TextExtractionError if it encounters issues [1]. You might need to inspect the file's content or structure for anomalies that could cause this error.

  2. Character Limit Handling: In version 0.15.3, the Document Extractor tool enforces a character limit of 80,000, and it seems there is no built-in mechanism for customized error handling when this limit is exceeded. You may need to manually split the content into smaller parts before processing.

  3. Excel File Empty Columns: The tool processes Excel files using pandas, and it drops rows where all elements are NaN using df.dropna(how="all", inplace=True). However, it seems to read all columns, including empty ones, which might result in unnecessary 'nan' values. Preprocessing the Excel file to remove empty columns before using the tool could help [1][2].

  4. File Upload on Android: The issue with file uploads on Android but not on iPhone might be related to platform-specific configurations or limitations. Unfortunately, there is no specific documentation or code in the repository addressing this difference. You might need to explore the mobile app's settings or configurations for file uploads.
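The column preprocessing mentioned in point 3 can be sketched with pandas as follows. This is a standalone cleanup you would run on the workbook before uploading it, not Dify's own extraction code; `drop_empty_rows_and_columns` is a hypothetical helper, and the sample frame stands in for a sheet with blank trailing columns.

```python
import pandas as pd


def drop_empty_rows_and_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without rows or columns that contain only NaN."""
    # Drop columns first: blank "Unnamed: N" columns hold nothing but NaN.
    cleaned = df.dropna(axis="columns", how="all")
    # Then drop blank rows, mirroring the dropna(how="all") call quoted above.
    return cleaned.dropna(axis="index", how="all")


# Small in-memory stand-in for a sheet with trailing blank columns.
sample = pd.DataFrame(
    {
        "Name": ["apple", None, "pear"],
        "Qty": [3, None, 5],
        "Unnamed: 2": [None, None, None],
        "Unnamed: 3": [None, None, None],
    }
)
cleaned = drop_empty_rows_and_columns(sample)
print(list(cleaned.columns), len(cleaned))
```

With a real workbook, the frame would come from `pd.read_excel(path)`, which needs an Excel engine such as xlrd (for .xls) or openpyxl (for .xlsx) installed. Cells that are empty inside otherwise populated rows and columns would still show up as NaN.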


@yoke1990

  1. I know it is difficult to be compatible with every document format, so I will come up with a workaround myself, such as writing a tool to convert files to a standard format.

Of course, I would prefer Dify to improve compatibility with Office documents.

  2. I hope that future versions will add exception handling options to the Document Extractor.
