The Document Extractor tool encountered several abnormal situations #14110

yoke1990 · 2025-02-20T10:26:51Z

Self Checks

This is only for bug report, if you would like to ask a question, please head to Discussions.
I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
Please do not modify this template :) and fill in all the required fields.

Dify version

0.15.3

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

1. Upload a DOCX file. For a small number of DOCX documents, the Document Extractor reports:
Failed to extract text from DOC/DOCX: "There is no item named 'NULL' in the archive"

2. Upload a file whose content exceeds 80,000 characters. The Document Extractor reports an error.

3. When the Document Extractor raises an exception, there is no exception-handling option; the node can only report the error.

4. When the Document Extractor extracts Excel content, many useless "nan" values appear, which pushes the output over 80,000 characters.

It extracts every column of the sheet up to column 255, producing headers such as "Unnamed: 253 | Unnamed: 254 | Unnamed: 255". In fact, most of these columns are blank, so the output is full of 'nan | nan |'.

The node therefore reports: “The length of output variable result must be less than 80000 characters”

5. On an Android phone in WeChat, there is no file upload option.

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

dosubot · 2025-02-20T10:28:58Z

@yoke1990 Hello there! I'm here to help you with bug reports, answer questions, and even guide you on becoming a contributor. Let's squash those bugs together! 🤖

1. DOCX Extraction Error: The error "There is no item named 'NULL' in the archive" suggests an issue with the DOCX file structure. This might be due to a corrupted file or an unsupported format. Ensure the DOCX files are not corrupted and are in a standard format.
2. Character Limit Exceeded: The Document Extractor has a character limit of 80,000 for output variables. If the content exceeds this limit, consider splitting the document into smaller parts before processing or using the tool's chunking mechanism to handle large content ^[1].
3. Exception Handling: The Document Extractor tool allows for customized error handling. You can specify a default value or a fail branch to handle exceptions without stopping the process ^[2].
4. Excel NaN Values: The tool processes Excel files using pandas, which can result in 'nan' values for empty cells. To avoid this, ensure that the Excel files are cleaned of unnecessary empty columns before processing. The tool uses dropna to handle empty rows, but you might need to preprocess the file to remove empty columns ^[3].
5. File Upload on Android/WeChat: There is no specific configuration mentioned for handling file uploads on mobile devices like Android or WeChat within the Dify platform. This might require additional customization or checking the mobile app's settings for file upload capabilities.
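
As a rough illustration of the preprocessing suggested for the Excel case, the sketch below drops rows and columns in which every cell is empty. It is not Dify's code: the helper name `drop_empty_rows_and_columns` and the sample data are hypothetical, and an in-memory DataFrame stands in for the real sheet.

```python
import pandas as pd


def drop_empty_rows_and_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without rows or columns that are entirely NaN.

    Hypothetical helper for illustration; not part of Dify.
    """
    cleaned = df.dropna(axis=0, how="all")  # drop rows where every cell is NaN
    cleaned = cleaned.dropna(axis=1, how="all")  # drop columns where every cell is NaN
    return cleaned


# In-memory stand-in for a sheet with trailing empty columns and one empty row.
nan = float("nan")
sheet = pd.DataFrame(
    {
        "name": ["a", "b", None],
        "qty": [1.0, 2.0, nan],
        "Unnamed: 2": [nan, nan, nan],
        "Unnamed: 3": [nan, nan, nan],
    }
)

cleaned = drop_empty_rows_and_columns(sheet)
print(list(cleaned.columns))
print(len(cleaned))
```

For a real workbook, the DataFrame would come from `pd.read_excel(...)` and the cleaned result would be saved to a new file before it is uploaded; reading legacy `.xls` files with pandas typically requires the `xlrd` package.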

To continue talking to Dosu, mention @dosu.


yoke1990 · 2025-02-20T10:42:49Z

Both of these documents cause errors in the Document Extractor. You can test with them:

925523fea21073501b9765ffe0be96eb.docx

a6af9a7b61b8f03363aca1d45e378e23.xls

yoke1990 · 2025-02-20T11:03:10Z

@dosu

Hi, thanks for your reply.

1. The DOCX file that triggers the error can be opened, edited, and saved correctly in both Word and WPS. It is an ordinary document written by a user in regular Office, yet it reports an error. I attached the DOCX file in my reply above; it reproduces the error.

925523fea21073501b9765ffe0be96eb.docx

2 and 3. When the content exceeds the 80,000-character limit, the Document Extractor in v0.15.3 does not allow customized error handling.

4. It appears to force-read all 255 columns of the Excel sheet, which generates a large number of empty columns.

You can test it with:

a6af9a7b61b8f03363aca1d45e378e23.xls

5. With the same configuration, the file upload option is displayed normally on iPhone, but not on Android phones.

dosubot · 2025-02-20T11:06:50Z

1. DOCX File Error: The error with the DOCX file might be due to specific content or structure that the Document Extractor tool cannot handle, even if it appears fine in Word or WPS. The _extract_text_from_docx function in the code handles DOCX extraction and raises a TextExtractionError if it encounters issues ^[1]. You might need to inspect the file's content or structure for anomalies that could cause this error.
2. Character Limit Handling: In version 0.15.3, the Document Extractor tool enforces a character limit of 80,000, and it seems there is no built-in mechanism for customized error handling when this limit is exceeded. You may need to manually split the content into smaller parts before processing.
3. Excel File Empty Columns: The tool processes Excel files using pandas, and it drops rows where all elements are NaN using df.dropna(how="all", inplace=True). However, it seems to read all columns, including empty ones, which might result in unnecessary 'nan' values. Preprocessing the Excel file to remove empty columns before using the tool could help ^[1]^[2].
4. File Upload on Android: The issue with file uploads on Android but not on iPhone might be related to platform-specific configurations or limitations. Unfortunately, there is no specific documentation or code in the repository addressing this difference. You might need to explore the mobile app's settings or configurations for file uploads.
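
One way to inspect a DOCX file's structure, sketched under assumptions: the quoted message matches the `KeyError` that Python's `zipfile` module raises when a requested archive member is missing, so a plausible (unconfirmed) cause is a relationship entry whose `Target` is the literal string `NULL`. The helper below lists internal relationship targets that do not exist in the archive. The name `find_dangling_targets` is hypothetical, and this has not been run against the attached file.

```python
import io
import posixpath
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by OPC relationship (.rels) parts.
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def find_dangling_targets(docx_bytes: bytes) -> list:
    """Return sorted (rels_file, target) pairs whose internal target is missing.

    Hypothetical diagnostic helper; not part of Dify or python-docx.
    """
    dangling = []
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        names = set(archive.namelist())
        for rels_name in names:
            if not rels_name.endswith(".rels"):
                continue
            # "word/_rels/document.xml.rels" describes parts relative to "word/".
            base_dir = posixpath.dirname(posixpath.dirname(rels_name))
            root = ET.fromstring(archive.read(rels_name))
            for rel in root.iter(f"{REL_NS}Relationship"):
                if rel.get("TargetMode") == "External":
                    continue  # external links are not stored in the archive
                target = rel.get("Target", "")
                if target.startswith("/"):
                    resolved = target.lstrip("/")
                else:
                    resolved = posixpath.normpath(posixpath.join(base_dir, target))
                if resolved not in names:
                    dangling.append((rels_name, target))
    return sorted(dangling)
```

Calling `find_dangling_targets(open(path, "rb").read())` on a failing document and seeing a `NULL` target in the result would support this explanation; an empty list would point to a different cause.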


yoke1990 · 2025-02-20T11:27:36Z

I know that it is difficult to be compatible with every document format, so I will come up with a workaround myself, such as writing a tool that converts files to a standard format.

Of course, I would prefer that Dify improve its compatibility with Office documents.

I hope that future versions will add exception-handling options to the Document Extractor.

dosubot added the 🐞 bug label on Feb 20, 2025

dosubot bot mentioned this issue Feb 26, 2025

AIPPT built-in tools, fill in the outline to report errors #14368

