[ReOpen] Segmentation Recognition Fails When Using create_by_text in Parent-Child Mode in the Knowledge Base API #13013

tigflanker · 2025-01-24T07:46:39Z

Self Checks

This is only for bug reports; if you would like to ask a question, please head to Discussions.
I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
Please do not modify this template :) and fill in all the required fields.

Dify version

0.15.2

Cloud or Self Hosted

Cloud

Steps to reproduce

@crazywoola Hi, I have reproduced the same issue on BOTH the cloud version and the latest version 0.15.2.

If there is a problem with my POST call, please point it out for me. Thank you.

Happy New Year.

POST:

curl --location --request POST 'https://api.dify.ai/v1/datasets/004acde6-967b-4c6e-90b7-dcf634adb392/document/create_by_text' \
--header 'Authorization: Bearer dataset-FeWv2aj6r68Y8wK91gAKiiEh' \
--header 'Content-Type: application/json' \
--data-raw '{
    "name": "test-doc1",
    "text": "content_part_1<sep>content_part_2\tcontent_part_3\n\ncontent_part_4\ncontent_part_5",
    "indexing_technique": "high_quality",
    "doc_form": "hierarchical_model",
    "process_rule": {
        "mode": "custom",
        "rules": {
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": false},
                {"id": "remove_urls_emails", "enabled": false}
            ],
            "segmentation": {
                "separator": "whatever",
                "max_tokens": 2000,
                "parent_mode": "paragraph"
            }
        }
    }
}'

Info: #13007

✔️ Expected Behavior

No response

❌ Actual Behavior


dosubot · 2025-01-24T07:54:26Z

The issue you're experiencing with segmentation recognition in the create_by_text method in parent-child mode is a known problem in Dify version 0.15.2: custom segmentation settings revert to default values. To resolve this, ensure that your POST request payload is correctly structured. Specifically, 'parent_mode' and 'subchunk_segmentation' should be placed directly under the 'rules' object, not under 'process_rule' itself or inside 'segmentation' (where your request puts 'parent_mode'). The example below also sets 'mode' to 'hierarchical'. Here's the correct structure for your JSON payload:

{
    "name": "test-doc1",
    "text": "content_part_1<sep>content_part_2\tcontent_part_3\n\ncontent_part_4\ncontent_part_5",
    "indexing_technique": "high_quality",
    "doc_form": "hierarchical_model",
    "process_rule": {
        "mode": "hierarchical",
        "rules": {
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": false},
                {"id": "remove_urls_emails", "enabled": false}
            ],
            "segmentation": {
                "separator": "whatever",
                "max_tokens": 2000
            },
            "parent_mode": "paragraph",
            "subchunk_segmentation": {
                "separator": "***",
                "max_tokens": 500
            }
        }
    }
}

Make sure to update your payload accordingly and try again. This should help in applying the custom segmentation settings correctly ^[1].
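As a rough illustration of the layout described above, the following Python sketch builds the same payload and checks where the parent-child keys sit before the request is sent. The helper names `build_parent_child_payload` and `find_layout_problems` are hypothetical and not part of Dify; the field names are copied from the JSON above, and the check only reflects the placement suggested in this comment, not a confirmed server-side validation rule.

```python
def build_parent_child_payload(name, text, parent_separator, child_separator,
                               parent_max_tokens=2000, child_max_tokens=500):
    """Build a create_by_text body with the layout suggested in this thread.

    Hypothetical helper: field names mirror the JSON example above.
    """
    return {
        "name": name,
        "text": text,
        "indexing_technique": "high_quality",
        "doc_form": "hierarchical_model",
        "process_rule": {
            "mode": "hierarchical",
            "rules": {
                "pre_processing_rules": [
                    {"id": "remove_extra_spaces", "enabled": False},
                    {"id": "remove_urls_emails", "enabled": False},
                ],
                # Parent chunk settings.
                "segmentation": {
                    "separator": parent_separator,
                    "max_tokens": parent_max_tokens,
                },
                # Both keys sit directly under "rules", next to "segmentation".
                "parent_mode": "paragraph",
                "subchunk_segmentation": {
                    "separator": child_separator,
                    "max_tokens": child_max_tokens,
                },
            },
        },
    }


def find_layout_problems(payload):
    """Return a list of placement problems for the parent-child keys.

    Assumption: the only rule checked is the one stated in this thread,
    namely that both keys belong directly under process_rule.rules.
    """
    problems = []
    process_rule = payload.get("process_rule", {})
    rules = process_rule.get("rules", {})
    segmentation = rules.get("segmentation", {})
    for key in ("parent_mode", "subchunk_segmentation"):
        if key in process_rule:
            problems.append(f"{key} is directly under process_rule")
        if key in segmentation:
            problems.append(f"{key} is nested inside segmentation")
        if key not in rules:
            problems.append(f"{key} is missing from rules")
    return problems
```

Running `find_layout_problems` on the request body from the original report would flag 'parent_mode' as nested inside 'segmentation', which is the difference between the two payloads in this thread. The resulting dictionary can be serialized with `json.dumps` and sent as the request body.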


tigflanker · 2025-01-24T08:05:14Z

@dosu Awesome 👍

dosubot · 2025-01-24T08:05:39Z

Thank you! If you have any other questions or need further help, feel free to let me know. Good luck!

tigflanker · 2025-02-06T07:35:53Z

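@dosu Please help me find where in the Dify project code the token count is calculated, specifically during document upload through the API.

The endpoint I use for the upload is: create_by_text
The upload mode is: hierarchical_model
The embedding model is: bge-large-zh-v1.5
This question originates from the issue: https://github.com/langgenius/dify/issues/13262 (please pay particular attention to my last reply there)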

dosubot · 2025-02-06T07:37:32Z

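In the Dify project, token counting for the bge-large-zh-v1.5 embedding model is done by calling the get_num_tokens method of the GiteeAIEmbeddingModel class. This method uses the specified model to count the tokens in the given texts. In the provided test case, it is used to count the tokens of the texts ["hello", "world"], with an expected result of 2 tokens ^[1].

As for the implementation of the create_by_text API, the create_document_by_text function in the KnowledgeBaseClient class handles this API request. It builds a request to the /datasets/{self._get_dataset_id()}/document/create_by_text endpoint to create a document from text ^[2]^[3]^[4].

As for how token counts are calculated during document upload, no explicit documentation or comments explaining this process were found in the codebase, particularly for the hierarchical model ^[1]. For more detailed implementation specifics, consider reviewing the relevant code or the discussion in the GitHub issue.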


dosubot bot added the 🐞 bug Something isn't working label Jan 24, 2025

tigflanker closed this as completed Jan 24, 2025

tigflanker mentioned this issue Feb 7, 2025

Suggestion: Ollama community can try using the dosubot robot. ollama/ollama#8904

Open

dosubot bot mentioned this issue Feb 26, 2025

Dify misidentified MARKDOWN file segmentation #14388

Open

5 tasks
