[ReOpen] Segmentation Recognition Fails When Using create_by_text in Parent-Child Mode in the Knowledge Base API #13013

Closed
5 tasks done
tigflanker opened this issue Jan 24, 2025 · 5 comments
Labels
🐞 bug Something isn't working

@tigflanker

Self Checks

  • This is only for bug report, if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

0.15.2

Cloud or Self Hosted

Cloud

Steps to reproduce

@crazywoola Hi, I have reproduced the same issue on BOTH the cloud version and the latest version 0.15.2.

If there is a problem with my POST call, please point it out for me. Thank you.

Happy New Year.

POST:

curl --location --request POST 'https://api.dify.ai/v1/datasets/004acde6-967b-4c6e-90b7-dcf634adb392/document/create_by_text' \
--header 'Authorization: Bearer dataset-FeWv2aj6r68Y8wK91gAKiiEh' \
--header 'Content-Type: application/json' \
--data-raw '{
    "name": "test-doc1",
    "text": "content_part_1<sep>content_part_2\tcontent_part_3\n\ncontent_part_4\ncontent_part_5",
    "indexing_technique": "high_quality",
    "doc_form": "hierarchical_model",
    "process_rule": {
        "mode": "custom",
        "rules": {
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": false},
                {"id": "remove_urls_emails", "enabled": false}
            ],
            "segmentation": {
                "separator": "whatever",
                "max_tokens": 2000,
                "parent_mode": "paragraph"
            }
        }
    }
}'

Info: #13007

✔️ Expected Behavior

No response

❌ Actual Behavior

Image

@dosubot dosubot bot added the 🐞 bug Something isn't working label Jan 24, 2025

dosubot bot commented Jan 24, 2025

The segmentation problem you are seeing with create_by_text in parent-child mode is a known issue in Dify 0.15.2: custom segmentation settings revert to their default values. To work around it, check the structure of your POST payload. 'parent_mode' and 'subchunk_segmentation' should sit directly under the 'rules' object, as siblings of 'segmentation', rather than inside 'segmentation' or directly under 'process_rule'. Note that the example also sets 'mode' to "hierarchical" instead of "custom". Here is the corrected structure for your JSON payload:

{
    "name": "test-doc1",
    "text": "content_part_1<sep>content_part_2\tcontent_part_3\n\ncontent_part_4\ncontent_part_5",
    "indexing_technique": "high_quality",
    "doc_form": "hierarchical_model",
    "process_rule": {
        "mode": "hierarchical",
        "rules": {
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": false},
                {"id": "remove_urls_emails", "enabled": false}
            ],
            "segmentation": {
                "separator": "whatever",
                "max_tokens": 2000
            },
            "parent_mode": "paragraph",
            "subchunk_segmentation": {
                "separator": "***",
                "max_tokens": 500
            }
        }
    }
}

Make sure to update your payload accordingly and try again. This should help in applying the custom segmentation settings correctly [1].
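
The restructuring described above can be sketched in Python. This is a minimal illustration, not Dify code: `hoist_parent_child_keys` is a hypothetical helper, and both the list of keys to move (`parent_mode`, `subchunk_segmentation`) and the `"hierarchical"` mode value are assumptions taken from the payload in this comment rather than from verified API documentation.

```python
import copy

# Keys that, per the payload above, belong directly under "rules".
# This is an assumption drawn from this thread, not from verified Dify docs.
RULE_LEVEL_KEYS = ("parent_mode", "subchunk_segmentation")


def hoist_parent_child_keys(payload: dict) -> dict:
    """Return a copy of payload with the parent-child keys moved from
    rules.segmentation up to rules, and mode set to "hierarchical"."""
    fixed = copy.deepcopy(payload)  # leave the caller's dict untouched
    process_rule = fixed.setdefault("process_rule", {})
    rules = process_rule.setdefault("rules", {})
    segmentation = rules.get("segmentation", {})
    for key in RULE_LEVEL_KEYS:
        if key in segmentation:
            # Keep an existing rules-level value if one is already present.
            rules.setdefault(key, segmentation.pop(key))
    process_rule["mode"] = "hierarchical"
    return fixed


# Trimmed version of the payload from the original report.
original = {
    "name": "test-doc1",
    "doc_form": "hierarchical_model",
    "process_rule": {
        "mode": "custom",
        "rules": {
            "segmentation": {
                "separator": "whatever",
                "max_tokens": 2000,
                "parent_mode": "paragraph",
            }
        },
    },
}

fixed = hoist_parent_child_keys(original)
```

Whether the server accepts the resulting payload still has to be checked against the Dify version in use; the helper only rearranges keys locally.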

To continue talking to Dosu, mention @dosu.


@tigflanker
Author

@dosu Awesome 👍


dosubot bot commented Jan 24, 2025

Thank you! If you have any other questions or need further help, feel free to let me know. Good luck!


@tigflanker
Author

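@dosu Could you help me find where in the Dify project code the token count is calculated? Specifically, during document upload through the API.

The endpoint I use for uploading is: create_by_text
The upload mode is: hierarchical_model
The embedding model is: bge-large-zh-v1.5
The issue this question comes from is: https://github.com/langgenius/dify/issues/13262 (please pay particular attention to my last reply there)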


dosubot bot commented Feb 6, 2025

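In the Dify project, the token count for the bge-large-zh-v1.5 embedding model is computed by calling the get_num_tokens method of the GiteeAIEmbeddingModel class. This method uses the specified model to count the tokens in the given text. In the test case provided, it is used to count the tokens of the texts ["hello", "world"], and the expected result is 2 tokens [1].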

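As for the implementation of the create_by_text API, the create_document_by_text function in the KnowledgeBaseClient class handles this API request. It builds a request to the /datasets/{self._get_dataset_id()}/document/create_by_text endpoint to create a document from text [2][3][4].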
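
For illustration, the endpoint path quoted above can be assembled as follows. The sketch uses only the Python standard library and does not send anything. `build_create_by_text_request` is a hypothetical helper, not the SDK's `create_document_by_text`; the base URL is taken from the curl command earlier in this thread, and the dataset ID and API key are placeholders.

```python
import json
import urllib.request


def build_create_by_text_request(base_url, dataset_id, api_key, payload):
    """Build (but do not send) a POST request for the create_by_text endpoint."""
    url = f"{base_url.rstrip('/')}/datasets/{dataset_id}/document/create_by_text"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_by_text_request(
    "https://api.dify.ai/v1",
    "YOUR_DATASET_ID",       # placeholder
    "YOUR_DATASET_API_KEY",  # placeholder
    {"name": "test-doc1", "text": "content_part_1"},
)
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out here so the sketch can be run without credentials.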

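As for how the token count is calculated during document upload, no explicit documentation or comments explaining this process were found in the codebase, particularly for the hierarchical model [1]. If you need more implementation detail, it is best to look at the relevant code or the discussion in the GitHub issue.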

