
Disconnection of network during workflow execution causes status stuck as 'running' due to main-thread kill while actual execution completes in backend thread #12798

Closed
kazuhisa-wada opened this issue Jan 16, 2025 · 14 comments
Labels
🌊 feat:workflow Workflow related stuff.

Comments

@kazuhisa-wada
Contributor

Self Checks

  • This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit issues in English, or they will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

v0.15.1

Cloud or Self Hosted

Cloud, Self Hosted (Docker), Self Hosted (Source)

Steps to reproduce

Disconnecting the network (*) between the client and Dify during the execution of a workflow via the API or the web UI (both debug run and webApp run) kills the main thread and causes the status to get stuck as 'running', while the execution itself, spawned by the main thread, actually completes. Once a workflow falls into this state, there is no way to stop it; e.g., the stop API doesn't work.

Disconnecting the network here means either of the following:

  • closing the browser, or a browser tab, while a workflow is executing via the web UI (both debug run and webApp run)
  • killing the process of the API-calling application by any means (a reproduction sketch for this case follows)
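
A minimal reproduction sketch for the API case, assuming a workflow app API key and the documented POST /v1/workflows/run endpoint (the base URL and key below are placeholders): it starts a streaming run and drops the connection after a few events, which has the same effect as killing the calling process.

```python
import requests

# Placeholders: adjust the base URL and workflow app API key for your deployment.
BASE_URL = "http://localhost/v1"
API_KEY = "app-xxxxxxxx"

# Start a workflow run in streaming mode (SSE).
resp = requests.post(
    f"{BASE_URL}/workflows/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"inputs": {}, "response_mode": "streaming", "user": "repro-user"},
    stream=True,
)

# Consume only the first few SSE lines, then drop the connection mid-stream.
# This has the same effect as killing the calling process while the run is active.
for i, line in enumerate(resp.iter_lines()):
    print(line[:80])
    if i >= 3:
        break
resp.close()  # abrupt close: the server-side response generator is torn down

# Afterwards, the run keeps status 'running' in the DB even once it finishes.
```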

This issue then causes the following bad behaviors:

  • there is no log for that workflow execution on the UI, even if a large number of tokens was consumed during the execution. As a consequence, users can't see any results of the execution, which is very confusing.
  • LLM tokens are not counted even though the workflow completes.

This issue seems serious, as the actions that cause it are very easy to perform; it actually happens frequently in our use.

I believe this is the result of imperfect handling of network disconnection, which can currently kill the main thread.

✔️ Expected Behavior

  • Regardless of network disconnection, the actual workflow execution result should be logged in the database, so users can see the results if the workflow ends normally.
  • Regardless of network disconnection, workflow execution should be stoppable via the workflow stop API.

❌ Actual Behavior

Described above.

@dosubot dosubot bot added the 🌊 feat:workflow Workflow related stuff. label Jan 16, 2025
@kazuhisa-wada
Contributor Author

I believe the root causes of this problem are as follows:

  1. The DB update performed when each node run finishes lives on the main-thread side. The DB status update for each node run is triggered once an event message, published by the backend thread, is received.
  2. The main thread is killed when a network disconnection happens (maybe when a TCP FIN is received).
  3. Log info is not displayed in the log tab on the UI while the status is still 'running'. This spec is a bit strange, so it should be addressed separately from points 1-2 above.

@kazuhisa-wada
Contributor Author

Sorry, I forgot to mention whether streaming was used in the API case. What I tested was only streaming mode. I have not tested the non-streaming case, but I believe it works fine, since non-streaming mode doesn't write anything to the streaming channel.

@WeakX

WeakX commented Jan 17, 2025

I've run into this issue too. The connection may be closed while the response is streaming; Flask then closes the response generator, which raises GeneratorExit in the main thread (see the sketch below).

Because the main thread is also responsible for DB operations, once it exits, the workflow record (as well as the workflow_node_execution record, message record, etc.) is no longer updated.

This only happens in streaming mode.

I think it would be better to decouple the DB operations and the response from the main thread.
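
A minimal sketch of that failure mode, assuming a plain Flask streaming endpoint (the names run_workflow_events and mark_run_succeeded are illustrative stand-ins, not Dify's actual code): when the client disconnects, the WSGI server closes the generator, GeneratorExit is raised at the suspended yield, and the DB update after the loop never runs.

```python
from flask import Flask, Response

app = Flask(__name__)

def run_workflow_events():
    # Illustrative stand-in for the backend event queue: yields streamed events.
    for i in range(100):
        yield f"data: event {i}\n\n"

def mark_run_succeeded():
    # Illustrative stand-in for the DB update that flips the status off 'running'.
    print("workflow_run.status = succeeded")

@app.route("/run")
def run():
    def generate():
        try:
            for event in run_workflow_events():
                # If the client disconnects, the WSGI server closes this
                # generator and GeneratorExit is raised at this yield.
                yield event
            mark_run_succeeded()  # never reached after a disconnect
        except GeneratorExit:
            # Cleanup runs, but everything after the loop is abandoned,
            # so the run record stays 'running' forever.
            raise
    return Response(generate(), mimetype="text/event-stream")
```

This is why the run record stays 'running' even though the backend thread finishes its work.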

@kazuhisa-wada
Contributor Author

kazuhisa-wada commented Jan 17, 2025

Sounds good. Here is my understanding:

main thread

  1. create threads for node execution
  2. listen for messages from the backend execution to receive events via a standard Python queue
  3. update the message when a message is received
  4. write the response when a message is received or a ping event happens

backend thread

  1. execute the node execution
  2. publish messages (i.e., create events and throw them) via a standard Python queue

A better way would be to move no. 3 to an additional new thread (see the sketch below).
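
A sketch of that decoupling (all names here are illustrative, not Dify's actual classes): the backend fans each event out to two queues, a dedicated persistence thread owns every DB write, and the main thread only streams, so a client disconnect can no longer lose updates.

```python
import queue
import threading

# Two consumers: the streaming response and the DB writer. The backend
# fans each event out to both queues so neither depends on the other.
db_queue: "queue.Queue[dict]" = queue.Queue()
response_queue: "queue.Queue[dict]" = queue.Queue()

def publish(event: dict) -> None:
    db_queue.put(event)
    response_queue.put(event)

def backend_thread() -> None:
    # Stand-in for node execution (steps 1-2 of "backend thread" above).
    for i in range(3):
        publish({"type": "node_finished", "node": i})
    publish({"type": "workflow_finished"})

def save_to_db(event: dict) -> None:
    # Hypothetical DB helper, for illustration only.
    print("DB update:", event)

def persistence_thread() -> None:
    # The moved "no. 3": owns all DB writes and survives client disconnects.
    while True:
        event = db_queue.get()
        save_to_db(event)
        if event["type"] == "workflow_finished":
            return

threading.Thread(target=backend_thread).start()
threading.Thread(target=persistence_thread).start()

# The main thread now only streams response_queue to the client; if the
# client disconnects, only this loop dies -- the DB writes continue above.
while True:
    event = response_queue.get()
    print("stream:", event)
    if event["type"] == "workflow_finished":
        break
```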

@WeakX

WeakX commented Jan 17, 2025

> Sounds good. Here is my understanding:
>
> main thread
>
>   1. create threads for node execution
>   2. listen for messages from the backend execution to receive events via a standard Python queue
>   3. update the message when a message is received
>   4. write the response when a message is received or a ping event happens
>
> backend thread
>
>   1. execute the node execution
>   2. publish messages (i.e., create events and throw them) via a standard Python queue
>
> A better way would be to move no. 3 to an additional new thread.

agreed


dosubot bot commented Feb 17, 2025

Hi, @kazuhisa-wada. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • Network disconnection during workflow execution causes the main thread to terminate.
  • Workflow status remains 'running' despite backend completion, with no logging or API control.
  • @WeakX confirmed the issue, noting it occurs in streaming mode due to main thread handling.
  • Suggested resolution: Decouple DB operations from the main thread.

Next Steps:

  • Is this issue still relevant to the latest version of the Dify repository? If so, please comment to keep the discussion open.
  • Otherwise, this issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 17, 2025
@kazuhisa-wada
Contributor Author

> Is this issue still relevant to the latest version of the Dify repository? If so, please comment to keep the discussion open.

Definitely yes. I also want to emphasize that this issue is fairly critical in terms of customer experience, and so should be dealt with for v1.0.0, although it may be a difficult problem.

@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 17, 2025

dosubot bot commented Feb 17, 2025

@takatost, the user @kazuhisa-wada has confirmed that this issue is still relevant and critical for customer experience, especially for v1.0.0. Could you please assist them with this matter?

@Zerglingzl
Copy link

I encountered the same issue and sincerely hope that Dify can resolve it as soon as possible; otherwise, it will significantly impact our product’s user experience.

My situation might be somewhat unique, yet it makes reproducing the issue even easier: basically, it occurs over 50% of the time.

The business logic is as follows:
1. I use WeChat cloud functions to call Dify's workflow API. Once the workflow starts running, to avoid waiting too long for a result, the cloud function exits and is terminated.
2. I use WeChat cloud functions to poll the workflow status periodically. If the status becomes 'succeeded', I retrieve the result (a polling sketch follows the next paragraph).

As a result, the problem described in this issue (#12798) keeps occurring, causing the workflow status to remain stuck on 'running'. In reality, however, the workflow has completed (log analysis shows that the final step has produced results), and the tokens for the large model have already been consumed, which is extremely frustrating.
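
For reference, a sketch of that polling step, assuming Dify's documented run-detail endpoint GET /v1/workflows/run/:workflow_run_id (the base URL, key, and run id below are placeholders). With this bug, the loop never observes a terminal status even though the run has actually finished.

```python
import time
import requests

BASE_URL = "https://api.dify.ai/v1"  # SaaS endpoint; placeholder for self-hosted
API_KEY = "app-xxxxxxxx"             # workflow app API key (placeholder)
run_id = "..."                       # workflow_run_id returned when the run started

# Poll the run-detail endpoint until the status leaves 'running'.
# With this bug, the status stays 'running' forever even after completion.
while True:
    detail = requests.get(
        f"{BASE_URL}/workflows/run/{run_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    ).json()
    if detail["status"] != "running":  # expected: 'succeeded' / 'failed' / 'stopped'
        print("final status:", detail["status"])
        break
    time.sleep(5)
```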

BTW, I'm using SaaS Dify 0.15.3.

@kazuhisa-wada @WeakX @takatost

@kazuhisa-wada
Contributor Author

Just to confirm: was this closed because it has been consolidated into #14362?

@Zerglingzl

As far as I have experienced, the issue remains in Cloud version 0.15.3.

@Zerglingzl

In my understanding, 'running' forever is not an acceptable final state in any case. When the API caller ends the connection (abnormally or on purpose), the workflow status should reflect the TRUE state of the workflow. That means, if it runs to the end (which it in fact does, and tokens have been consumed), the status should be 'succeeded'.

@Zerglingzl

Update: I tried Ali cloud functions yesterday; unfortunately, the issue REMAINS the same. It seems this issue affects ALL BaaS-type backend callers, including WeChat cloud functions, Ali cloud functions, Amazon cloud functions, etc. Our business is stuck on that bug now; please fix it, or share any suggestions. Thanks a lot!

@Zerglingzl

Update 2: a callback (fired when the workflow is done) would be the perfect solution; a hypothetical sketch follows. Of course, that means a lot of work for the Dify engineers...
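
To illustrate the suggestion, here is a purely hypothetical receiver the API caller would host; nothing like this exists in Dify today, and the payload shape is invented for illustration. Dify would POST the final status to this URL when the workflow finishes, removing the need to hold a connection open or to poll.

```python
from flask import Flask, request

app = Flask(__name__)

# Hypothetical receiver the API caller would host. Nothing like this
# exists in Dify today; the payload shape below is purely illustrative.
@app.route("/dify-callback", methods=["POST"])
def dify_callback():
    payload = request.get_json()
    run_id = payload["workflow_run_id"]
    status = payload["status"]  # e.g. 'succeeded' / 'failed' / 'stopped'
    print(f"workflow {run_id} finished with status {status}")
    # ...retrieve outputs / notify downstream systems here...
    return "", 204
```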
