
Disconnection of network during workflow execution causes status stuck as 'running' due to main-thread kill while actual execution completes in backend thread #12798

Closed
kazuhisa-wada opened this issue Jan 16, 2025 · 14 comments
Labels
🌊 feat:workflow Workflow related stuff.

Comments

@kazuhisa-wada
Contributor

Self Checks

  • This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit issues in English, or they will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

v0.15.1

Cloud or Self Hosted

Cloud, Self Hosted (Docker), Self Hosted (Source)

Steps to reproduce

Disconnecting the network (*) between the client and Dify during the execution of a workflow via the API or the web UI (both debug run and webApp run) kills the main thread and causes the status to get stuck as 'running', while the execution itself, spawned by the main thread, actually completes. Once a workflow falls into this state, there is no way to stop it; e.g., the stop API doesn't work.

Disconnecting the network here means either of the following:

  • closing the browser, or a browser tab, while a workflow is executing via the web UI (both debug run and webApp run)
  • killing the process of the API-calling application by any means (a reproduction sketch for this case follows)
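
A minimal reproduction sketch for the API case, assuming a workflow app API key and the documented POST /v1/workflows/run endpoint (the base URL and key below are placeholders): it starts a streaming run and drops the connection after a few events, which has the same effect as killing the calling process.

```python
import requests

# Placeholders: adjust the base URL and workflow app API key for your deployment.
BASE_URL = "http://localhost/v1"
API_KEY = "app-xxxxxxxx"

# Start a workflow run in streaming mode (SSE).
resp = requests.post(
    f"{BASE_URL}/workflows/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"inputs": {}, "response_mode": "streaming", "user": "repro-user"},
    stream=True,
)

# Consume only the first few SSE lines, then drop the connection mid-stream.
# This has the same effect as killing the calling process while the run is active.
for i, line in enumerate(resp.iter_lines()):
    print(line[:80])
    if i >= 3:
        break
resp.close()  # abrupt close: the server-side response generator is torn down

# Afterwards, the run keeps status 'running' in the DB even once it finishes.
```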

This issue then causes the following bad behaviors:

  • there is no log for that workflow execution on the UI, even if a large number of tokens was consumed during the execution. As a consequence, users can't see any results of the execution, which is very confusing.
  • LLM tokens are not counted even though the workflow completes.

This issue seems serious, as the actions that cause it are very easy to perform; it actually happens frequently in our use.

I believe this is the result of imperfect handling of network disconnection, which can currently kill the main thread.

✔️ Expected Behavior

  • Regardless of network disconnection, the actual workflow execution result should be logged in the database, so users can see the results if the workflow ends normally.
  • Regardless of network disconnection, workflow execution should be stoppable via the workflow stop API.

❌ Actual Behavior

Described above.

@dosubot dosubot bot added the 🌊 feat:workflow Workflow related stuff. label Jan 16, 2025
@kazuhisa-wada
Contributor Author

I believe the root causes of this problem are as follows:

  1. The DB update performed when each node run finishes lives on the main-thread side. The DB status update for each node run is triggered once an event message, published by the backend thread, is received.
  2. The main thread is killed when a network disconnection happens (maybe when a TCP FIN is received).
  3. Log info is not displayed in the log tab on the UI while the status is still 'running'. This spec is a bit strange, so it should be addressed separately from points 1-2 above.

@kazuhisa-wada
Contributor Author

Sorry, I forgot to mention whether streaming was used in the API case. What I tested was only streaming mode. I have not tested the non-streaming case, but I believe it works fine, since non-streaming mode doesn't write anything to the streaming channel.

@WeakX

WeakX commented Jan 17, 2025

I've run into this issue too. The connection may be closed while the response is streaming; Flask then closes the response generator, which raises GeneratorExit in the main thread (see the sketch below).

Because the main thread is also responsible for DB operations, once it exits, the workflow record (as well as the workflow_node_execution record, message record, etc.) is no longer updated.

This only happens in streaming mode.

I think it would be better to decouple the DB operations and the response from the main thread.
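
A minimal sketch of that failure mode, assuming a plain Flask streaming endpoint (the names run_workflow_events and mark_run_succeeded are illustrative stand-ins, not Dify's actual code): when the client disconnects, the WSGI server closes the generator, GeneratorExit is raised at the suspended yield, and the DB update after the loop never runs.

```python
from flask import Flask, Response

app = Flask(__name__)

def run_workflow_events():
    # Illustrative stand-in for the backend event queue: yields streamed events.
    for i in range(100):
        yield f"data: event {i}\n\n"

def mark_run_succeeded():
    # Illustrative stand-in for the DB update that flips the status off 'running'.
    print("workflow_run.status = succeeded")

@app.route("/run")
def run():
    def generate():
        try:
            for event in run_workflow_events():
                # If the client disconnects, the WSGI server closes this
                # generator and GeneratorExit is raised at this yield.
                yield event
            mark_run_succeeded()  # never reached after a disconnect
        except GeneratorExit:
            # Cleanup runs, but everything after the loop is abandoned,
            # so the run record stays 'running' forever.
            raise
    return Response(generate(), mimetype="text/event-stream")
```

This is why the run record stays 'running' even though the backend thread finishes its work.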

@kazuhisa-wada
Contributor Author

kazuhisa-wada commented Jan 17, 2025

Sounds good. Here is my understanding:

main thread

  1. create threads for node execution
  2. listen for messages from the backend execution to receive events via a standard Python queue
  3. update the message when a message is received
  4. write the response when a message is received or a ping event happens

backend thread

  1. execute the node execution
  2. publish messages (i.e., create events and throw them) via a standard Python queue

A better way would be to move no. 3 to an additional new thread (see the sketch below).
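
A sketch of that decoupling (all names here are illustrative, not Dify's actual classes): the backend fans each event out to two queues, a dedicated persistence thread owns every DB write, and the main thread only streams, so a client disconnect can no longer lose updates.

```python
import queue
import threading

# Two consumers: the streaming response and the DB writer. The backend
# fans each event out to both queues so neither depends on the other.
db_queue: "queue.Queue[dict]" = queue.Queue()
response_queue: "queue.Queue[dict]" = queue.Queue()

def publish(event: dict) -> None:
    db_queue.put(event)
    response_queue.put(event)

def backend_thread() -> None:
    # Stand-in for node execution (steps 1-2 of "backend thread" above).
    for i in range(3):
        publish({"type": "node_finished", "node": i})
    publish({"type": "workflow_finished"})

def save_to_db(event: dict) -> None:
    # Hypothetical DB helper, for illustration only.
    print("DB update:", event)

def persistence_thread() -> None:
    # The moved "no. 3": owns all DB writes and survives client disconnects.
    while True:
        event = db_queue.get()
        save_to_db(event)
        if event["type"] == "workflow_finished":
            return

threading.Thread(target=backend_thread).start()
threading.Thread(target=persistence_thread).start()

# The main thread now only streams response_queue to the client; if the
# client disconnects, only this loop dies -- the DB writes continue above.
while True:
    event = response_queue.get()
    print("stream:", event)
    if event["type"] == "workflow_finished":
        break
```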

@WeakX

WeakX commented Jan 17, 2025

> Sounds good. Here is my understanding:
>
> main thread
>
>   1. create threads for node execution
>   2. listen for messages from the backend execution to receive events via a standard Python queue
>   3. update the message when a message is received
>   4. write the response when a message is received or a ping event happens
>
> backend thread
>
>   1. execute the node execution
>   2. publish messages (i.e., create events and throw them) via a standard Python queue
>
> A better way would be to move no. 3 to an additional new thread.

agreed


dosubot bot commented Feb 17, 2025

Hi, @kazuhisa-wada. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • Network disconnection during workflow execution causes the main thread to terminate.
  • Workflow status remains 'running' despite backend completion, with no logging or API control.
  • @WeakX confirmed the issue, noting it occurs in streaming mode due to main thread handling.
  • Suggested resolution: Decouple DB operations from the main thread.

Next Steps:

  • Is this issue still relevant to the latest version of the Dify repository? If so, please comment to keep the discussion open.
  • Otherwise, this issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 17, 2025
@kazuhisa-wada
Contributor Author

> Is this issue still relevant to the latest version of the Dify repository? If so, please comment to keep the discussion open.

Definitely yes. I also want to emphasize that this issue is fairly critical in terms of customer experience, and so should be dealt with for v1.0.0, although it may be a difficult problem.

@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 17, 2025

dosubot bot commented Feb 17, 2025

@takatost, the user @kazuhisa-wada has confirmed that this issue is still relevant and critical for customer experience, especially for v1.0.0. Could you please assist them with this matter?

@Zerglingzl
Copy link

I encountered the same issue and sincerely hope that Dify can resolve it as soon as possible; otherwise, it will significantly impact our product’s user experience.

My situation might be somewhat unique, yet it makes reproducing the issue even easier: basically, it occurs over 50% of the time.

The business logic is as follows:
1. I use WeChat cloud functions to call Dify's workflow API. Once the workflow starts running, to avoid waiting too long for a result, the cloud function exits and is terminated.
2. I use WeChat cloud functions to poll the workflow status periodically. If the status becomes 'succeeded', I retrieve the result (a polling sketch follows the next paragraph).

As a result, the problem described in this issue (#12798) keeps occurring, causing the workflow status to remain stuck on 'running'. In reality, however, the workflow has completed (log analysis shows that the final step has produced results), and the tokens for the large model have already been consumed, which is extremely frustrating.
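
For reference, a sketch of that polling step, assuming Dify's documented run-detail endpoint GET /v1/workflows/run/:workflow_run_id (the base URL, key, and run id below are placeholders). With this bug, the loop never observes a terminal status even though the run has actually finished.

```python
import time
import requests

BASE_URL = "https://api.dify.ai/v1"  # SaaS endpoint; placeholder for self-hosted
API_KEY = "app-xxxxxxxx"             # workflow app API key (placeholder)
run_id = "..."                       # workflow_run_id returned when the run started

# Poll the run-detail endpoint until the status leaves 'running'.
# With this bug, the status stays 'running' forever even after completion.
while True:
    detail = requests.get(
        f"{BASE_URL}/workflows/run/{run_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    ).json()
    if detail["status"] != "running":  # expected: 'succeeded' / 'failed' / 'stopped'
        print("final status:", detail["status"])
        break
    time.sleep(5)
```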

BTW, I'm using SaaS Dify 0.15.3.

@kazuhisa-wada @WeakX @takatost

@kazuhisa-wada
Contributor Author

Just to confirm: was this closed because it has been consolidated into #14362?

@Zerglingzl

As far as I have experienced, the issue remains in Cloud version 0.15.3.

@Zerglingzl

In my understanding, 'running' forever is not an acceptable final state in any case. When the API caller ends the connection (abnormally or on purpose), the workflow status should reflect the TRUE state of the workflow. That means, if it runs to the end (which it in fact does, and tokens have been consumed), the status should be 'succeeded'.

@Zerglingzl

Update: I tried Ali cloud functions yesterday; unfortunately, the issue REMAINS the same. It seems this issue affects ALL BaaS-type backend callers, including WeChat cloud functions, Ali cloud functions, Amazon cloud functions, etc. Our business is stuck on that bug now; please fix it, or share any suggestions. Thanks a lot!

@Zerglingzl

Update 2: a callback (fired when the workflow is done) would be the perfect solution; a hypothetical sketch follows. Of course, that means a lot of work for the Dify engineers...
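
To illustrate the suggestion, here is a purely hypothetical receiver the API caller would host; nothing like this exists in Dify today, and the payload shape is invented for illustration. Dify would POST the final status to this URL when the workflow finishes, removing the need to hold a connection open or to poll.

```python
from flask import Flask, request

app = Flask(__name__)

# Hypothetical receiver the API caller would host. Nothing like this
# exists in Dify today; the payload shape below is purely illustrative.
@app.route("/dify-callback", methods=["POST"])
def dify_callback():
    payload = request.get_json()
    run_id = payload["workflow_run_id"]
    status = payload["status"]  # e.g. 'succeeded' / 'failed' / 'stopped'
    print(f"workflow {run_id} finished with status {status}")
    # ...retrieve outputs / notify downstream systems here...
    return "", 204
```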
