HTCondor Jobs are failing when cache_task_completion = True #198
Comments
Hey @cverstege , thanks for reporting. To be honest, the
confuses me a bit. Could you check where exactly this is coming from?
Yup, I added the caching to luigi a while back for exactly that use case 😅
I'm not really sure, to be honest. This traceback is all I have from the HTCondor jobs. Might this originate from me using a local luigi scheduler? But then, this works fine (without any error message) when I just disable the task completion cache.
I can try to set up an interactive HTCondor job and reproduce this issue. But I doubt that I can get any additional useful information, as I will probably just get the same traceback.
Looking at the traceback, So might this actually be a bug in luigi or python?
Would it make sense in your use case to enable
[luigi_worker]
cache_task_completion: $ENV_IS_LOCAL
in your law.cfg with
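To illustrate the idea behind this suggestion, here is a minimal standalone sketch of how a config value such as `$ENV_IS_LOCAL` could be turned into a boolean depending on the environment. It does not use law's own config loader. The function name `resolve_cache_flag` and the list of truthy strings are hypothetical, and whether law expands the variable in exactly this way is an assumption.

```python
import configparser
from string import Template

# Config text taken from the suggestion above.
CFG_TEXT = """
[luigi_worker]
cache_task_completion: $ENV_IS_LOCAL
"""


def resolve_cache_flag(cfg_text, environ):
    """Return True only if the referenced variable expands to a truthy string."""
    # Disable configparser's own "%" interpolation; "$" is handled below.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    raw = parser.get("luigi_worker", "cache_task_completion")
    # Substitute $VARS from the given mapping; unknown variables stay as-is.
    expanded = Template(raw).safe_substitute(environ)
    # Hypothetical truthy set; an unset variable therefore resolves to False.
    return expanded.strip().lower() in ("1", "true", "yes", "on")
```

With this, a local shell that exports `ENV_IS_LOCAL=True` would get the cache, while a remote job that sets it to `False` (or does not set it at all) would not.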
That would work as a workaround. Maybe this should be disabled by default for remote jobs in law directly, though?
@cverstege I finally had time to look deeper into this and I think I understand the problem now. However, it seems to have nothing to do with the task completion cache itself, but with the port reservation of the default multiprocessing SyncManager in busy environments (such as lxplus) where lots of user code relies on open ports under the hood. For a similar issue, I pushed d5697ea yesterday, which causes law to use a different port for its manager, one that hopefully does not clash with the one needed by luigi workers. Does the issue still show up with that change? If so, I would create a law-based patch that changes the luigi behavior as well.
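For readers unfamiliar with the mechanism discussed here, the following is a rough sketch of starting a `multiprocessing` `SyncManager` on an explicitly chosen TCP address instead of the default one. It is not the change made in d5697ea and not luigi's code. The helper names, the loopback host, the auth key, and the use of the "fork" start method (POSIX only) are assumptions made for illustration.

```python
import multiprocessing
from multiprocessing.managers import SyncManager


def start_manager_on_free_port(host="127.0.0.1", authkey=b"example-key"):
    """Start a SyncManager on an OS-assigned TCP port and return it."""
    # "fork" lets this sketch run without a __main__ guard (POSIX only).
    ctx = multiprocessing.get_context("fork")
    # Port 0 asks the OS for any currently free port instead of a fixed one.
    manager = SyncManager(address=(host, 0), authkey=authkey, ctx=ctx)
    manager.start()
    return manager


def demo():
    """Store one entry in a shared dict and report the port that was used."""
    manager = start_manager_on_free_port()
    try:
        port = manager.address[1]  # actual port, known after start()
        cache = manager.dict()     # shared dict, similar in spirit to a completion cache
        cache["task_a"] = True
        return port, cache.copy()  # copy() returns a plain dict
    finally:
        manager.shutdown()
```

Letting the OS pick the port avoids hard-coding a number that another process on a busy node might already hold, at the cost of having to read the port back from `manager.address` after startup.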
Thank you for looking into it. If I find the time in the coming weeks to test, I will let you know. I'm quite busy with other stuff, atm.
When the following is set in the luigi config:
(This is recommended, e.g., for running large local workflows.)
The following errors are appearing in (some) remote HTCondor jobs. I guess one single job per Worker Node is starting fine:
Disabling the task completion cache fixes this issue.
I'm using law version 0.1.20