
HTCondor Jobs are failing when cache_task_completion = True #198

Open
cverstege opened this issue Jan 9, 2025 · 8 comments

@cverstege (Contributor)

When the following is set in the luigi config:

[worker]
cache_task_completion: True

(This is e.g. recommended for running large local workflows.)

The following errors appear in (some) remote HTCondor jobs. My guess is that only a single job per worker node starts up fine:

Process SyncManager-1:
Traceback (most recent call last):
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/managers.py", line 583, in _run_server
    server = cls._Server(registry, address, authkey, serializer)
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/managers.py", line 156, in __init__
    self.listener = Listener(address=address, backlog=16)
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/connection.py", line 453, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/connection.py", line 596, in __init__
    self._socket.bind(address)
OSError: [Errno 98] Address already in use
ERROR: luigi-interface - Uncaught exception in luigi
Traceback (most recent call last):
  File "/srv/job_VG35uuMJ5QCe/repo/luigi/luigi/retcodes.py", line 75, in run_with_retcodes
    worker = luigi.interface._run(argv).worker
  File "/srv/job_VG35uuMJ5QCe/repo/luigi/luigi/interface.py", line 217, in _run
    return _schedule_and_run([cp.get_task_obj()], worker_scheduler_factory)
  File "/srv/job_VG35uuMJ5QCe/repo/law/law/patches.py", line 94, in _schedule_and_run
    return _schedule_and_run_orig(*args, **kwargs)
  File "/srv/job_VG35uuMJ5QCe/repo/luigi/luigi/interface.py", line 168, in _schedule_and_run
    worker = worker_scheduler_factory.create_worker(
  File "/srv/job_VG35uuMJ5QCe/repo/law/law/patches.py", line 282, in create_worker
    worker = luigi.worker.Worker(scheduler=scheduler, worker_processes=worker_processes,
  File "/srv/job_VG35uuMJ5QCe/repo/luigi/luigi/worker.py", line 609, in __init__
    self._task_completion_cache = multiprocessing.Manager().dict()
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/managers.py", line 558, in start
    self._address = reader.recv()
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
EOFError
ERROR: Uncaught exception in luigi
Traceback (most recent call last):
  File "/srv/job_VG35uuMJ5QCe/repo/luigi/luigi/retcodes.py", line 75, in run_with_retcodes
    worker = luigi.interface._run(argv).worker
  File "/srv/job_VG35uuMJ5QCe/repo/luigi/luigi/interface.py", line 217, in _run
    return _schedule_and_run([cp.get_task_obj()], worker_scheduler_factory)
  File "/srv/job_VG35uuMJ5QCe/repo/law/law/patches.py", line 94, in _schedule_and_run
    return _schedule_and_run_orig(*args, **kwargs)
  File "/srv/job_VG35uuMJ5QCe/repo/luigi/luigi/interface.py", line 168, in _schedule_and_run
    worker = worker_scheduler_factory.create_worker(
  File "/srv/job_VG35uuMJ5QCe/repo/law/law/patches.py", line 282, in create_worker
    worker = luigi.worker.Worker(scheduler=scheduler, worker_processes=worker_processes,
  File "/srv/job_VG35uuMJ5QCe/repo/luigi/luigi/worker.py", line 609, in __init__
    self._task_completion_cache = multiprocessing.Manager().dict()
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/managers.py", line 558, in start
    self._address = reader.recv()
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
EOFError
task exit code: 60

Disabling the task completion cache fixes this issue.

I'm using law version 0.1.20.

@riga (Owner) commented Jan 9, 2025

Hey @cverstege,

thanks for reporting.

To be honest, the

OSError: [Errno 98] Address already in use

confuses me a bit. Could you check where exactly this is coming from?


> (This is e.g. recommended for running large local workflows.)

Yup, I added the caching to luigi a while back for exactly that use case 😅

@cverstege (Contributor, Author)

I'm not really sure, to be honest. This traceback is all I have from the HTCondor jobs.
The job executes law run, which then fails with this error.

This might originate from me using a local luigi scheduler? Then again, everything works fine (without any error message) when I simply disable the task completion cache.

@cverstege (Contributor, Author)

I can try to set up an interactive HTCondor job and reproduce the issue there. But I doubt I'll get any additional useful information, as I will probably just see the same traceback.

@cverstege (Contributor, Author) commented Jan 9, 2025

Looking at the traceback, the OSError: [Errno 98] Address already in use seems to originate from Python's built-in multiprocessing library.
Is Python using network sockets to communicate between multiple processes? I guess this is needed for the multiprocessing-safe dict used in luigi:
https://github.com/spotify/luigi/blob/b2b3b58b0b23745b15b032494cb95eb8b07c0f02/luigi/worker.py#L606-L609

So might this actually be a bug in luigi or Python?
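
To illustrate that question, here is a minimal, self-contained sketch (not luigi's actual code) of what the linked lines boil down to: multiprocessing.Manager() spawns a separate SyncManager server process that listens on a freshly bound socket, and the returned dict is a proxy whose accesses all travel over that connection.

import multiprocessing

if __name__ == "__main__":
    # spawns a SyncManager server process listening on a new socket;
    # binding that socket is what fails with Errno 98 in the jobs above
    manager = multiprocessing.Manager()

    # DictProxy: every read/write is sent to the server process
    cache = manager.dict()
    cache["TaskA()"] = True
    print(cache["TaskA()"])  # -> True, fetched from the server

    manager.shutdown()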

@riga (Owner) commented Jan 9, 2025

Would it make sense in your use case to enable cache_task_completion only in your local environment, and have it disabled in remote jobs? E.g. via

[luigi_worker]
cache_task_completion: $ENV_IS_LOCAL

in your law.cfg, with ENV_IS_LOCAL defined in your setup script, for instance.
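
The same toggle could also be set programmatically; a hedged sketch (the environment variable name is illustrative, and this would have to run before the worker is created):

import os

from luigi import configuration

# assume the local setup script does `export ENV_IS_LOCAL=True`;
# in remote jobs the variable is unset, so the cache stays off
is_local = os.environ.get("ENV_IS_LOCAL") == "True"
configuration.get_config().set("worker", "cache_task_completion", str(is_local))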

@cverstege (Contributor, Author)

That would work as a workaround. Maybe law itself should disable it by default in remote jobs, though? A sketch of that idea follows below.
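
Purely as an illustration (not law's actual code; is_remote_job and the marker variable are hypothetical, though law's patches.py already wraps create_worker, as the tracebacks above show), such a default could be enforced with a small monkey patch:

import os

import luigi.worker

def is_remote_job():
    # hypothetical detection via a marker variable exported by the job
    # wrapper; law would use its own job context instead
    return os.environ.get("LAW_REMOTE_JOB") == "1"

_orig_init = luigi.worker.Worker.__init__

def _patched_init(self, *args, **kwargs):
    # force-disable the completion cache in remote jobs; keyword arguments
    # are forwarded to luigi's worker config object
    if is_remote_job():
        kwargs["cache_task_completion"] = False
    _orig_init(self, *args, **kwargs)

luigi.worker.Worker.__init__ = _patched_init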

@riga (Owner) commented Mar 12, 2025

@cverstege I finally had time to look deeper into this, and I think I understand the problem now. It seems to have nothing to do with the task completion cache itself, but rather with the port reservation of the default multiprocessing SyncManager in busy environments (such as lxplus) where lots of user code relies on open ports under the hood.

For a similar issue, I pushed d5697ea yesterday, which makes law use a different port for its manager that hopefully does not clash with the one needed by luigi workers. Does the issue still show up with that change?

If so, I would create a law-based patch that changes the luigi behavior as well.
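
For reference, a minimal sketch of the underlying idea (not the actual commit; the address and authkey are illustrative): binding the manager to an explicit address with port 0 lets the OS pick a free port, which sidesteps the Errno 98 clash.

from multiprocessing.managers import SyncManager

if __name__ == "__main__":
    # port 0 asks the OS for any currently free port
    manager = SyncManager(address=("127.0.0.1", 0), authkey=b"not-a-real-key")
    manager.start()
    print("manager listening on", manager.address)  # actual (host, port)

    cache = manager.dict()
    cache["done"] = True
    manager.shutdown()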

@cverstege (Contributor, Author)

Thank you for looking into it. If I find the time in the coming weeks to test it, I will let you know; I'm quite busy with other things at the moment.
But I think this commit will fix the issue.
