
HTCondor Jobs are failing when cache_task_completion = True #198

Open
cverstege opened this issue Jan 9, 2025 · 8 comments

@cverstege (Contributor)

When the following is set in the luigi config:

[worker]
cache_task_completion: True

(This is e.g. recommended for running large local workflows.)

The following errors appear in (some) remote HTCondor jobs. My guess is that only a single job per worker node starts up fine:

Process SyncManager-1:
Traceback (most recent call last):
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/managers.py", line 583, in _run_server
    server = cls._Server(registry, address, authkey, serializer)
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/managers.py", line 156, in __init__
    self.listener = Listener(address=address, backlog=16)
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/connection.py", line 453, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/connection.py", line 596, in __init__
    self._socket.bind(address)
OSError: [Errno 98] Address already in use
ERROR: luigi-interface - Uncaught exception in luigi
Traceback (most recent call last):
  File "/srv/job_VG35uuMJ5QCe/repo/luigi/luigi/retcodes.py", line 75, in run_with_retcodes
    worker = luigi.interface._run(argv).worker
  File "/srv/job_VG35uuMJ5QCe/repo/luigi/luigi/interface.py", line 217, in _run
    return _schedule_and_run([cp.get_task_obj()], worker_scheduler_factory)
  File "/srv/job_VG35uuMJ5QCe/repo/law/law/patches.py", line 94, in _schedule_and_run
    return _schedule_and_run_orig(*args, **kwargs)
  File "/srv/job_VG35uuMJ5QCe/repo/luigi/luigi/interface.py", line 168, in _schedule_and_run
    worker = worker_scheduler_factory.create_worker(
  File "/srv/job_VG35uuMJ5QCe/repo/law/law/patches.py", line 282, in create_worker
    worker = luigi.worker.Worker(scheduler=scheduler, worker_processes=worker_processes,
  File "/srv/job_VG35uuMJ5QCe/repo/luigi/luigi/worker.py", line 609, in __init__
    self._task_completion_cache = multiprocessing.Manager().dict()
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/managers.py", line 558, in start
    self._address = reader.recv()
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
EOFError
ERROR: Uncaught exception in luigi
Traceback (most recent call last):
  File "/srv/job_VG35uuMJ5QCe/repo/luigi/luigi/retcodes.py", line 75, in run_with_retcodes
    worker = luigi.interface._run(argv).worker
  File "/srv/job_VG35uuMJ5QCe/repo/luigi/luigi/interface.py", line 217, in _run
    return _schedule_and_run([cp.get_task_obj()], worker_scheduler_factory)
  File "/srv/job_VG35uuMJ5QCe/repo/law/law/patches.py", line 94, in _schedule_and_run
    return _schedule_and_run_orig(*args, **kwargs)
  File "/srv/job_VG35uuMJ5QCe/repo/luigi/luigi/interface.py", line 168, in _schedule_and_run
    worker = worker_scheduler_factory.create_worker(
  File "/srv/job_VG35uuMJ5QCe/repo/law/law/patches.py", line 282, in create_worker
    worker = luigi.worker.Worker(scheduler=scheduler, worker_processes=worker_processes,
  File "/srv/job_VG35uuMJ5QCe/repo/luigi/luigi/worker.py", line 609, in __init__
    self._task_completion_cache = multiprocessing.Manager().dict()
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/managers.py", line 558, in start
    self._address = reader.recv()
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.9.12-9a1bc/x86_64-centos7-gcc12-opt/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
EOFError
task exit code: 60

Disabling the task completion cache fixes this issue.

I'm using law version 0.1.20.

@riga (Owner) commented Jan 9, 2025

Hey @cverstege,

thanks for reporting.

To be honest, the

OSError: [Errno 98] Address already in use

confuses me a bit. Could you check where exactly this is coming from?


> (This is e.g. recommended for running large local workflows.)

Yup, I added the caching to luigi a while back for exactly that use case 😅

@cverstege (Contributor, Author)

I'm not really sure, to be honest. This traceback is all I have from the HTCondor jobs.
The job executes law run, which then fails with this error.

This might originate from me using a local luigi scheduler? Then again, everything works fine (without any error message) when I simply disable the task completion cache.

@cverstege (Contributor, Author)

I can try to set up an interactive HTCondor job and reproduce the issue there. But I doubt I'll get any additional useful information, as I will probably just see the same traceback.

@cverstege (Contributor, Author) commented Jan 9, 2025

Looking at the traceback, the OSError: [Errno 98] Address already in use seems to originate from Python's built-in multiprocessing library.
Is Python using network sockets to communicate between multiple processes? I guess this is needed for the multiprocessing-safe dict used in luigi:
https://github.com/spotify/luigi/blob/b2b3b58b0b23745b15b032494cb95eb8b07c0f02/luigi/worker.py#L606-L609

So might this actually be a bug in luigi or Python?
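
To illustrate that question, here is a minimal, self-contained sketch (not luigi's actual code) of what the linked lines boil down to: multiprocessing.Manager() spawns a separate SyncManager server process that listens on a freshly bound socket, and the returned dict is a proxy whose accesses all travel over that connection.

import multiprocessing

if __name__ == "__main__":
    # spawns a SyncManager server process listening on a new socket;
    # binding that socket is what fails with Errno 98 in the jobs above
    manager = multiprocessing.Manager()

    # DictProxy: every read/write is sent to the server process
    cache = manager.dict()
    cache["TaskA()"] = True
    print(cache["TaskA()"])  # -> True, fetched from the server

    manager.shutdown()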

@riga (Owner) commented Jan 9, 2025

Would it make sense in your use case to enable cache_task_completion only in your local environment, and have it disabled in remote jobs? E.g. via

[luigi_worker]
cache_task_completion: $ENV_IS_LOCAL

in your law.cfg, with ENV_IS_LOCAL defined in your setup script, for instance.
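
The same toggle could also be set programmatically; a hedged sketch (the environment variable name is illustrative, and this would have to run before the worker is created):

import os

from luigi import configuration

# assume the local setup script does `export ENV_IS_LOCAL=True`;
# in remote jobs the variable is unset, so the cache stays off
is_local = os.environ.get("ENV_IS_LOCAL") == "True"
configuration.get_config().set("worker", "cache_task_completion", str(is_local))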

@cverstege (Contributor, Author)

That would work as a workaround. Maybe law itself should disable it by default in remote jobs, though? A sketch of that idea follows below.
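
Purely as an illustration (not law's actual code; is_remote_job and the marker variable are hypothetical, though law's patches.py already wraps create_worker, as the tracebacks above show), such a default could be enforced with a small monkey patch:

import os

import luigi.worker

def is_remote_job():
    # hypothetical detection via a marker variable exported by the job
    # wrapper; law would use its own job context instead
    return os.environ.get("LAW_REMOTE_JOB") == "1"

_orig_init = luigi.worker.Worker.__init__

def _patched_init(self, *args, **kwargs):
    # force-disable the completion cache in remote jobs; keyword arguments
    # are forwarded to luigi's worker config object
    if is_remote_job():
        kwargs["cache_task_completion"] = False
    _orig_init(self, *args, **kwargs)

luigi.worker.Worker.__init__ = _patched_init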

@riga (Owner) commented Mar 12, 2025

@cverstege I finally had time to look deeper into this, and I think I understand the problem now. It seems to have nothing to do with the task completion cache itself, but rather with the port reservation of the default multiprocessing SyncManager in busy environments (such as lxplus) where lots of user code relies on open ports under the hood.

For a similar issue, I pushed d5697ea yesterday, which makes law use a different port for its manager that hopefully does not clash with the one needed by luigi workers. Does the issue still show up with that change?

If so, I would create a law-based patch that changes the luigi behavior as well.
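
For reference, a minimal sketch of the underlying idea (not the actual commit; the address and authkey are illustrative): binding the manager to an explicit address with port 0 lets the OS pick a free port, which sidesteps the Errno 98 clash.

from multiprocessing.managers import SyncManager

if __name__ == "__main__":
    # port 0 asks the OS for any currently free port
    manager = SyncManager(address=("127.0.0.1", 0), authkey=b"not-a-real-key")
    manager.start()
    print("manager listening on", manager.address)  # actual (host, port)

    cache = manager.dict()
    cache["done"] = True
    manager.shutdown()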

@cverstege (Contributor, Author)

Thank you for looking into it. If I find the time in the coming weeks to test it, I will let you know; I'm quite busy with other things at the moment.
But I think this commit will fix the issue.
