
Reduce job submission count #134

Open
pdeperio opened this issue Oct 31, 2017 · 5 comments
pdeperio commented Oct 31, 2017

We need to reduce the number of short jobs being submitted on Midway. Some possible solutions, which may or may not be combined:

  1. Bundling runs (which should be fine since we're not running very long pax processing anymore) so each job runs longer,

  2. Using job arrays to reduce the number of jobs the scheduler handles (I think); see the sketch after this list,

  3. Running Corrections locally (this seems to be fast now after previous hax improvements) and implementing local checks for intensive processes (e.g. AddChecksum, ProcessBatchQueueHax) before submitting jobs that actually run those tasks,

  4. Adding minitrees to RunsDB, to facilitate the local checking in 3.
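
For 1 and 2 together, a minimal sketch of what bundling plus a SLURM job array could look like on Midway (the chunk size, partition name, and per-run command are placeholders, not the real cax invocation):

```python
# Hypothetical sketch: bundle runs and submit one SLURM job array instead
# of one job per run (chunk size, partition, and per-run command are
# placeholders, not the real cax invocation).
import subprocess

def submit_bundled(run_names, runs_per_job=10, partition="xenon1t"):
    chunks = [run_names[i:i + runs_per_job]
              for i in range(0, len(run_names), runs_per_job)]

    # One line per bundle; SLURM_ARRAY_TASK_ID picks the line at run time,
    # so the scheduler tracks len(chunks) array tasks, not one job per run.
    with open("runlists.txt", "w") as f:
        for chunk in chunks:
            f.write(" ".join(chunk) + "\n")

    script = f"""#!/bin/bash
#SBATCH --partition={partition}
#SBATCH --array=1-{len(chunks)}
RUNS=$(sed -n "${{SLURM_ARRAY_TASK_ID}}p" runlists.txt)
for RUN in $RUNS; do
    process_one_run "$RUN"   # placeholder for the actual per-run command
done
"""
    with open("submit_bundle.sbatch", "w") as f:
        f.write(script)
    subprocess.check_call(["sbatch", "submit_bundle.sbatch"])
```

Each array task then runs a whole bundle, so both the job count and the per-job overhead go down.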

lucrlom commented Nov 3, 2017

So, I did a first test on the datamanager, adding the correction tasks together with the checksum, but they take too much time because each task runs over all the runs, and in the meantime the runs are waiting to be verified.
I'll try to create a new cax session in parallel to check whether it works in a reasonable time.

lucrlom commented Nov 3, 2017

Test definitely negative! Even with a new cax process, AddElectronLifetime and AddGains take a huge amount of time to run over all the runs. On a single run each is fast, but `cax --once --config ...` took a very long time.
I think the bottleneck is that each task runs over all runs before moving on to the next task; it might be more efficient to do the opposite: for each run, do all tasks (see the sketch below the memory numbers).
In any case we saturate the RAM (in MB) on the datamanager machine:

```
             total       used       free     shared    buffers     cached
Mem:         20424      20423          0       2649          0      17792
-/+ buffers/cache:       2631      17792
Swap:         1024       1023          0
```
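
A minimal sketch of the reordering I mean (dummy task objects; the real cax task API differs, this only illustrates the iteration order):

```python
# Dummy sketch of the loop reordering (placeholder Task objects; the real
# cax task API differs, this only illustrates the iteration order).
class Task:
    def __init__(self, name):
        self.name = name

    def process(self, run):
        print(f"{self.name}: {run}")

tasks = [Task("AddChecksum"), Task("AddElectronLifetime"), Task("AddGains")]
runs = ["run_001", "run_002", "run_003"]

# Current behaviour (as I understand it): each task sweeps all runs, so
# every run waits for the full sweep before the next task starts on it.
for task in tasks:
    for run in runs:
        task.process(run)

# Proposed: finish all tasks for one run before moving to the next, so
# each run is fully verified as soon as possible.
for run in runs:
    for task in tasks:
        task.process(run)
```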

pdeperio commented Nov 8, 2017

I think this is related to #108 and #114, which we never understood, i.e.:

  1. Why is it looping over all runs per task? I thought it was looping over each task per run.
  2. Why does it skip tasks?

Please review that issue and PR.

lucrlom commented Nov 17, 2017

Hi, maybe I found a way to stop massive-cax from submitting thousands of useless jobs.

Basically, I added a check on the variables present in the RunsDB "processor" field, verifying that all entries of "correction_versions" are present.
Only if that is true does the code generate the script to submit the jobs.

Of course, another check is whether the processed and minitree files already exist in the local directories on Midway, and determining which host the code is running on (see the sketch below).
I still have to complete the code, but in the first test it works.
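
Roughly what the check looks like (a simplified sketch; the field layout, file names, and host test are placeholders for the real cax setup):

```python
# Simplified sketch of the pre-submission check (the run document layout,
# file names, and host test are placeholders for the real cax setup).
import os
import socket

def corrections_complete(run_doc, required_corrections):
    """True only if every expected entry is present in
    run_doc['processor']['correction_versions']."""
    versions = run_doc.get("processor", {}).get("correction_versions", {})
    return all(key in versions for key in required_corrections)

def outputs_present(run_name, processed_dir, minitree_dir):
    """Check whether the processed and minitree files already exist locally."""
    return (os.path.exists(os.path.join(processed_dir, run_name + ".root")) and
            os.path.exists(os.path.join(minitree_dir, run_name + ".root")))

def should_submit(run_doc, required_corrections, processed_dir, minitree_dir):
    if "midway" not in socket.gethostname():
        return False  # only generate submission scripts on Midway (placeholder)
    if not corrections_complete(run_doc, required_corrections):
        return False  # corrections not filled in yet, the job would be useless
    if outputs_present(run_doc["name"], processed_dir, minitree_dir):
        return False  # already processed, nothing to do
    return True
```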

XeBoris commented Nov 19, 2017

@lucrlom I think we can raise the memory on xetransfer for the virtual machine xe1t-datamanager if it helps. In any case I run two cax-like sessions (massive-cax and massive-ruciax) with the user xe1ttransfer. Each process needs ~12 GB of memory. I haven't yet understood why these processes need so much memory (it seems a lot to me).
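
To see where the memory goes, I could instrument the sessions with something like this (a rough sketch using psutil; how the two sessions appear in the process table is a guess, so the matched fragments may need adjusting):

```python
# Rough sketch: periodically log resident memory of the two sessions with
# psutil (the matched command-line fragments are a guess at how
# massive-cax/massive-ruciax appear in the process table).
import time
import psutil

def log_memory(fragments=("massive-cax", "massive-ruciax"), interval=60):
    while True:
        for p in psutil.process_iter(["pid", "cmdline", "memory_info"]):
            cmd = " ".join(p.info["cmdline"] or [])
            if any(frag in cmd for frag in fragments):
                rss_gb = p.info["memory_info"].rss / 1e9
                print(f"pid {p.info['pid']}: {rss_gb:.1f} GB RSS ({cmd[:60]})")
        time.sleep(interval)
```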
