
Reduce job submission count #134

Open
pdeperio opened this issue Oct 31, 2017 · 5 comments
pdeperio commented Oct 31, 2017

We need to reduce the number of short jobs being submitted on Midway. Some possible solutions, which may or may not be combined:

  1. Bundling runs (which should be fine since we're not running very long pax processing anymore) so each job runs longer,

  2. Using job arrays to reduce the number of jobs the scheduler handles (I think); see the sketch after this list,

  3. Running Corrections locally (this seems to be fast now after previous hax improvements) and implementing local checks for intensive processes (e.g. AddChecksum, ProcessBatchQueueHax) before submitting jobs that actually run those tasks,

  4. Adding minitrees to RunsDB, to facilitate the local checking in 3.
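
For 1 and 2 together, a minimal sketch of what bundling plus a SLURM job array could look like on Midway (the chunk size, partition name, and per-run command are placeholders, not the real cax invocation):

```python
# Hypothetical sketch: bundle runs and submit one SLURM job array instead
# of one job per run (chunk size, partition, and per-run command are
# placeholders, not the real cax invocation).
import subprocess

def submit_bundled(run_names, runs_per_job=10, partition="xenon1t"):
    chunks = [run_names[i:i + runs_per_job]
              for i in range(0, len(run_names), runs_per_job)]

    # One line per bundle; SLURM_ARRAY_TASK_ID picks the line at run time,
    # so the scheduler tracks len(chunks) array tasks, not one job per run.
    with open("runlists.txt", "w") as f:
        for chunk in chunks:
            f.write(" ".join(chunk) + "\n")

    script = f"""#!/bin/bash
#SBATCH --partition={partition}
#SBATCH --array=1-{len(chunks)}
RUNS=$(sed -n "${{SLURM_ARRAY_TASK_ID}}p" runlists.txt)
for RUN in $RUNS; do
    process_one_run "$RUN"   # placeholder for the actual per-run command
done
"""
    with open("submit_bundle.sbatch", "w") as f:
        f.write(script)
    subprocess.check_call(["sbatch", "submit_bundle.sbatch"])
```

Each array task then runs a whole bundle, so both the job count and the per-job overhead go down.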

lucrlom commented Nov 3, 2017

So, I did a first test on the datamanager, adding the correction tasks together with the checksum, but they take too much time because each task runs over all the runs, and in the meantime the runs are waiting to be verified.
I'll try to create a new cax session in parallel to check whether it works in a reasonable time.

lucrlom commented Nov 3, 2017

Test definitely negative! Even with a new cax process, AddElectronLifetime and AddGains take a huge amount of time to run over all the runs. On a single run each is fast, but `cax --once --config ...` took a very long time.
I think the bottleneck is that each task runs over all runs before moving on to the next task; it might be more efficient to do the opposite: for each run, do all tasks (see the sketch below the memory numbers).
In any case we saturate the RAM (in MB) on the datamanager machine:

```
             total       used       free     shared    buffers     cached
Mem:         20424      20423          0       2649          0      17792
-/+ buffers/cache:       2631      17792
Swap:         1024       1023          0
```
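
A minimal sketch of the reordering I mean (dummy task objects; the real cax task API differs, this only illustrates the iteration order):

```python
# Dummy sketch of the loop reordering (placeholder Task objects; the real
# cax task API differs, this only illustrates the iteration order).
class Task:
    def __init__(self, name):
        self.name = name

    def process(self, run):
        print(f"{self.name}: {run}")

tasks = [Task("AddChecksum"), Task("AddElectronLifetime"), Task("AddGains")]
runs = ["run_001", "run_002", "run_003"]

# Current behaviour (as I understand it): each task sweeps all runs, so
# every run waits for the full sweep before the next task starts on it.
for task in tasks:
    for run in runs:
        task.process(run)

# Proposed: finish all tasks for one run before moving to the next, so
# each run is fully verified as soon as possible.
for run in runs:
    for task in tasks:
        task.process(run)
```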

pdeperio commented Nov 8, 2017

I think this is related to #108 and #114, which we never understood, i.e.:

  1. Why is it looping over all runs per task? I thought it was looping over each task per run.
  2. Why does it skip tasks?

Please review that issue and PR.

lucrlom commented Nov 17, 2017

Hi, maybe I found a way to stop massive-cax from submitting thousands of useless jobs.

Basically, I added a check on the variables present in the RunsDB "processor" field, verifying that all entries of "correction_versions" are present.
Only if that is true does the code generate the script to submit the jobs.

Of course, another check is whether the processed and minitree files already exist in the local directories on Midway, and determining which host the code is running on (see the sketch below).
I still have to complete the code, but in the first test it works.
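
Roughly what the check looks like (a simplified sketch; the field layout, file names, and host test are placeholders for the real cax setup):

```python
# Simplified sketch of the pre-submission check (the run document layout,
# file names, and host test are placeholders for the real cax setup).
import os
import socket

def corrections_complete(run_doc, required_corrections):
    """True only if every expected entry is present in
    run_doc['processor']['correction_versions']."""
    versions = run_doc.get("processor", {}).get("correction_versions", {})
    return all(key in versions for key in required_corrections)

def outputs_present(run_name, processed_dir, minitree_dir):
    """Check whether the processed and minitree files already exist locally."""
    return (os.path.exists(os.path.join(processed_dir, run_name + ".root")) and
            os.path.exists(os.path.join(minitree_dir, run_name + ".root")))

def should_submit(run_doc, required_corrections, processed_dir, minitree_dir):
    if "midway" not in socket.gethostname():
        return False  # only generate submission scripts on Midway (placeholder)
    if not corrections_complete(run_doc, required_corrections):
        return False  # corrections not filled in yet, the job would be useless
    if outputs_present(run_doc["name"], processed_dir, minitree_dir):
        return False  # already processed, nothing to do
    return True
```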

XeBoris commented Nov 19, 2017

@lucrlom I think we can raise the memory on xetransfer for the virtual machine xe1t-datamanager if it helps. In any case I run two cax-like sessions (massive-cax and massive-ruciax) with the user xe1ttransfer. Each process needs ~12 GB of memory. I haven't yet understood why these processes need so much memory (it seems a lot to me).
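
To see where the memory goes, I could instrument the sessions with something like this (a rough sketch using psutil; how the two sessions appear in the process table is a guess, so the matched fragments may need adjusting):

```python
# Rough sketch: periodically log resident memory of the two sessions with
# psutil (the matched command-line fragments are a guess at how
# massive-cax/massive-ruciax appear in the process table).
import time
import psutil

def log_memory(fragments=("massive-cax", "massive-ruciax"), interval=60):
    while True:
        for p in psutil.process_iter(["pid", "cmdline", "memory_info"]):
            cmd = " ".join(p.info["cmdline"] or [])
            if any(frag in cmd for frag in fragments):
                rss_gb = p.info["memory_info"].rss / 1e9
                print(f"pid {p.info['pid']}: {rss_gb:.1f} GB RSS ({cmd[:60]})")
        time.sleep(interval)
```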
